At Databricks, our users range from SQL analysts who explore data through JDBC connections and SQL notebooks to data engineers who orchestrate large-scale ETL jobs. While this is great for data democratization, one challenge of exploratory data analysis is handling rogue queries that look as if they will finish but never actually do. These queries can be extremely slow, saturate cluster resources, and prevent others from sharing the same cluster.
Apache® Spark™ News
This article will present the year 2016 for four major French radio stations through fun SQL queries; we will then connect each song to the Spotify API to build each station's musical profile. We will use the Databricks Community Edition to visualize our data. All SQL queries and results are available in this notebook: the "backstage" of this article, where the magic happens, so to speak.
In 2016, Apache Spark released its second major version, 2.0, and outgrew our wildest expectations: 4x growth in meetup members, reaching 240,000 globally, and 2x growth in code contributors, reaching 1,000.
A short and easy paper from the Databricks team to end the week. Given the pace of development in the Apache Spark world, a paper published in 2015 about enhancements to Spark will of course be a little dated. But this paper nicely captures some of the considerations in the transition from research project to commercial software – we see two years of that journey.
The growth of data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads.
In an interview with SiliconANGLE at the summit, Zaharia provided a primer on Spark, explained how he hopes to make it accessible to more mainstream business analysts, and gave his view on how open source business models are evolving. This is an edited version of the conversation.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
Thanks to an impressive grab bag of improvements in version 2.0, Spark's quasi-streaming solution has become more powerful and easier to manage.
Matei Zaharia, the creator of Apache Spark, recently detailed three "exciting" improvements to the open source Big Data analytics project coming soon in version 2. Zaharia started the Spark project while pursuing his PhD at UC Berkeley. He's now an assistant professor of computer science at MIT and the CTO of Databricks Inc., a company he co-founded that now serves as the commercial steward of the popular data processing engine.
The folks at Databricks last week gave a glimpse of what’s to come in Spark 2.0, and among the changes that are sure to capture the attention of Spark users is the new Structured Streaming engine that leans on the Spark SQL API to simplify the development of real-time, continuous big data apps. In his keynote at Spark Summit East last week, Spark creator Matei Zaharia said the new Structured Streaming API that will debut later this year in Spark 2.0 will enable the creation of applications that combine real-time, interactive, and batch components.