Apache® Spark™ News

Query Watchdog: Handling Disruptive Queries in Spark SQL

At Databricks, our users range from SQL Analysts who explore data through JDBC connections and SQL Notebooks to Data Engineers who orchestrate large-scale ETL jobs. While this is great for data democratization, one challenge associated with exploratory data analysis is handling rogue queries that appear as if they will finish but never actually will. These queries can be extremely slow, saturate cluster resources, and prevent others from sharing the same cluster.

Analyze one year of radio station songs aired with Apache Spark SQL, Apache Spark, Spotify, and Databricks

This article presents the year 2016 for four main French radio stations through fun SQL queries, then connects each song to the Spotify API to build each station's musical profile. We will use the Databricks Community Edition to visualize our data. All SQL queries and all results are available in this notebook. It's the "backstage" of this article, where the magic happens, so to speak.

Scaling Spark in the real world: performance and usability

A short and easy paper from the Databricks team to end the week. Given the pace of development in the Apache Spark world, a paper published in 2015 about enhancements to Spark will of course be a little dated. But this paper nicely captures some of the considerations in the transition from research project to commercial software – we see two years of that journey.

Apache Spark: A Unified Engine for Big Data Processing

The growth of data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads.

3 Things that Excite Spark’s Creator Coming in Spark 2.0

Matei Zaharia, the creator of Apache Spark, recently detailed three "exciting" improvements to the open source Big Data analytics project coming soon in version 2.0. Zaharia started Spark while pursuing his PhD at UC Berkeley. He's now an assistant professor of computer science at MIT and the CTO of Databricks Inc., a company he co-founded that now serves as the commercial steward of the popular data processing engine.

Spark 2.0 to Introduce New ‘Structured Streaming’ Engine

The folks at Databricks last week gave a glimpse of what's to come in Spark 2.0. Among the changes sure to capture the attention of Spark users is the new Structured Streaming engine, which leans on the Spark SQL API to simplify the development of real-time, continuous big data apps. In his keynote at Spark Summit East, Spark creator Matei Zaharia said the new Structured Streaming API, which will debut later this year in Spark 2.0, will enable the creation of applications that combine real-time, interactive, and batch components.