Apache® Spark™ News

Deep Dive into Spark SQL’s Catalyst Optimizer

Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.

A Look Back at Spark Summit East

We are delighted about the success of the first Spark Summit East, held in New York City on March 18th. The summit was attended by a sold-out crowd of over 900 people from more than 300 organizations.

Spark 2.0: Rearchitecting Spark for Mobile Platforms

Yesterday, to celebrate Spark’s fifth birthday, we looked back at the history of the project. Today, we are happy to announce the next major chapter of Spark development: an architectural overhaul designed to enable Spark on mobile devices. Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Already today, 100% of Spark’s users have mobile phones.

Spark Turns Five Years Old!

Today, we’re celebrating an important milestone for the Spark project — it’s now been five years since Spark was first open sourced. When we first decided to release our research code at UC Berkeley, none of us knew how far Spark would make it, but we believed we had built some really neat technology that we wanted to share with the world. In the five years since, we’ve been simply awed by the numerous contributors and users that have made Spark the leading-edge computing framework it is today. Indeed, to our knowledge, Spark has now become the most active open source project in big data (looking at either contributors per month or commits per month). In addition to contributors, it has built up an array of hundreds of production use cases from batch analytics to stream processing.

Topic modeling with LDA: MLlib meets GraphX

With Spark 1.3, MLlib now supports Latent Dirichlet Allocation (LDA), one of the most successful topic models. LDA is also the first MLlib algorithm built upon GraphX. In this blog post, we provide an overview of LDA and its use cases, and we explain how GraphX was a natural choice for implementation.

What’s new for Spark SQL in Spark 1.3

The Spark 1.3 release represents a major milestone for Spark SQL. In addition to several major features, we are very excited to announce that the project has officially graduated from Alpha, after being introduced only a little under a year ago. In this blog post we will discuss exactly what this step means for compatibility moving forward, as well as highlight some of the major features of the release.

Announcing Spark 1.3!

Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.