Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
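To give a rough feel for what a rule-based tree rewrite looks like, here is a minimal sketch in plain Python. It is not Catalyst's actual code or API: the `Literal`, `Attribute`, and `Add` node types, the `transform_up` helper, and both rules are hypothetical names invented for illustration, and `isinstance` checks stand in for Scala's pattern matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# Hypothetical expression nodes for a toy query plan (not Catalyst classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then try the rule on the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    rewritten = rule(expr)
    return expr if rewritten is None else rewritten


def fold_constants(expr: Expr) -> Optional[Expr]:
    """Add(Literal(a), Literal(b)) becomes Literal(a + b)."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return None


def drop_zero(expr: Expr) -> Optional[Expr]:
    """Adding a literal zero is a no-op, so keep only the other operand."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return None


def optimize(expr: Expr, rules: List[Rule]) -> Expr:
    """Apply each rule once over the whole tree, in the order given."""
    for rule in rules:
        expr = transform_up(expr, rule)
    return expr


plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(plan, [fold_constants, drop_zero])
```

In this sketch each rule is a small function that inspects one node, so adding an optimization means adding a function to the list passed to `optimize`. That is the sense of "extensible" the sketch is meant to illustrate.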
Apache® Spark™ News
We are delighted about the success of the first Spark Summit East, held in New York City on March 18th. The summit was attended by a sold-out crowd of over 900 people from more than 300 organizations.
Yesterday, to celebrate Spark’s fifth birthday, we looked back at the history of the project. Today, we are happy to announce the next major chapter of Spark development: an architectural overhaul designed to enable Spark on mobile devices. Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Already today, 100% of Spark’s users have mobile phones.
Today, we’re celebrating an important milestone for the Spark project — it’s now been five years since Spark was first open sourced. When we first decided to release our research code at UC Berkeley, none of us knew how far Spark would make it, but we believed we had built some really neat technology that we wanted to share with the world. In the five years since, we’ve been simply awed by the numerous contributors and users that have made Spark the leading-edge computing framework it is today. Indeed, to our knowledge, Spark has now become the most active open source project in big data (looking at either contributors per month or commits per month). In addition to contributors, it has built up an array of hundreds of production use cases from batch analytics to stream processing.
In this article, we are going to discuss these improvements in more detail.
With Spark 1.3, MLlib now supports Latent Dirichlet Allocation (LDA), one of the most successful topic models. LDA is also the first MLlib algorithm built upon GraphX. In this blog post, we provide an overview of LDA and its use cases, and we explain how GraphX was a natural choice for implementation.
The Spark 1.3 release represents a major milestone for Spark SQL. In addition to several major features, we are very excited to announce that the project has officially graduated from Alpha, after being introduced only a little under a year ago. In this blog post we will discuss exactly what this step means for compatibility moving forward, as well as highlight some of the major features of the release.
This is a guest blog from Matt Kalan, a Senior Solution Architect at MongoDB.
Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.
Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.