Apache® Spark™ News

The Spark Certified Developer program

More and more companies are using Apache Spark, and many Spark-based pilots are now moving into production. On social media and at every big data conference or meetup, people describe new POCs, prototypes, and production deployments built on Spark.

Spark officially sets a new record in large-scale sorting

A month ago, we shared with you our entry in the 2014 Gray Sort competition, a third-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and that we have officially won the Daytona GraySort contest!

Spark the fastest open source engine for sorting a petabyte

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce and deployed in clusters ranging from a handful of nodes to thousands. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improving the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should work just as well for petabytes.

Spark as a platform for large-scale neuroscience

The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. Its billions of neurons somehow work together to endow organisms with the extraordinary ability to interact with the world around them. Things our brains control effortlessly, such as kicking a ball or reading and understanding this sentence, have proven extremely hard to implement in a machine.

Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark

Two of Apache Spark's most powerful features are its native APIs in Scala, Java, and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient with Spark even without prior Scala experience, and can also leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python).
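To make this concrete, here is a minimal PySpark sketch of the Hadoop I/O support that landed in Spark 1.1; the HDFS paths and class choices are illustrative placeholders, not from the original post:

```python
from pyspark import SparkContext

sc = SparkContext(appName="hadoop-io-example")

# Read a Hadoop SequenceFile directly from Python; Writable keys and
# values are converted to Python objects automatically.
events = sc.sequenceFile("hdfs:///data/events.seq")  # hypothetical path

# Read any new-API Hadoop InputFormat by naming its classes.
logs = sc.newAPIHadoopFile(
    "hdfs:///data/logs",  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")

print(logs.take(2))
```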

Spark 1.1: The State of Spark Streaming

With Spark 1.1 recently released, we'd like to take this opportunity to feature one of the most popular Spark components, Spark Streaming, and highlight who is using it and why.

Announcing Spark 1.1

Today we’re thrilled to announce the release of Spark 1.1! Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Spark 1.1 and provide context on the priorities of Spark for this and the next release.

Statistics Functionality in Spark 1.1

One of our philosophies in Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate various components of a data pipeline.
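As a small illustration of what these built-in statistics look like from Python, here is a sketch using the MLlib statistics API introduced in this release; the data is made up for the example:

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="stats-example")

# A toy RDD of feature vectors (made-up data for illustration).
data = sc.parallelize([
    Vectors.dense([1.0, 10.0, 100.0]),
    Vectors.dense([2.0, 20.0, 200.0]),
    Vectors.dense([3.0, 30.0, 300.0]),
])

# Column-wise summary statistics, computed in one distributed pass.
summary = Statistics.colStats(data)
print(summary.mean())      # per-column means
print(summary.variance())  # per-column variances

# Pearson correlation matrix between columns.
print(Statistics.corr(data, method="pearson"))
```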

Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark

With the introduction of Spark SQL and the new Hive on Spark effort (HIVE-7292), we get asked a lot about our position on these two projects and how they relate to Shark. At the Spark Summit today, we announced that we are ending development of Shark and will focus our resources on Spark SQL, which will provide a superset of Shark’s features for existing Shark users to move forward to. In particular, Spark SQL will provide both a seamless upgrade path from the Shark 0.9 server and new features such as integration with general Spark programs.
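To give a sense of that integration, here is a minimal sketch against the Spark 1.1-era Python API, with made-up rows; it queries a table with SQL and then keeps processing the result with ordinary RDD operations:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="sql-example")
sqlContext = SQLContext(sc)

# Build a queryable table from an ordinary RDD (made-up rows).
people = sc.parallelize([Row(name="Alice", age=34),
                         Row(name="Bob", age=21)])
schemaPeople = sqlContext.inferSchema(people)
schemaPeople.registerTempTable("people")

# Query with SQL, then keep working with the result as a normal RDD;
# this is the "integration with general Spark programs" mentioned above.
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
print(adults.map(lambda row: row.name).collect())
```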