Apache® Spark™ News

Apache Spark selected for InfoWorld 2015 Technology of the Year Award

InfoWorld recently unveiled the 2015 Technology of the Year Award winners, which range from open source software to stellar consumer technologies like the iPhone. As the creators of Spark and the driving force behind it, we at Databricks are thrilled to see Spark among them. In fact, we built our flagship product, Databricks Cloud, on top of Spark with the ambition to revolutionize big data processing in much the same way the iPhone revolutionized the mobile experience.

An introduction to JSON support in Spark SQL

In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de facto interchange format for web service APIs as well as for long-term storage. With existing tools, users often engineer complex pipelines to read and write JSON data sets within analytical systems. Spark SQL’s JSON support, released in Spark 1.1 and enhanced in Spark 1.2, vastly simplifies the end-to-end experience of working with JSON data.

Spark Summit East 2015 Agenda is Now Available

We are thrilled to announce the availability of the agenda for Spark Summit East 2015! This inaugural New York City event on March 18-19, 2015, has over thirty jam-packed sessions, offering a combination of longer deep-dive presentations and shorter intensive talks. You will have the opportunity to engage with the speakers and your peers in discussion and a cross-pollination of ideas.

Improved Fault-tolerance and Zero Data Loss in Spark Streaming

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could be lost while recovering from those failures. In Spark 1.2, we have added preliminary support for write-ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog post, we elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications.
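
The idea behind a write-ahead log can be sketched in a few lines of plain Python. This is a toy illustration of journaling in general, not Spark Streaming's implementation: the class `WriteAheadLog`, the helper `receive`, and the file name `received.log` are all hypothetical, and a local file stands in for the fault-tolerant storage a real deployment would need.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy append-only journal (hypothetical, not a Spark API)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Write the record and force it to disk before returning to the caller.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def replay(self):
        # Return every journaled record in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def receive(wal, buffer, record):
    # Journal first, then keep the record in volatile memory for processing.
    wal.append(record)
    buffer.append(record)


def demo():
    log_path = os.path.join(tempfile.mkdtemp(), "received.log")
    wal = WriteAheadLog(log_path)
    in_memory = []
    for event in ({"id": 1, "value": "a"}, {"id": 2, "value": "b"}):
        receive(wal, in_memory, event)

    # Simulate a failure: the in-memory buffer is lost, the log file survives.
    in_memory = []

    # Recovery: rebuild the received data from the journal.
    return WriteAheadLog(log_path).replay()
```

Calling `demo()` returns the two events that were received before the simulated failure. The ordering is the essential design choice: because each record is made durable before it is buffered for processing, losing the in-memory copy does not lose the data.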

Spark SQL Data Sources API: Unified Data Access for the Spark Platform

Since the inception of Spark SQL in Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform. Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as JSON. In Spark 1.2, we’ve taken the next step to allow Spark to integrate natively with a far larger number of input sources. These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API.

Announcing Spark Packages

Today, we are happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages.

Announcing Spark 1.2

We at Databricks are thrilled to announce the release of Spark 1.2! Spark 1.2 introduces many new features along with scalability, usability and performance improvements. This post will introduce some key features of Spark 1.2 and provide context on the priorities of Spark for this and the next release. In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Spark 1.2 has been posted today on the Apache Spark website.

Samsung SDS uses Spark for prescriptive analytics at large scale

Samsung SDS is the business and IT solutions arm of Samsung Group. A global ICT service provider with over 17,000 employees worldwide and USD 6.7 billion in revenue, Samsung SDS tackles the challenges of some of the largest global enterprises in industries such as manufacturing, financial services, health care and retail.