Apache® Spark™ News

Announcing Spark 1.0

Today, we’re very proud to announce the release of Apache Spark 1.0. Spark 1.0 is a major milestone for the Spark project that brings both numerous new features and strong API compatibility guarantees. The release is also a huge milestone for the Spark developer community: with more than 110 contributors over the past 4 months, it is Spark’s largest release yet, continuing a trend that has quickly made Spark the most active project in the Hadoop ecosystem.

Making Spark Easier to Use in Java with Java 8

One of Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Spark 1.0.

Spark 0.9.1 Released

We are happy to announce the availability of Spark 0.9.1! This is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release.

Spark Now a Top-level Apache Project

We are delighted with the recent announcement of the Apache Software Foundation that Spark has become a top-level Apache project. This is a recognition of the fantastic work done by the Spark open source community, which now counts over 140 developers from 30+ companies.

Spark 0.9.0 Released

Our goal with Apache Spark is very simple: provide the best platform for computation on big data. We do this through both a powerful core engine and rich libraries for useful analytics tasks. Today, we are excited to announce the release of Spark 0.9.0. This major release extends Spark’s libraries and further improves its performance and usability. Spark 0.9.0 is the largest release to date, with work from 83 contributors, who submitted over 300 patches.

Spark In MapReduce (SIMR)

Hadoop integration has always been a key goal of Spark and YARN users have long been able to run Spark on YARN. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. Enter SIMR (Spark In MapReduce), which has been released in conjunction with Spark 0.8.1.

Spark 0.8.1 Released

We are happy to announce the release of Apache Spark 0.8.1. In addition to performance and stability improvements, this release adds three new features. First, Spark now supports for the newest versions of YARN (2.2+). Second, the standalone cluster manager supports a high-availability mode in which it can tolerate master failures. Third, shuffles have been optimized to create fewer files, improving shuffle performance drastically in some settings.

Putting Spark to Use – Fast In-Memory Computing for Your Big Data Applications

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.