Apache® Spark™ News

Announcing SparkR: R on Spark

I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

Statistical and Mathematical Functions with DataFrames in Spark

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

Databricks Launches MOOC: Data Science on Spark

For the past several months, we have been working with professors from the University of California, Berkeley and the University of California, Los Angeles to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs will launch in June on the edX platform!

Spark Summit 2015 in San Francisco is just around the corner!

We’re proud to announce that the new Spark Summit website is live! It includes the full list of community talks along with the first set of keynotes. With over 260 submissions this year, the Program Committee had its work cut out narrowing the list to 54 talks. We would like to thank everyone who submitted, and we invite everyone to submit a presentation for a future Spark Summit (in case you have not heard yet, we are taking the Spark Summit to Amsterdam this fall).

Project Tungsten: Bringing Spark Closer to Bare Metal

In a previous blog post, we looked back and surveyed performance improvements made to Spark in the past year. In this post, we look forward and share with you the next chapter, which we are calling Project Tungsten. In 2014, Spark set the world record in large-scale sorting and saw major improvements across the entire engine, from Python to SQL to machine learning. Performance optimization, however, is a never-ending process.