Apache® Spark™ News

Announcing Databricks Runtime 4.2!

We’re excited to announce Databricks Runtime 4.2, powered by Apache Spark™.  Version 4.2 includes updated Spark internals, new features, and major performance upgrades to Databricks Delta, as well as general quality improvements to the platform.  We are moving quickly toward the Databricks Delta general availability (GA) release and we recommend you upgrade to Databricks Runtime 4.2 to take advantage of these improvements.

How to Use MLflow, TensorFlow, and Keras with PyCharm

At Spark + AI Summit in June, we announced MLflow, an open-source platform for the complete machine learning lifecycle. The platform’s philosophy is simple: work with any popular machine learning library; allow machine learning developers to experiment with their models, preserve the training environment, parameters, and dependencies, and reproduce their results; and finally deploy, monitor, and serve them seamlessly, all in an open manner with limited constraints.

Analyze Games from European Soccer Leagues with Apache Spark and Databricks

The global sports market is huge, made up of players, teams, leagues, fan clubs, sponsors, etc., and all of these entities interact in myriad ways, generating an enormous amount of data. Some of that data is used internally to help make better decisions, and there are a number of use cases within the media industry that use the same data to create better products and to attract and retain viewers.

MLflow 0.2 Released

At this year’s Spark + AI Summit, we introduced MLflow, an open source platform to simplify the machine learning lifecycle. In the three weeks since the release, we’ve already seen a lot of interest from data scientists and engineers in using and contributing to MLflow. MLflow’s GitHub repository already has 180 forks, and over a dozen contributors have submitted issues and pull requests. In addition, close to 100 people came to our first MLflow meetup last week.

Build a Mobile Gaming Events Data Pipeline with Databricks Delta

The world of mobile gaming is fast-paced and requires the ability to scale quickly.  With millions of users around the world generating millions of events per second through gameplay, you will need to calculate key metrics (score adjustments, in-game purchases, in-game actions, etc.) in real time.  Just as important, a popular game launch or feature will increase event traffic by orders of magnitude, and you will need infrastructure that can handle this rapid scaling.

Highlights from the First Expanded Spark + AI Summit

Databricks hosted the first expanded Spark + AI Summit (formerly Spark Summit) at Moscone Center in San Francisco just a couple of weeks ago and the conference drew over 4,000 Apache Spark and machine learning enthusiasts.  The overall theme of unifying data + AI technologies and unifying data science + engineering organizations to accelerate innovation resonated in many sessions across Spark + AI Summit 2018, including the keynotes and more than 200 technical sessions on big data and machine learning.

Introducing Stream-Stream Joins in Apache Spark 2.3

Since we introduced Structured Streaming in Apache Spark 2.0, it has supported joins (inner joins and some types of outer joins) between a streaming and a static DataFrame/Dataset. With the release of Apache Spark 2.3.0, available in Databricks Runtime 4.0 as part of Databricks Unified Analytics Platform, we now support stream-stream joins. In this post, we will explore a canonical case of how to use stream-stream joins, what challenges we resolved, and what types of workloads they enable. Let’s start with the canonical use case for stream-stream joins: ad monetization.
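
To make the ad monetization case concrete, here is a minimal plain-Python sketch of the kind of condition such a join might evaluate: match each click to an impression of the same ad, but only if the click happens within a bounded time after the impression. This is not Spark code and not taken from the post; the function name, the record fields (`ad_id`, `time`), and the one-hour bound are all hypothetical and chosen only for illustration. In Structured Streaming the equivalent logic would be expressed as a join between two streaming DataFrames rather than over in-memory lists.

```python
from datetime import datetime, timedelta


def join_clicks_to_impressions(impressions, clicks, max_delay=timedelta(hours=1)):
    """Inner-join clicks to impressions on ad_id (hypothetical field names).

    A click matches an impression only if it occurs no earlier than the
    impression and no later than max_delay after it (assumed bound).
    """
    # Index impressions by ad_id so each click only scans candidate matches.
    impressions_by_ad = {}
    for impression in impressions:
        impressions_by_ad.setdefault(impression["ad_id"], []).append(impression)

    joined = []
    for click in clicks:
        for impression in impressions_by_ad.get(click["ad_id"], []):
            delay = click["time"] - impression["time"]
            if timedelta(0) <= delay <= max_delay:
                joined.append((click["ad_id"], impression["time"], click["time"]))
    return joined
```

Called with one impression for ad 1 at 12:00 and clicks for ad 1 at 12:30 and 14:00, the function keeps only the 12:30 click, since the later one falls outside the assumed one-hour window. Bounding the delay is what lets a streaming engine eventually discard old impressions instead of buffering them forever.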