We’re excited to announce the general availability of Databricks Runtime 5.0. Included in this release is Spark 2.4. This release offers substantial performance increases within key areas of the platform. Benchmarking workloads have shown a 16% improvement in total execution time and Databricks Delta benefits from substantial improvements to metadata caching, improving query latency by 30%. Beyond these powerful performance improvements we’ve packed this release with many new features and improvements. I’ll highlight some of these now.
Apache® Spark™ News
Before Spark 2.4, for manipulating the complex types directly, there were two typical solutions: 1) Exploding the nested structure into individual rows, and applying some functions, and then creating the structure again 2) Building a User Defined Function (UDF).
Databricks lowers the barrier to entry and/or time needed to work with large quantities of data in several ways:
We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release.
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
A common use case that we run into at Databricks is that customers looking to perform change data capture (CDC) from one or many sources into a set of Databricks Delta tables. These sources may be on-premises or in the cloud, operational transactional stores, or data warehouses. The common glue that binds them all is they have change sets generated:
Today, we’re excited to announce MLflow v0.7.0, released with new features, including a new MLflow R client API contributed by RStudio. A testament to MLflow’s design goal of an open platform with adoption in the community, RStudio’s contribution extends the MLflow platform to a larger R community of data scientists who use RStudio and R programming language. R is the third language supported in MLflow after Python and Java.
Since the Kubernetes cluster scheduler backend was initially introduced in Apache Spark 2.3, the community has been working on a few important new features that make Spark on Kubernetes more usable and ready for a broader spectrum of use cases. The Apache Spark 2.4 release comes with a number of new features, some of which are highlighted below:
When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g. peanut butter and jelly). A key technique to uncover associations between different items is known as market basket analysis. In your recommendation engine toolbox, the association rules generated by market basket analysis (e.g. if one purchases peanut butter, then they are likely to purchase jelly) is an important and useful technique. With the rapid growth e-commerce data, it is necessary to execute models like market basket analysis on increasing larger sizes of data. That is, it will be important to have the algorithms and infrastructure necessary to generate your association rules on a distributed platform. In this blog post, we will discuss how you can quickly run your market basket analysis using Apache Spark MLlib FP-growth algorithm on Databricks.
With the exponential growth of cameras and visual recordings, it is becoming increasingly important to operationalize and automate the process of video identification and categorization. Applications ranging from identifying the correct cat video to visually categorizing objects are becoming more prevalent. With millions of users around the world generating and consuming billions of minutes of video daily, you will need the infrastructure to handle this massive scale.