Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to providing developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, our PyTorch addition lets a developer simply import the appropriate torch modules and start coding, without installing its many dependencies. In this blog, we briefly cover these additions.
This new release is now available on PyPI, with docs online; you can install it with `pip install mlflow`, as described in the MLflow quickstart guide.
With recent advances in deep learning frameworks for image classification and object detection, the demand for standard image processing in Apache Spark has never been greater. Image handling and preprocessing present their own challenges: images come in different formats (e.g., JPEG, PNG), sizes, and color schemes, and there is no easy way to test for correctness, so failures can be silent.
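One way to guard against silent failures is to validate raw bytes before attempting to decode them. Here is a minimal sketch in plain Python using magic-byte checks; the helper name is our own illustration, not part of Spark's image data source:

```python
def detect_image_format(data: bytes) -> str:
    """Detect a few common image formats from their leading magic bytes.

    Illustrative only: a real pipeline would also verify the full decode.
    """
    if data.startswith(b"\xff\xd8\xff"):        # JPEG SOI marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):   # PNG signature
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):      # GIF signatures
        return "gif"
    return "unknown"
```

Running a check like this before decoding turns a silent failure into an explicit "unknown format" that can be filtered or logged.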
In this blog, we examine each of the above features through examples, giving you a flavor of their easy API usage, performance improvements, and merits.
We’re excited to announce the general availability of Databricks Runtime 5.0. Included in this release is Spark 2.4. This release offers substantial performance increases in key areas of the platform: benchmarking workloads have shown a 16% improvement in total execution time, and Databricks Delta benefits from substantial improvements to metadata caching, improving query latency by 30%. Beyond these powerful performance improvements, we’ve packed this release with many new features and improvements. I’ll highlight some of these now.
Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some function, and then recreating the structure; or 2) building a user-defined function (UDF).
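The explode-and-regroup pattern can be sketched in plain Python; the data and names here are illustrative (in Spark you would use `explode`, a transformation, and an aggregation such as `collect_list`):

```python
from collections import defaultdict

# Rows hold a key and a nested array; we want to add 1 to every element.
rows = [("a", [1, 2, 3]), ("b", [4, 5])]

# 1) Explode: one output row per nested element.
exploded = [(key, x) for key, xs in rows for x in xs]

# 2) Apply the function to each exploded row.
transformed = [(key, x + 1) for key, x in exploded]

# 3) Regroup: rebuild the nested structure per key.
groups = defaultdict(list)
for key, x in transformed:
    groups[key].append(x)
result = sorted(groups.items())
# result == [("a", [2, 3, 4]), ("b", [5, 6])]
```

The cost of this pattern is the shuffle implied by regrouping, which is one motivation for the higher-order functions added in Spark 2.4.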
Databricks lowers both the barrier to entry and the time needed to work with large quantities of data, in several ways:
We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release.
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
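The pivot semantics can be sketched in plain Python; this hand-rolled illustration is not Spark's API (in Spark you would call something like `df.groupBy("year").pivot("course").sum("earnings")`), and the sample data is our own:

```python
# Long-format rows: (year, course, earnings)
rows = [
    (2012, "dotNET", 10000),
    (2012, "Java",   20000),
    (2013, "dotNET", 48000),
    (2013, "Java",   30000),
]

# Pivot on "course": its unique values become columns, summing earnings.
courses = sorted({course for _, course, _ in rows})
pivoted = {}
for year, course, earnings in rows:
    row = pivoted.setdefault(year, dict.fromkeys(courses, 0))
    row[course] += earnings
# pivoted[2012] == {"Java": 20000, "dotNET": 10000}
```

Each unique value in the pivot column becomes its own column in the result, with an aggregate (here a sum) filling the cells.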
A common use case that we run into at Databricks is customers looking to perform change data capture (CDC) from one or many sources into a set of Databricks Delta tables. These sources may be on-premises or in the cloud, operational transactional stores, or data warehouses. The common glue that binds them all is that they have change sets generated:
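Applying a change set to a target table is essentially an upsert-and-delete merge. A minimal sketch in plain Python, with an illustrative change-set shape of our own devising (in Databricks Delta this is expressed declaratively with `MERGE INTO`):

```python
# Target table keyed by id; a change set carries upserts and deletes.
target = {1: {"name": "alice"}, 2: {"name": "bob"}}

change_set = [
    ("upsert", 2, {"name": "bobby"}),   # update an existing row
    ("upsert", 3, {"name": "carol"}),   # insert a new row
    ("delete", 1, None),                # remove a row
]

def apply_changes(table, changes):
    """Apply a CDC change set: upserts overwrite or insert, deletes remove."""
    for op, key, row in changes:
        if op == "upsert":
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table

result = apply_changes(dict(target), change_set)
# result == {2: {"name": "bobby"}, 3: {"name": "carol"}}
```

The ordering of changes matters: applying the same change set twice should be idempotent, which upsert/delete semantics give you for free.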