Apache® Spark™ News

How to Work with Avro, Kafka, and Schema Registry in Databricks

In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post, we show how to build more reliable pipelines in Databricks by integrating with Confluent Schema Registry. This feature has been available since Databricks Runtime 4.2.
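As a minimal sketch of the underlying pattern (not the Schema Registry integration itself, which the full post covers), here is how from_avro decodes Avro-encoded Kafka records against a hand-written schema. This assumes PySpark 3.0+, where from_avro/to_avro are exposed in Python, plus the spark-sql-kafka and spark-avro packages; the broker address, topic, and schema are hypothetical. With the Schema Registry integration, the literal schema string is replaced by a registry lookup.

```python
# Minimal sketch; requires the spark-sql-kafka-0-10 and spark-avro packages.
# Broker address, topic, and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("avro-kafka-sketch").getOrCreate()

# The writer's Avro schema for the Kafka message value.
value_schema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
"""

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "users")
       .load())

# Decode the binary Avro value into typed columns.
decoded = (raw.select(from_avro(col("value"), value_schema).alias("user"))
              .select("user.*"))

# The console sink is just for demonstration; production pipelines
# would write to a durable sink instead.
query = decoded.writeStream.format("console").start()
```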

5 Reasons to Become an Apache Spark Expert

Behind these groundbreaking innovations is a small but fast-growing group of talented engineers, developers, and data scientists with deep knowledge of Apache Spark. Armed with expertise in Spark and related technologies like TensorFlow, you can change the trajectory of not only your business but also your career [check out: upcoming Spark training opportunities at Spark + AI Summit]. To that end, here are the top 5 reasons to become a Spark guru.

Apparate: Managing Libraries in Databricks with CI/CD

As leveraging data becomes a more vital component of organizations’ tech stacks, it becomes increasingly important for data teams to adopt software engineering best practices. The Databricks platform provides excellent tools for exploratory Apache Spark workflows in notebooks as well as for scheduled jobs. But for our production-level data jobs, our team wanted to combine version control systems like GitHub and CI/CD tools like Jenkins with the cluster management power of Databricks. Databricks supports notebook CI/CD concepts (as noted in the post Continuous Integration & Continuous Delivery with Databricks), but we wanted a solution that would let our existing CI/CD setup both update scheduled jobs to new library versions and make those same libraries available in the UI for use with interactive clusters.
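As a rough illustration of the two steps such a tool automates (this is not apparate’s actual implementation), the sketch below calls the Databricks REST API directly: upload a built library to DBFS, then repoint a scheduled job at the new version. The workspace URL, token, file names, and job settings are placeholders.

```python
# Rough sketch of what a library-management tool automates, via the
# Databricks REST API. Host, token, file names, and job settings
# are placeholders.
import base64
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <api-token>"}

# 1. Upload the freshly built library to DBFS. (The inline-contents form
#    of dbfs/put is limited to ~1 MB; larger files need the streaming API.)
with open("dist/mylib-1.2.3-py3-none-any.whl", "rb") as f:
    resp = requests.post(
        HOST + "/api/2.0/dbfs/put",
        headers=HEADERS,
        json={
            "path": "dbfs:/libraries/mylib-1.2.3-py3-none-any.whl",
            "contents": base64.b64encode(f.read()).decode("ascii"),
            "overwrite": True,
        },
    )
resp.raise_for_status()

# 2. Repoint an existing scheduled job at the new library version.
#    jobs/reset replaces the whole settings object, so a real tool must
#    fetch and preserve the job's other settings first.
resp = requests.post(
    HOST + "/api/2.0/jobs/reset",
    headers=HEADERS,
    json={
        "job_id": 123,  # hypothetical job id
        "new_settings": {
            "name": "nightly-etl",
            "libraries": [
                {"whl": "dbfs:/libraries/mylib-1.2.3-py3-none-any.whl"}
            ],
            # ...remaining job settings, carried over unchanged...
        },
    },
)
resp.raise_for_status()
```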

Kicking Off 2019 with an MLflow User Survey

It’s been six months since we launched MLflow, an open source platform to manage the machine learning lifecycle, and the project has been moving quickly. MLflow fills a role that hasn’t been served well in the open source community so far: managing the development lifecycle for ML, including tracking experiments and metrics, building reproducible production pipelines, and deploying ML applications. Although many companies build custom internal platforms for these tasks, there was no open source ML platform before MLflow, and we discovered that the community was excited to build one: since last June, 66 developers from over 30 different companies have contributed to MLflow.
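For the experiment-tracking piece, a minimal sketch (the parameter and metric names are illustrative, and this assumes the default local tracking backend):

```python
# Minimal tracking sketch; parameter and metric names are illustrative.
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # record a hyperparameter
    mlflow.log_metric("accuracy", 0.92)      # record a result

# The run, its params, and its metrics are now browsable and comparable
# in the MLflow UI.
```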

Introducing Databricks Runtime 5.1 for Machine Learning

Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to provide developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, the PyTorch addition lets a developer simply import the appropriate Python torch modules and start coding, without installing its myriad dependencies. In this blog, we briefly cover these additions.
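For example, with PyTorch preinstalled in the runtime, a notebook cell can use it immediately; a toy sketch (the layer sizes are arbitrary):

```python
# PyTorch ships preinstalled, so this runs with no pip install step.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # tiny linear layer: 4 inputs -> 2 outputs
x = torch.randn(8, 4)     # a batch of 8 random input vectors
y = model(x)              # forward pass
print(y.shape)            # torch.Size([8, 2])
```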

Introducing Built-in Image Data Source in Apache Spark 2.4

With recent advances in deep learning frameworks for image classification and object detection, the demand for standard image processing in Apache Spark has never been greater. Image handling and preprocessing come with specific challenges: for example, images arrive in different formats (e.g., JPEG, PNG), sizes, and color schemes, and there is no easy way to test for correctness, since failures are often silent.
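A minimal sketch of the built-in image data source (the input path is hypothetical): each image file is decoded into a single struct column, and the dropInvalid option discards files that fail to decode rather than failing silently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-sketch").getOrCreate()

# Read a directory of images; dropInvalid discards undecodable files
# instead of failing silently. The path is hypothetical.
images = (spark.read.format("image")
          .option("dropInvalid", True)
          .load("/data/images/"))

# Each row has one struct column `image` with decoded metadata and pixels.
images.select("image.origin", "image.height", "image.width",
              "image.nChannels", "image.mode").show(truncate=False)
```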

Announcing Databricks Runtime 5.0

We’re excited to announce the general availability of Databricks Runtime 5.0, which includes Apache Spark 2.4. This release offers substantial performance increases in key areas of the platform: benchmarking workloads have shown a 16% improvement in total execution time, and Databricks Delta benefits from substantial improvements to metadata caching, reducing query latency by 30%. Beyond these performance improvements, we’ve packed this release with many new features and enhancements, some of which I’ll highlight below.