Apache® Spark™ News

A Guide to Data Engineering Talks at Spark + AI Summit 2019

Big data practitioners grapple with data quality issues and data pipeline complexities; these are the bane of their existence. Whether you are charged with advanced analytics, developing new machine learning models, providing operational reporting or managing the data infrastructure, concern about data quality is a common theme. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.

How to Work with Avro, Kafka, and Schema Registry in Databricks

In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post, we show how to build more reliable pipelines in Databricks by integrating with Confluent Schema Registry. This feature has been available since Databricks Runtime 4.2.
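
As background on what a Schema Registry integration has to handle: Confluent's serializers prefix each Kafka message value with a magic byte (0) and a four-byte big-endian schema ID, followed by the Avro-encoded body. The sketch below is plain Python, not the Databricks or Spark implementation, and the helper name split_confluent_frame is hypothetical. It only separates the schema ID from the Avro payload; decoding the payload would still require fetching the schema from the registry.

```python
import struct
from typing import Tuple

# Confluent wire format: 1 magic byte (always 0) + 4-byte big-endian schema ID.
MAGIC_BYTE = 0
HEADER_SIZE = 5


def split_confluent_frame(message: bytes) -> Tuple[int, bytes]:
    """Split a Confluent-framed Kafka value into (schema_id, avro_payload).

    Hypothetical helper for illustration only; it does not decode Avro.
    """
    if len(message) < HEADER_SIZE:
        raise ValueError("message is shorter than the 5-byte Confluent header")
    # ">" = big-endian without padding, "B" = unsigned byte, "I" = unsigned 32-bit int
    magic, schema_id = struct.unpack(">BI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, message[HEADER_SIZE:]


if __name__ == "__main__":
    # Build a sample frame: schema ID 42 followed by placeholder payload bytes.
    frame = bytes([MAGIC_BYTE]) + (42).to_bytes(4, "big") + b"avro-bytes"
    print(split_confluent_frame(frame))
```

The schema ID returned here is what a consumer would look up in the registry to obtain the writer schema before decoding the remaining bytes.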

5 Reasons to Become an Apache Spark Expert

Behind these groundbreaking innovations is a small but fast-growing group of talented engineers, developers, and data scientists with deep knowledge of Apache Spark. Armed with expertise in Spark and related technologies like TensorFlow, you can change the trajectory of not only your business but also your career path [check out: upcoming Spark training opportunities at Spark + AI Summit]. To that end, here are the top 5 reasons to become a Spark guru.

Apparate: Managing Libraries in Databricks with CI/CD

As leveraging data becomes a more vital component of organizations’ tech stacks, it becomes increasingly important for data teams to adopt software engineering best practices. The Databricks platform provides excellent tools for exploratory Apache Spark workflows in notebooks as well as scheduled jobs. But for our production-level data jobs, our team wanted to combine version control systems like GitHub and CI/CD tools like Jenkins with the cluster management power of Databricks. Databricks supports notebook CI/CD concepts (as noted in the post Continuous Integration & Continuous Delivery with Databricks), but we wanted a solution that would allow us to use our existing CI/CD setup both to update scheduled jobs to new library versions and to have those same libraries available in the UI for use with interactive clusters.

Kicking Off 2019 with an MLflow User Survey

It’s been six months since we launched MLflow, an open source platform to manage the machine learning lifecycle, and the project has been moving quickly since then. MLflow fills a role that hasn’t been served well in the open source community so far: managing the development lifecycle for ML, including tracking experiments and metrics, building reproducible production pipelines, and deploying ML applications. Although many companies build custom internal platforms for these tasks, there was no open source ML platform before MLflow, and we discovered that the community was excited to build one: since last June, 66 developers from over 30 different companies have contributed to MLflow.

Introducing Databricks Runtime 5.1 for Machine Learning

Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to provide developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, our PyTorch addition lets a developer import the appropriate Python torch modules and start coding, without installing its myriad dependencies. In this blog, we briefly cover these additions.