This new release is now available on PyPI, with docs online; you can install it with pip install mlflow, as described in the MLflow quickstart guide.
Apache® Spark™ News
Since the completion of the Human Genome Project in 2003, there has been an explosion in data fueled by a dramatic drop in the cost of DNA sequencing, from $3B for the first genome to under $1,000 today.
Big data practitioners grapple with data quality issues and data pipeline complexities; these are the bane of their existence. Whether you are chartered with advanced analytics, developing new machine learning models, providing operational reporting or managing the data infrastructure, the concern with data quality is a common theme. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post, we show how to build more reliable pipelines in Databricks by integrating with Confluent Schema Registry. This feature has been available since Databricks Runtime 4.2.
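To make the Schema Registry integration more concrete, here is a minimal sketch of the framing that Confluent serializers put in front of each Avro payload: one magic byte (0) followed by a 4-byte big-endian schema ID. This header is why a consumer has to look up the schema in the registry before it can decode the Avro bytes. The function names below are hypothetical and exist only for illustration; they are not part of Spark, Databricks, or any Confluent client, and in Databricks the integration is meant to handle this step for you.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0      # first byte of a Confluent-framed message
HEADER_SIZE = 5     # 1 magic byte + 4-byte schema ID


def build_confluent_frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro payload with the magic byte and big-endian schema ID."""
    return struct.pack(">BI", MAGIC_BYTE, schema_id) + avro_payload


def split_confluent_frame(message: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_payload) from a Confluent-framed message."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message too short to contain a Confluent header")
    magic, schema_id = struct.unpack(">BI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, message[HEADER_SIZE:]


if __name__ == "__main__":
    # The payload here is placeholder bytes, not a real Avro record.
    frame = build_confluent_frame(42, b"\x02\x06foo")
    schema_id, payload = split_confluent_frame(frame)
    print(schema_id, payload)
```

The schema ID recovered this way is what a client would use to fetch the writer schema from the registry; decoding the remaining bytes as Avro is a separate step that this sketch leaves out.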
We are excited to announce the release of Databricks Runtime 5.2 for Machine Learning. This release includes several new features and performance improvements to help developers easily use machine learning on the Databricks Unified Analytics Platform.
Behind these groundbreaking innovations is a small but fast-growing group of talented engineers, developers, and data scientists with deep knowledge of Apache Spark. Armed with expertise in Spark and related technologies like TensorFlow, you can change the trajectory of not only your business but also your career path (check out the upcoming Spark training opportunities at Spark + AI Summit). To that end, here are the top 5 reasons to become a Spark guru.
As leveraging data becomes a more vital component of organizations’ tech stacks, it becomes increasingly important for data teams to make use of software engineering best practices. The Databricks platform provides excellent tools for exploratory Apache Spark workflows in notebooks as well as scheduled jobs. But for our production-level data jobs, our team wanted to leverage version control systems like GitHub and CI/CD tools like Jenkins alongside the cluster management capabilities of Databricks. Databricks supports notebook CI/CD concepts (as noted in the post Continuous Integration & Continuous Delivery with Databricks), but we wanted a solution that would allow us to use our existing CI/CD setup both to update scheduled jobs to new library versions and to make those same libraries available in the UI for use with interactive clusters.
It’s been six months since we launched MLflow, an open source platform to manage the machine learning lifecycle, and the project has been moving quickly since then. MLflow fills a role that hasn’t been served well in the open source community so far: managing the development lifecycle for ML, including tracking experiments and metrics, building reproducible production pipelines, and deploying ML applications. Although many companies build custom internal platforms for these tasks, there was no open source ML platform before MLflow, and we discovered that the community was excited to build one: since last June, 66 developers from over 30 different companies have contributed to MLflow.
Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to provide developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, our PyTorch addition lets a developer simply import the appropriate Python torch modules and start coding, without installing any of its many dependencies. In this blog, we briefly cover these additions.