Today, we’re excited to announce MLflow v0.7.0, released with new features, including a new MLflow R client API contributed by RStudio. A testament to MLflow’s design goal of an open platform with adoption in the community, RStudio’s contribution extends the MLflow platform to the larger community of data scientists who use RStudio and the R programming language. R is the third language supported in MLflow, after Python and Java.
Apache® Spark™ News
Since the Kubernetes cluster scheduler backend was initially introduced in Apache Spark 2.3, the community has been working on a few important new features that make Spark on Kubernetes more usable and ready for a broader spectrum of use cases. The Apache Spark 2.4 release comes with a number of new features, some of which are highlighted below:
When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g. peanut butter and jelly). A key technique for uncovering associations between different items is known as market basket analysis. The association rules generated by market basket analysis (e.g. if one purchases peanut butter, then they are likely to purchase jelly) are an important and useful tool in your recommendation engine toolbox. With the rapid growth of e-commerce data, it is necessary to execute models like market basket analysis on increasingly large datasets. That is, it will be important to have the algorithms and infrastructure necessary to generate your association rules on a distributed platform. In this blog post, we will discuss how you can quickly run your market basket analysis using the Apache Spark MLlib FP-growth algorithm on Databricks.
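To make the idea of an association rule concrete, here is a minimal sketch in plain Python. It is not the FP-growth algorithm and not the Spark MLlib API: it counts item pairs by brute force on a tiny in-memory dataset, which is only workable at toy scale. The baskets, the function name `association_rules`, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"peanut butter", "bread"},
    {"milk", "bread"},
]


def association_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Brute-force pairwise rules: a stand-in for FP-growth at toy scale."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules, one per direction.
        for antecedent, consequent in ((a, b), (b, a)):
            # Confidence: of the baskets with the antecedent, the share
            # that also contain the consequent.
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


rules = association_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how often the pair appears at all, while confidence measures how reliably the consequent follows the antecedent. Counting every pair this way grows quickly with the number of distinct items, which is one reason a distributed algorithm such as FP-growth is used on larger data.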
With the exponential growth of cameras and visual recordings, it is becoming increasingly important to operationalize and automate the process of video identification and categorization. Applications ranging from identifying the correct cat video to visually categorizing objects are becoming more prevalent. With millions of users around the world generating and consuming billions of minutes of video daily, you will need the infrastructure to handle this massive scale.
The volume of data that data scientists face these days increases relentlessly, and we now find that a traditional, single-machine solution is no longer adequate to the demands of these datasets. Over the past few years, Apache Spark has become the standard for dealing with big-data workloads, and we think it offers data scientists huge potential for the analysis of large time series. We have developed Flint at Two Sigma to enhance Spark’s functionality for time series analysis. Flint is an open source library, available via Maven and PyPI.
In the last blog post, we demonstrated the ease with which you can get started with MLflow, an open-source platform to manage the machine learning lifecycle. In particular, we illustrated a simple Keras/TensorFlow model using MLflow and PyCharm. This time we explore a binary classification Keras network model. Using MLflow’s Tracking APIs, we will track metrics (accuracy and loss) during training and validation, comparing runs of baseline and experimental models. As before, we will use PyCharm and localhost to run all experiments.
Today, we’re excited to announce MLflow v0.5.0, MLflow v0.5.1, and MLflow v0.5.2, which were released last week with some new features. MLflow 0.5.2 is already available on PyPI and the docs are updated. If you run pip install mlflow as described in the MLflow quickstart guide, you will get the latest release.
This summer, I was a software engineering intern at Databricks on the Machine Learning (ML) Platform team. As part of my intern project, I built a set of MLflow apps that demonstrate MLflow’s capabilities and offer the community examples to learn from.
The SparkR UDF API transfers data back and forth between the Spark JVM and the R process. Inside the UDF function, the user gets a wonderful island of R with access to the entire R ecosystem. Unfortunately, the bridge between R and the JVM is far from efficient. It currently allows only one “car” to cross the bridge at a time, and the “car” here is a single field in a Row of a SparkDataFrame. It should be no surprise that traffic on the bridge is very slow.
In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform attribution. Attribution can be a fairly expensive process, and running attribution against constantly updating datasets is challenging without the right technology. Traditionally, this has not been an easy problem to solve as there are lots of things to reason about: