Apache® Spark™ News

How to Use MLflow to Experiment a Keras Network Model: Binary Classification for Movie Reviews

In the last blog post, we demonstrated the ease with which you can get started with MLflow, an open-source platform to manage the machine learning lifecycle. In particular, we illustrated a simple Keras/TensorFlow model using MLflow and PyCharm. This time we explore a binary classification Keras network model. Using MLflow’s Tracking APIs, we will track metrics (accuracy and loss) during training and validation across runs of the baseline and experimental models. As before, we will use PyCharm and localhost to run all experiments.

New Features in MLflow v0.5.2 Release

Today, we’re excited to announce MLflow v0.5.0, MLflow v0.5.1, and MLflow v0.5.2, which were released last week with some new features. MLflow 0.5.2 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you will get the recent release.

100x Faster Bridge between Apache Spark and R with User-Defined Functions on Databricks

The SparkR UDF API transfers data back and forth between the Spark JVM and the R process. Inside the UDF, the user gets a wonderful island of R with access to the entire R ecosystem. Unfortunately, the bridge between R and the JVM is far from efficient. It currently allows only one “car” to cross at a time, and the “car” here is a single field in a Row of a SparkDataFrame. It should be no surprise that traffic on the bridge is very slow.

Building a Real-Time Attribution Pipeline with Databricks Delta

In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform attribution. Attribution can be a fairly expensive process, and running attribution against constantly updating datasets is challenging without the right technology. Traditionally, this has not been an easy problem to solve as there are lots of things to reason about:

Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning

For companies that make money from interest on loans held by their customers, it’s always about increasing the bottom line. Being able to assess the risk of loan applications can save a lender the cost of holding too many risky assets. It is the data scientist’s job to analyze customer data and define business rules that directly impact loan approval.

MLflow 0.4.2 Released

Today, we’re excited to announce MLflow v0.4.0, MLflow v0.4.1, and MLflow v0.4.2, which we released within the last week with some of the recently requested features. MLflow 0.4.2 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you will get the recent release.

Get Certified on Apache Spark™ with Databricks

In a world of rapidly changing products, companies investing in technology need well-trained experts to run it. Certifications are a key differentiator in a competitive job market because they validate your skills and expertise while keeping you relevant. In fact, certifications may impact career growth more than degrees, since business leaders perceive them as more valuable in developing careers than college courses.

MLflow v0.3.0 Released

Today, we’re excited to announce MLflow v0.3.0, which we released last week with some of the requested features from internal clients and open source users. MLflow 0.3.0 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you will get the recent release.