Apache® Spark™ News

Detecting Bias with SHAP

The tech industry is painfully aware that it does not always live up to its purported meritocratic ideals. Pay isn’t a pure function of merit, and story after story tells us that factors like name-brand school, age, race, and gender affect outcomes like salary.

Announcing the MLflow 1.0 Release

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package is now downloaded close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt

Hyperparameter tuning is a common technique for optimizing machine learning models by adjusting hyperparameters, the configurations that are not learned during model training. Tuning these configurations can dramatically improve model performance. However, hyperparameter tuning can be computationally expensive, slow, and unintuitive even for experts.
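To make the distinction between learned parameters and hyperparameters concrete, here is a minimal sketch in plain Python. It uses an exhaustive grid search, not Hyperopt or MLlib, and the toy data, the `fit` and `validation_loss` helpers, and the grid values are all assumptions chosen for illustration.

```python
from itertools import product

# Toy data generated from y = 3x (an assumption for illustration only).
train = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
valid = [(x, 3.0 * x) for x in (4.0, 5.0)]

def fit(learning_rate, steps):
    """Learn the weight w by gradient descent. w is a learned parameter;
    learning_rate and steps are hyperparameters fixed before training."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= learning_rate * grad
    return w

def validation_loss(w):
    """Mean squared error on held-out data."""
    return sum((w * x - y) ** 2 for x, y in valid) / len(valid)

# Hypothetical search space: every combination is trained and scored.
grid = {"learning_rate": [0.001, 0.01, 0.1], "steps": [5, 50]}
results = []
for lr, steps in product(grid["learning_rate"], grid["steps"]):
    w = fit(lr, steps)
    results.append((validation_loss(w), lr, steps))

# Keep the combination with the lowest validation loss.
best_loss, best_lr, best_steps = min(results)
print(best_lr, best_steps, best_loss)
```

Even this toy grid trains six models to choose two values, which hints at why the cost grows quickly as more hyperparameters and candidate values are added.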

Advanced Analytics with HyperLogLog Functions in Apache Spark

Pre-aggregation is a common technique in the high-performance analytics toolbox. For example, 10 billion rows of website visitation data per hour may be reducible to 10 million rows of visit counts, aggregated by the superset of dimensions used in common queries. That is a 1000x reduction in data processing volume, with a corresponding decrease in processing costs and in the time spent waiting for query results. Further improvements could come from computing higher-level aggregates, e.g., by day in the time dimension or by site rather than by URL.
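The idea can be sketched in a few lines of plain Python. This is only an illustration of pre-aggregating visit counts and then rolling them up to a coarser level; the rows, URLs, and variable names are hypothetical, and a real workload would run this as a Spark aggregation rather than in memory.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical raw rows, one per visit: (hour, url).
raw_visits = [
    ("2019-05-01T10", "https://a.example/home"),
    ("2019-05-01T10", "https://a.example/home"),
    ("2019-05-01T10", "https://a.example/pricing"),
    ("2019-05-01T11", "https://a.example/home"),
    ("2019-05-02T09", "https://b.example/docs"),
    ("2019-05-02T09", "https://b.example/docs"),
]

# Level 1: one row of visit counts per (hour, url) combination.
hourly_by_url = Counter(raw_visits)

# Level 2: roll up to (day, site) using only the level-1 aggregate,
# so the raw rows never need to be scanned again.
daily_by_site = Counter()
for (hour, url), visits in hourly_by_url.items():
    day = hour.split("T")[0]
    site = urlparse(url).netloc
    daily_by_site[(day, site)] += visits

print(len(raw_visits), len(hourly_by_url), len(daily_by_site))
```

Plain visit counts roll up this way because they are additive: the count for a day is the sum of the counts for its hours.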

Spark + AI Summit 2019 Product Announcements and Recap. Watch the keynote recordings today!

Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ Community, brought nearly 5000 registered data scientists, engineers, and business leaders to San Francisco’s Moscone Center to find out what’s coming next. Watch the keynote recordings today and learn more about the latest product announcements for Apache Spark, MLflow, and our newest open source addition, Delta Lake!

Detecting Financial Fraud at Scale with Decision Trees and MLflow on Databricks

Detecting fraudulent patterns at scale is a challenge, no matter the use case. The massive amounts of data to sift through, the complexity of the constantly evolving techniques, and the very small number of actual examples of fraudulent behavior are comparable to finding a needle in a haystack while not knowing what the needle looks like. In the world of finance, the added concerns with security and the importance of explaining how fraudulent behavior was identified further increase the complexity of the task.

Understanding Dynamic Time Warping

This post is part 1 of our two-part series, Using Dynamic Time Warping and MLflow to Detect Sales Trends. For part 2, see Using Dynamic Time Warping and MLflow to Detect Sales Trends.

A Guide to Healthcare and Life Sciences Talks at Spark + AI Summit 2019

Data and AI are ushering in a new era of precision medicine. The scale of the cloud, combined with advancements in machine learning, is enabling healthcare and life sciences organizations to use their mountains of data (such as electronic health records, genomics, real-world evidence, claims, and more) to drive innovation across the entire ecosystem, from accelerating drug discovery to preventing chronic disease.