Apache® Spark™ News

Announcing Databricks Runtime 5.5 and Runtime 5.5 for Machine Learning

Databricks is pleased to announce the release of Databricks Runtime 5.5.  This release includes Apache Spark 2.4.3 along with several important improvements and bug fixes as noted in the latest release notes [Azure|AWS].  We recommend all users upgrade to take advantage of this new runtime release.  This blog post gives a brief overview of some of the new high-value features that increase performance, compatibility, manageability and simplifying machine learning on Databricks.

Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers

In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers are now able to scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will lead to a deeper understanding of the root causes of disease that will lead to treatments for some of today’s most important health problems. However, the tools to analyze these data sets have not kept pace with the growth in data.

Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL

At Databricks we have leveraged innovations in distributed computation, storage, and cloud infrastructure and applied them to genomics to help solve problems that have hindered the ability for organizations to perform joint-genotyping, the “N + 1” problem, and the challenge of scaling to population-level cohorts. Our Unified Analytics Platform for Genomics provides an optimized pipeline that scales to massive clusters and thousands of samples with a single click. In this blog, we explore how to apply those innovations to joint genotyping.

Simplifying Streaming Stock Analysis using Delta Lake and Apache Spark: On-Demand Webinar and FAQ Now Available!

On June 13th, we hosted a live webinar — Simplifying Streaming Stock Analysis using Delta Lake and Apache Spark — with Junta Nakai, Industry Leader – Financial Services at Databricks, John O’Dwyer, Solution Architect at Databricks, and Denny Lee, Technical Product Marketing Manager at Databricks. This is the first webinar in a series of financial services webinars from Databricks and is an extension of the blog post Simplify Streaming Stock Data Analysis Using Delta Lake.

Detecting Bias with SHAP

The tech industry is also painfully aware that it does not always live up to its purported meritocratic ideals. Pay isn’t a pure function of merit, and story after story tells us that factors like name-brand school, age, race, and gender have an effect on outcomes like salary.

Announcing the MLflow 1.0 Release

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt

Hyperparameter tuning is a common technique to optimize machine learning models based on hyperparameters, or configurations that are not learned during model training.  Tuning these configurations can dramatically improve model performance. However, hyperparameter tuning can be computationally expensive, slow, and unintuitive even for experts.

Announcing the MLflow 1.0 Release

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

Advanced Analytics with HyperLogLog Functions in Apache Spark

Pre-aggregation is a common technique in the high-performance analytics toolbox. For example, 10 billion rows of website visitation data per hour may be reducible to 10 million rows of visit counts, aggregated by the superset of dimensions used in common queries, a 1000x reduction in data processing volume with a corresponding decrease in processing costs and waiting time to see the result of any query. Further improvements could come from computing higher-level aggregates, e.g., by day in the time dimension or by the site as opposed to by URL.

Spark + AI Summit 2019 Product Announcements and Recap. Watch the keynote recordings today!

Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ Community, brought nearly 5000 registered data scientists, engineers, and business leaders to San Francisco’s Moscone Center to find out what’s coming next. Watch the keynote recordings today and learn more about the latest product announcements for Apache Spark, MLflow, and our newest open source addition, Delta Lake!