Apache® Spark™ News

Diving Into Delta Lake: Schema Enforcement & Evolution

Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions – new ways of seeing things we had no conception of before. These mental models are not unlike a table’s schema, defining how we categorize and process new information.
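
As a quick, concrete illustration of the idea (a minimal PySpark sketch, not code from the post; the table path and column names are hypothetical), Delta Lake rejects appends whose schema does not match the existing table, and only evolves the schema when you opt in:

    # Minimal sketch of Delta Lake schema enforcement vs. evolution (PySpark).
    # The table path and column names are hypothetical.
    df_new = spark.createDataFrame(
        [(1, "2019-07-01", 42.0)],
        ["id", "date", "new_metric"],  # "new_metric" is not in the existing table
    )

    # Schema enforcement: this append fails because of the unexpected column.
    # df_new.write.format("delta").mode("append").save("/delta/events")

    # Schema evolution: opt in explicitly, and Delta adds the new column.
    (df_new.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("/delta/events"))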

Engineering population scale Genome-Wide Association Studies with Apache Spark™, Delta Lake, and MLflow

The advent of genome-wide association studies (GWAS) in the late 2000s enabled scientists to begin to understand the causes of complex diseases such as diabetes and Crohn’s disease at their most fundamental level. However, academic bioinformatics tools to perform GWAS have not kept pace with the growth of genomic data, which has been doubling globally every seven months.

Guest Blog: How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas

At Virgin Hyperloop One, we work on making Hyperloop a reality, so we can move passengers and cargo at airline speeds but at a fraction of the cost of air travel. In order to build a commercially viable system, we collect and analyze a large, diverse quantity of data, including Devloop Test Track runs, numerous test rigs, and various simulation, infrastructure, and socioeconomic data. Most of our scripts handling that data are written using Python libraries, with pandas as the main data processing tool that glues everything together. In this blog post, we want to share our experiences of scaling our data analytics with Koalas, achieving massive speedups with minor code changes.
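
To give a flavor of what that migration looks like (a simplified sketch, not code from Virgin Hyperloop One; the file name and column names are made up), Koalas mirrors the pandas API on top of Apache Spark, so in many cases only the import and constructor change:

    # Simplified sketch of moving a pandas workflow onto Spark with Koalas.
    # The CSV path and column names are hypothetical.
    import pandas as pd
    import databricks.koalas as ks

    # Original single-machine pandas code:
    pdf = pd.read_csv("runs.csv")
    pdf["speed_mps"] = pdf["distance_m"] / pdf["time_s"]
    summary_pd = pdf.groupby("rig")["speed_mps"].mean()

    # The same logic distributed across a cluster -- only the entry point changes:
    kdf = ks.read_csv("runs.csv")
    kdf["speed_mps"] = kdf["distance_m"] / kdf["time_s"]
    summary_ks = kdf.groupby("rig")["speed_mps"].mean()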

Diving Into Delta Lake: Unpacking The Transaction Log

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this article, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
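
One user-visible payoff of that log is time travel: because every commit is recorded, any earlier version of a table can be read back. A minimal PySpark sketch (the table path below is hypothetical):

    # Minimal sketch of Delta Lake time travel, enabled by the transaction log.
    # The table path is hypothetical.
    path = "/delta/events"

    # Read the table exactly as it was at an earlier committed version.
    df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Or as of a point in time.
    df_old = (spark.read.format("delta")
        .option("timestampAsOf", "2019-06-01")
        .load(path))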

Productionizing Machine Learning with Delta Lake

For many data scientists, the process of building and tuning machine learning models is only a small portion of the work they do every day. The vast majority of their time is spent doing the less-than-glamorous (but crucial) work of performing ETL, building data pipelines, and putting models into production.

Announcing Databricks Runtime 5.5 and Runtime 5.5 for Machine Learning

Databricks is pleased to announce the release of Databricks Runtime 5.5. This release includes Apache Spark 2.4.3 along with several important improvements and bug fixes, as noted in the latest release notes [Azure|AWS]. We recommend all users upgrade to take advantage of this new runtime release. This blog post gives a brief overview of some of the new high-value features that increase performance, compatibility, and manageability and simplify machine learning on Databricks.

Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers

In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers can now scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will yield a deeper understanding of the root causes of disease, leading to treatments for some of today's most important health problems. However, the tools to analyze these data sets have not kept pace with the growth in data.
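
As a rough sketch of how such readers are used from PySpark (the data source names follow the readers described in the post; the file paths are hypothetical), each genomic file becomes an ordinary Spark DataFrame:

    # Rough sketch: loading VCF and BGEN files as Spark DataFrames with the
    # readers described in the post. File paths are hypothetical.
    vcf_df = spark.read.format("vcf").load("/genomics/cohort.vcf.gz")
    bgen_df = spark.read.format("bgen").load("/genomics/ukbb_chr22.bgen")

    # Each row is a variant with per-sample genotypes nested in an array column,
    # so standard Spark SQL operations (filters, joins, aggregations) apply.
    vcf_df.printSchema()
    vcf_df.count()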

Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL

At Databricks, we have leveraged innovations in distributed computation, storage, and cloud infrastructure and applied them to genomics to help solve the problems that have hindered organizations' ability to perform joint genotyping: the "N + 1" problem and the challenge of scaling to population-level cohorts. Our Unified Analytics Platform for Genomics provides an optimized pipeline that scales to massive clusters and thousands of samples with a single click. In this blog, we explore how to apply those innovations to joint genotyping.

Simplifying Streaming Stock Analysis using Delta Lake and Apache Spark: On-Demand Webinar and FAQ Now Available!

On June 13th, we hosted a live webinar — Simplifying Streaming Stock Analysis using Delta Lake and Apache Spark — with Junta Nakai, Industry Leader – Financial Services at Databricks, John O’Dwyer, Solution Architect at Databricks, and Denny Lee, Technical Product Marketing Manager at Databricks. This is the first webinar in a series of financial services webinars from Databricks and is an extension of the blog post Simplify Streaming Stock Data Analysis Using Delta Lake.