Apache® Spark™ News

Faster SQL: Adaptive Query Execution in Databricks

Earlier this year, Databricks wrote a blog on the whole new Adaptive Query Execution framework in Spark 3.0 and Databricks Runtime 7.0. The blog has sparked a great amount of interest and discussions from tech enthusiasts. Today, we are happy to announce that Adaptive Query Execution (AQE) has been enabled by default in our latest release of Databricks Runtime, DBR 7.3.

Analyzing Algorand Blockchain Data with Databricks Delta

Algorand is a public, decentralized blockchain system that uses a proof of stake consensus protocol. It is fast and energy-efficient, with a transaction commit time under 5 seconds and throughput of one thousand transactions per second. The Algorand system is composed of a network of distributed nodes that work collaboratively to process transactions and add blocks to its distributed ledger.

Diving Into Delta Lake: DML Internals (Update, Delete, Merge)

In previous blogs Diving Into Delta Lake: Unpacking The Transaction Log and Diving Into Delta Lake: Schema Enforcement & Evolution, we described how the Delta Lake transaction log works and the internals of schema enforcement and evolution.  Delta Lake supports DML (data manipulation language) commands including `DELETE`, `UPDATE`, and `MERGE`. These commands simplify change data capture (CDC), audit and governance, and GDPR/CCPA workflows, among others. In this post, we will demonstrate how to use each of these DML commands, describe what Delta Lake is doing behind the scenes when you run one, and offer some performance tuning tips for each one.  More specifically:

It’s an ESG World and We’re Just Living in it

The future of finance goes hand in hand with socially responsible investing, environmental stewardship, and corporate ethics. In order to stay competitive, Financial Services Institutions (FSI) are increasingly disclosing more information about their environmental, social, and corporate governance (ESG) performance. Hence the increasing importance of ESG ratings and ESG scores to investment managers and institutional investors. In fact, the value of data-driven ESG global assets has increased to $40.5 trillion in 2020.

An Update on Project Zen: Improving Apache Spark for Python Users

Apache Spark™ has reached its 10th anniversary with Apache Spark 3.0 which has many significant improvements and new features including but not limited to type hint support in pandas UDF, better error handling in UDFs, and Spark SQL adaptive query execution. It has grown to be one of the most successful open-source projects as the de facto unified engine for data science.  In fact, Apache Spark has now reached the plateau phase of the Gartner Hype cycle in data science and machine learning pointing to its enduring strength.

Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0

Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee provided a recap of Delta Lake 0.7.0 and answered your Delta Lake questions.  The theme for this AMA was the release of Delta Lake 0.7.0 coincided with the release of Apache Spark 3.0 thus enabling a new set of features that were simplified using Delta Lake from SQL.

Profit-Driven Retention Management with Machine Learning

Companies with the highest loyalty ratings and retention rates grew revenues 250% faster than their industry peers and delivered two to five times the shareholder returns over a 10 year period. Earning loyalty and getting the largest number of customers to stick around is something that is in the best interest of both a company and its customer base.

A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many basic data types, like integer, long, double, string, etc. Spark also supports more complex data types, like the Date and Timestamp, which are often difficult for developers to understand. In this blog post, we take a deep dive into the Date and Timestamp types to help you fully understand their behavior and how to avoid some common issues. In summary, this blog covers four parts: