Apache® Spark™ News

An Update on Project Zen: Improving Apache Spark for Python Users

Apache Spark™ has reached its 10th anniversary with Apache Spark 3.0 which has many significant improvements and new features including but not limited to type hint support in pandas UDF, better error handling in UDFs, and Spark SQL adaptive query execution. It has grown to be one of the most successful open-source projects as the de facto unified engine for data science.  In fact, Apache Spark has now reached the plateau phase of the Gartner Hype cycle in data science and machine learning pointing to its enduring strength.

Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0

Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee provided a recap of Delta Lake 0.7.0 and answered your Delta Lake questions.  The theme for this AMA was the release of Delta Lake 0.7.0 coincided with the release of Apache Spark 3.0 thus enabling a new set of features that were simplified using Delta Lake from SQL.

Profit-Driven Retention Management with Machine Learning

Companies with the highest loyalty ratings and retention rates grew revenues 250% faster than their industry peers and delivered two to five times the shareholder returns over a 10 year period. Earning loyalty and getting the largest number of customers to stick around is something that is in the best interest of both a company and its customer base.

A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many basic data types, like integer, long, double, string, etc. Spark also supports more complex data types, like the Date and Timestamp, which are often difficult for developers to understand. In this blog post, we take a deep dive into the Date and Timestamp types to help you fully understand their behavior and how to avoid some common issues. In summary, this blog covers four parts:

Analyzing Customer Attrition in Subscription Models

The subscription model is experiencing a renaissance.  Gone are the days of the penny music CD clubs, replaced by an ever-increasing assortment of digital streaming services delivering music, videos and more directly to consumers’ devices in exchange for a modest recurring fee. Today, 70% of US households subscribe to at least one subscription streaming service with an average of 3.4 such subscriptions per subscriber household.

A data-driven approach to Environmental, Social and Governance

The future of finance goes hand in hand with social responsibility, environmental stewardship and corporate ethics. In order to stay competitive, Financial Services Institutions (FSI)  are increasingly  disclosing more information about their environmental, social and governance (ESG) performance. By better understanding and quantifying the sustainability and societal impact of any investment in a company or business, FSIs can mitigate reputation risk and maintain the trust with both their clients and shareholders. At Databricks, we increasingly hear from our customers that ESG has become a C-suite priority. This is not solely driven by altruism but also by economics: Higher ESG ratings are generally positively correlated with valuation and profitability while negatively correlated with volatility. In this blog post, we offer a novel approach to sustainable investing by combining natural language processing (NLP) techniques and graph analytics to extract key strategic ESG initiatives and learn companies’ relationships in a global market and their impact to market risk calculations.