Apache® Spark™ News

Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3

Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more and more enterprises facing these challenges are finding they can overcome the scalability and accuracy limits of past solutions.
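The pattern at the heart of this approach is to group the historical data by forecast dimension and let Spark fan the per-group Prophet fits out across the cluster. Below is a minimal sketch using Spark 3's applyInPandas; the table and column names (sales_history, store, item, ds, y) and the 90-day horizon are illustrative assumptions, not the article's exact code.

```python
import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Historical sales: one row per store, item, and date, with columns
# ds (date) and y (units sold). Table and column names are assumptions.
sales = spark.read.table("sales_history")

def forecast_store_item(history_pd: pd.DataFrame) -> pd.DataFrame:
    """Fit one Prophet model per store-item group and forecast 90 days ahead."""
    model = Prophet(interval_width=0.95)
    model.fit(history_pd[["ds", "y"]])
    future_pd = model.make_future_dataframe(periods=90, freq="d")
    forecast_pd = model.predict(future_pd)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
    forecast_pd["store"] = history_pd["store"].iloc[0]
    forecast_pd["item"] = history_pd["item"].iloc[0]
    return forecast_pd

# Spark 3's applyInPandas runs the per-group fits in parallel across the cluster.
forecasts = (
    sales.groupBy("store", "item")
    .applyInPandas(
        forecast_store_item,
        schema="ds timestamp, yhat double, yhat_lower double, yhat_upper double, store int, item int",
    )
)
forecasts.write.mode("overwrite").saveAsTable("sales_forecasts")
```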

Analyzing Algorand Blockchain Data With Databricks Delta (Part 2)

This article is the second part of a two-part blog. In part one, we demonstrated the analysis of operational telemetry data. In part two, we will show how to use Databricks to analyze the transactional aspects of the Algorand blockchain. A robust ecosystem of accounts, transactions and digital assets is essential for the health of the blockchain. Assets are digital tokens that represent reward tokens, cryptocurrencies, supply chain assets, etc. The price of the Algo digital currency reflects the intrinsic value of the underlying blockchain, and healthy transaction volume indicates user engagement.

Introducing Apache Spark™ 3.1

We are excited to announce the availability of Apache Spark 3.1 on Databricks as part of Databricks Runtime 8.0. We want to thank the Apache Spark™ community for all their valuable contributions to the Spark 3.1 release.

Amplify Insights into Your Industry With Geospatial Analytics

Data science is becoming commonplace, and most companies are leveraging analytics and business intelligence to help make data-driven business decisions. But are you supercharging your analytics and decision-making with geospatial data? Location intelligence, and specifically geospatial analytics, can help uncover important regional trends and behaviors that impact your business. This goes beyond looking at location data aggregated by ZIP code, which, interestingly, is often a poor representation of a true geographic boundary in the US and in other parts of the world.

Strategies for Modernizing Investment Data Platforms

The appetite for investment was at a historic high in 2020 for both individual and institutional investors. One study showed that “retail traders make up nearly 25% of the stock market following COVID-driven volatility”. Institutional investors have also piled into cryptocurrency, with 36% invested in the asset class, as outlined in Business Insider. As investors gain access to and trade alternative assets such as cryptocurrency, trading volumes have skyrocketed and created new data challenges. Moreover, cutting-edge research is no longer restricted to institutional investors on Wall Street; today’s world of investing extends to digital exchanges in Silicon Valley, data-centric market makers, and retail brokers that are investing increasingly in AI-powered tools for investors. Data lakes have become standard for building financial data products and research, but they come with a unique set of challenges.

Burning Through Electronic Health Records in Real Time With Smolder

In previous blogs, we looked at two separate workflows for working with patient data coming out of an electronic health record (EHR). Those workflows focused on a historical batch extract of EHR data. However, in the real world, data is continuously entered into an EHR. For many of the important predictive healthcare analytics use cases, like sepsis prediction or ER overcrowding, we need to work with the clinical data as it flows through the EHR.
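As a rough illustration of the streaming half of that workflow, the sketch below uses Spark Structured Streaming to land raw HL7 messages into a Delta table as they arrive. The Kafka broker, topic, and paths are hypothetical, and the HL7 parsing itself (which Smolder provides) is left to the linked post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Continuously consume raw HL7 messages as they leave the EHR's integration
# engine. The broker address and topic name here are hypothetical.
raw_messages = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "hl7-adt-feed")
    .load()
    .selectExpr("CAST(value AS STRING) AS message", "timestamp")
)

# Land the raw feed in a Delta table; downstream jobs (e.g., Smolder's HL7
# parsing, a sepsis-prediction model) can then consume it incrementally.
stream = (
    raw_messages.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/hl7_raw")
    .outputMode("append")
    .start("/tmp/delta/hl7_raw")
)
```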

How to Manage Python Dependencies in PySpark

Controlling an application’s environment is often challenging in distributed computing: it is difficult to ensure that every node has the desired environment in which to execute, it may be tricky to know where the user’s code is actually running, and so on.
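One common remedy, and one of the approaches the post walks through, is to package the driver's environment and ship it to every executor. A minimal sketch using conda-pack with Spark's spark.archives option (available in Spark 3.x) follows; the archive and alias names are illustrative.

```python
# Build and pack the driver's Conda environment once, e.g.:
#   conda create -y -n pyspark_env -c conda-forge python=3.8 pandas pyarrow
#   conda activate pyspark_env && conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession

# Point executors at the Python interpreter inside the unpacked archive.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # Spark ships the archive to every node and unpacks it under the alias
    # "environment", so all workers see the same set of packages.
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)

# UDFs can now import libraries that exist only in the packed environment.
```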

Natively Query Your Delta Lake With Scala, Java, and Python

Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark™ APIs. The project has been deployed at thousands of organizations and processes exabytes of data each week, making it an indispensable pillar in data and AI architectures. More than 75% of the data scanned on the Databricks Platform is on Delta Lake!
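For Python, the Delta Rust API is exposed through the deltalake package, which can read a Delta table's transaction log and data without a Spark cluster. Here is a minimal sketch; the table path is hypothetical, and the exact method set may vary between deltalake releases.

```python
# Query a Delta table without a Spark cluster via the Delta Rust API's
# Python bindings (the deltalake package). The table path is hypothetical.
from deltalake import DeltaTable

dt = DeltaTable("/data/events_delta")

print(dt.version())   # current snapshot version of the table
print(dt.files())     # Parquet files that make up that snapshot

# Materialize the snapshot as a PyArrow table, then as pandas if desired.
arrow_table = dt.to_pyarrow_table()
df = arrow_table.to_pandas()
print(df.head())
```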

How to Train XGBoost With Spark

XGBoost is currently one of the most popular machine learning libraries, and distributed training is increasingly required to accommodate the rapidly growing size of datasets. To run distributed training on a Spark cluster, the XGBoost4J-Spark package can be used in Scala pipelines, but it presents issues for Python pipelines. This article covers best practices for integrating XGBoost4J-Spark with Python and how to avoid common problems.
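For context, the sketch below shows what distributed XGBoost training from PySpark can look like. Note that it uses the native xgboost.spark estimator added in later XGBoost releases (1.7+), a different route than the XGBoost4J-Spark wrapper techniques the article covers, and the table, feature columns, and worker count are illustrative assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier  # requires xgboost>=1.7

spark = SparkSession.builder.getOrCreate()

# Hypothetical training table with numeric feature columns and a binary label.
train = spark.read.table("training_data")
assembled = VectorAssembler(
    inputCols=["f1", "f2", "f3"], outputCol="features"
).transform(train)

# num_workers controls how many Spark tasks participate in distributed training.
classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4,
)
model = classifier.fit(assembled)
predictions = model.transform(assembled)
predictions.select("label", "prediction", "probability").show(5)
```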

Improving the Spark Exclusion Mechanism in Databricks

Ed. Note: This article contains references to the term blacklist, a term that the Apache Spark community is actively working to remove from the project. The feature will be renamed in the upcoming Spark 3.1 release to be more inclusive, and we look forward to this new release.