Apache® Spark™ News

How We Built Databricks on Google Kubernetes Engine (GKE)

Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard containers running on top of Google Kubernetes Engine (GKE).

An Experimentation Pipeline for Extracting Topics From Text Data Using PySpark

This post is part of a series of posts on topic modeling. Topic modeling is the process of extracting topics from a set of text documents, which is useful for understanding or summarizing large collections of text. A document can be a line of text, a paragraph or a chapter in a book; the abstraction of a document refers to a standalone unit of text over which we operate. A collection of documents is referred to as a corpus (plural: corpora).
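To give a flavor of what such a pipeline looks like, here is a minimal sketch of topic extraction with PySpark's MLlib, assuming an LDA-based approach; the toy documents and parameter values are illustrative, not taken from the series.

```python
# A minimal sketch: tokenize, remove stop words, vectorize, then fit LDA.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame(
    [("Spark makes distributed data processing simple",),
     ("Topic models summarize large document collections",)],
    ["text"],
)

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000),
])
features = pipeline.fit(docs).transform(docs)

# k topics; each topic is a distribution over the vocabulary
lda = LDA(k=2, maxIter=10)
lda.fit(features).describeTopics(maxTermsPerTopic=5).show(truncate=False)
```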

The Delta Between ML Today and Efficient ML Tomorrow

If you are working as a data scientist, you might have your full modeling process sorted, and you may even have deployed a machine learning model into production using MLflow. You might have experimented using MLflow Tracking and promoted models using the MLflow Model Registry. You are probably quite happy with the reproducibility this provides, as you are able to track things like the code version, cluster setup and data location.
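For reference, here is a minimal sketch of that tracking-and-registry workflow with the MLflow API; the scikit-learn model and the registered model name are illustrative, not from the post.

```python
# A minimal sketch of MLflow tracking plus Model Registry registration.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the model artifact and register it in the Model Registry
    # (registration requires a registry-backed tracking server, e.g. Databricks)
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="iris_classifier")
```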

AML Solutions at Scale Using Databricks Lakehouse Platform

Anti-Money Laundering (AML) compliance has undoubtedly been one of the top agenda items for regulators providing oversight of financial institutions across the globe. As money laundering has grown more sophisticated over the decades, so have the regulatory requirements designed to counter modern money laundering and terrorist financing schemes. The Bank Secrecy Act of 1970 provided guidance and a framework for financial institutions to put proper controls in place to monitor financial transactions and report suspicious activity to the relevant authorities, setting the framework for how financial institutions combat money laundering and the financing of terrorism today.

What’s New in Apache Spark™ 3.1 Release for Structured Streaming

Structured Streaming, which provides stream processing on top of the Spark SQL engine, is one of the most important components of Apache Spark™. In this blog post, we summarize the notable Structured Streaming improvements in the latest 3.1 release, including a new streaming table API, support for new types of stream-stream joins and multiple UI enhancements. In addition, schema validation and improvements to the Apache Kafka data source deliver better usability. Finally, various enhancements were made for improved read/write performance with the FileStream source and sink.
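As a taste of the new streaming table API, here is a minimal sketch using the DataStreamReader.table and DataStreamWriter.toTable methods added in 3.1; the table names, filter and checkpoint path are illustrative.

```python
# A minimal sketch of the streaming table API (new in Spark 3.1).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a table as a stream
events = spark.readStream.table("raw_events")

# Transform and write the stream out to another table
query = (events
         .filter("event_type = 'click'")
         .writeStream
         .option("checkpointLocation", "/tmp/checkpoints/clicks")
         .toTable("click_events"))
```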

How (Not) to Tune Your Model With Hyperopt

So, you want to build a model. You’ve solved the harder problems of accessing data, cleaning it and selecting features. Now, you just need to fit a model, and the good news is that there are many open source tools available: xgboost, scikit-learn, Keras, and so on. The bad news is that there are so many of them, each with so many knobs to turn. How much regularization do you need? What learning rate? And what is “gamma” anyway?
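For orientation, here is a minimal sketch of how Hyperopt drives such a search, assuming an xgboost classifier; the search space and objective are illustrative, not the post's recommended settings.

```python
# A minimal sketch of hyperparameter search with Hyperopt and xgboost.
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

def objective(params):
    model = XGBClassifier(
        learning_rate=params["learning_rate"],
        gamma=params["gamma"],
        reg_lambda=params["reg_lambda"],
    )
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    # Hyperopt minimizes the loss, so return the negative accuracy
    return {"loss": -accuracy, "status": STATUS_OK}

space = {
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "gamma": hp.uniform("gamma", 0, 5),
    "reg_lambda": hp.loguniform("reg_lambda", -3, 3),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
print(best)
```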

Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3

Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more and more enterprises facing these challenges are finding they can overcome the scalability and accuracy limits of past solutions.
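A minimal sketch of the underlying pattern, assuming Spark 3's applyInPandas to fit one Prophet model per store-item group; the table and column names ("sales_history", "store", "item", "ds", "y") are illustrative.

```python
# A minimal sketch of fine-grained, per-group forecasting with Prophet.
import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("store", StringType()),
    StructField("item", StringType()),
    StructField("ds", TimestampType()),
    StructField("yhat", DoubleType()),
])

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Fit one model on this group's history and forecast 30 days ahead
    model = Prophet().fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=30)
    result = model.predict(future)[["ds", "yhat"]]
    result["store"] = pdf["store"].iloc[0]
    result["item"] = pdf["item"].iloc[0]
    return result[["store", "item", "ds", "yhat"]]

history = spark.table("sales_history")  # one row per store, item and day
forecasts = history.groupBy("store", "item").applyInPandas(forecast_group, schema)
```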

Analyzing Algorand Blockchain Data With Databricks Delta (Part 2)

This article is the second part of a two-part blog series. In part one, we demonstrated the analysis of operational telemetry data. In part two, we will show how to use Databricks to analyze the transactional aspects of the Algorand blockchain. A robust ecosystem of accounts, transactions and digital assets is essential for the health of the blockchain. Assets are digital tokens that represent reward tokens, cryptocurrencies, supply chain assets, etc. The price of the Algo digital currency reflects the intrinsic value of the underlying blockchain, and healthy transaction volume indicates user engagement.
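As a rough illustration of the kind of analysis involved, here is a minimal sketch that reads transaction data from a Delta table and computes daily volume as a simple health indicator; the path and column names ("block_time", "txn_type", "amount") are hypothetical.

```python
# A minimal sketch of querying blockchain transactions stored in Delta Lake
# (assumes a Spark session with Delta Lake support, e.g. Databricks).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

txns = spark.read.format("delta").load("/mnt/algorand/transactions")

daily_volume = (txns
    .groupBy(F.to_date("block_time").alias("date"), "txn_type")
    .agg(F.count("*").alias("txn_count"),
         F.sum("amount").alias("total_amount"))
    .orderBy("date"))
daily_volume.show()
```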

Introducing Apache Spark™ 3.1

We are excited to announce the availability of Apache Spark 3.1 on Databricks as part of Databricks Runtime 8.0. We want to thank the Apache Spark™ community for all their valuable contributions to the Spark 3.1 release.

Amplify Insights into Your Industry With Geospatial Analytics

Data science is becoming commonplace, and most companies are leveraging analytics and business intelligence to help make data-driven business decisions. But are you supercharging your analytics and decision-making with geospatial data? Location intelligence, and specifically geospatial analytics, can help uncover important regional trends and behaviors that impact your business. This goes beyond looking at location data aggregated by zip codes, which, interestingly, are not a good representation of geographic boundaries in the US or in other parts of the world.
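As one illustration of going beyond zip codes, here is a minimal sketch of a point-in-polygon lookup with PySpark and Shapely; the region polygon, coordinates and column names are hypothetical.

```python
# A minimal sketch: tag events with the geographic region containing them.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from shapely.geometry import Point, Polygon

spark = SparkSession.builder.getOrCreate()

# Toy region polygons keyed by name (in practice, loaded from GeoJSON or shapefiles)
regions = {
    "downtown": Polygon([(-122.42, 37.77), (-122.40, 37.77),
                         (-122.40, 37.79), (-122.42, 37.79)]),
}

@udf(returnType=StringType())
def region_of(lon, lat):
    point = Point(lon, lat)
    for name, polygon in regions.items():
        if polygon.contains(point):
            return name
    return None  # outside all known regions

events = spark.createDataFrame([(-122.41, 37.78), (-122.30, 37.70)], ["lon", "lat"])
events.withColumn("region", region_of("lon", "lat")).show()
```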