Apache® Spark™ News

On-Demand Virtual Session: Customer Lifetime Value

Before you can provide personalized services and offers to your customers, you need to know who they are. In this virtual workshop, retail and media experts demonstrate how to build advanced customer lifetime value (CLV) models. From there, companies can direct the right level of investment toward each customer to deliver personalized offers, save tactics, and experiences.

Simplify Data Conversion from Apache Spark to TensorFlow and PyTorch

Petastorm is a popular open-source library from Uber that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. We are excited to announce that Petastorm 0.9.0 supports the easy conversion of data from an Apache Spark DataFrame to a TensorFlow Dataset and a PyTorch DataLoader. The new Spark Dataset Converter API makes it easier to do distributed model training and inference on massive data from multiple data sources. The Spark Dataset Converter API was contributed by Xiangrui Meng, Weichen Xu, and Liang Zhang (Databricks), in collaboration with Yevgeni Litvin and Travis Addair (Uber).
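As a minimal sketch of how the converter is typically used (the cache directory, table name, and column names below are illustrative assumptions, not details from the announcement):

```python
# Sketch of the Petastorm Spark Dataset Converter workflow.
# The cache path, table name, and feature columns are assumed for illustration.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Petastorm materializes the DataFrame as Parquet under this cache directory.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")  # assumed local path

df = spark.read.table("training_data").select("features", "label")  # hypothetical table
converter = make_spark_converter(df)

# TensorFlow: consume the cached data as a tf.data.Dataset.
with converter.make_tf_dataset(batch_size=64) as tf_dataset:
    for batch in tf_dataset.take(1):
        print(batch)

# PyTorch: consume the same cache as a DataLoader.
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    for batch in dataloader:
        break

converter.delete()  # remove the cached Parquet files when finished
```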

Accelerating Somatic Variant Calling with the Databricks TNSeq Pipeline

Genetic analyses are a critical tool in revolutionizing how we treat cancer. By understanding the mutations present in tumor cells, researchers can gain clues that lead to drug targets and eventually new therapies. At the same time, the genetic characterization of individual tumors enables physicians to tailor treatments to individual patients, improving outcomes while reducing side effects.

Modernizing Risk Management Part 2: Aggregations, Backtesting at Scale and Introducing Alternative Data

Understanding and mitigating risk is at the forefront of any financial services institution. However, as discussed in the first blog of this two-part series, banks today are still struggling to keep up with the emerging risks and threats facing their business. Plagued by the limitations of on-premises infrastructure and legacy technologies, banks until recently have not had the tools to effectively build a modern risk management practice. Luckily, a better alternative exists today, based on open-source technologies powered by cloud-native infrastructure. This modern risk management framework enables intraday views, on-demand aggregations, and the ability to future-proof and scale risk management. In this two-part blog series, we demonstrate how to modernize traditional value-at-risk (VaR) calculation through the use of Delta Lake, Apache Spark™ and MLflow in order to enable a more agile, forward-looking approach to risk management.
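As a rough illustration of the kind of on-demand aggregation described here, the sketch below computes a 95% value-at-risk figure as a quantile over simulated portfolio returns stored in a Delta table; the table path, column names, and confidence level are assumptions for illustration, not details taken from the series.

```python
# Hypothetical sketch: read Monte Carlo trial returns from a Delta table and
# compute a 95% value-at-risk per run date. Path and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

returns = spark.read.format("delta").load("/mnt/risk/monte_carlo_returns")  # assumed path

# VaR at 95% confidence is the 5th percentile of the simulated P&L distribution.
var_95 = (
    returns
    .groupBy("run_date")
    .agg(F.expr("percentile_approx(portfolio_return, 0.05)").alias("var_95"))
    .orderBy("run_date")
)

var_95.show()
```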

Monitor Your Databricks Workspace with Audit Logs

Cloud computing has fundamentally changed how companies operate: users are no longer subject to the restrictions of on-premises hardware deployments, such as the physical limits of resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services, however, come challenges in properly monitoring how your users utilize these conveniently available resources. Failure to do so could result in problematic and costly anti-patterns (with both cloud provider core resources and a PaaS like Databricks). Databricks is cloud-native by design and thus tightly coupled with public cloud providers such as Microsoft Azure and Amazon Web Services, fully taking advantage of this new paradigm, and its audit logs capability provides administrators a centralized way to understand and govern activity happening on the platform. Administrators can use Databricks audit logs to monitor patterns like the number of clusters or jobs created in a given day, the users who performed those actions, and any users who were denied authorization into the workspace.
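As a rough sketch of that kind of monitoring, the example below reads audit logs delivered as JSON and aggregates cluster-creation events and denied logins; the delivery path and field names (serviceName, actionName, userIdentity.email, response.statusCode) follow the commonly documented audit log schema but should be treated as assumptions here.

```python
# Hypothetical sketch: query Databricks audit logs delivered as JSON.
# The delivery path and field names are assumptions based on the documented schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

logs = spark.read.json("/mnt/audit-logs/")  # assumed delivery location

# Cluster creations per user per day (assumes epoch-millisecond timestamps).
cluster_creates = (
    logs
    .where((F.col("serviceName") == "clusters") & (F.col("actionName") == "create"))
    .groupBy(
        F.to_date(F.from_unixtime((F.col("timestamp") / 1000).cast("long"))).alias("day"),
        F.col("userIdentity.email").alias("user"),
    )
    .count()
    .orderBy("day")
)

# Users who were denied authorization into the workspace (non-200 account responses).
denied_logins = (
    logs
    .where((F.col("serviceName") == "accounts") & (F.col("response.statusCode") != 200))
    .select("timestamp", F.col("userIdentity.email").alias("user"), "actionName")
)

cluster_creates.show()
denied_logins.show()
```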