Customer Lifetime Value Part 2: Estimating Future Spend
Download the Customer Lifetimes Part 2 notebook to demo the solution covered below, and watch the on-demand virtual workshop to learn more. You can also go to Part 1 to learn how to estimate customer lifetime duration.
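As a hedged sketch of what estimating future spend can look like in practice, the snippet below uses the Gamma-Gamma model from the open-source lifetimes library; the notebook's exact method may differ, and the transaction file and column names are placeholders.

```python
import pandas as pd
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.utils import summary_data_from_transaction_data

# Placeholder transaction log: one row per purchase with id, date, and amount
transactions = pd.read_csv("transactions.csv")  # hypothetical file
summary = summary_data_from_transaction_data(
    transactions, "customer_id", "transaction_date", monetary_value_col="amount"
)

# BG/NBD models purchase frequency; Gamma-Gamma models spend per transaction
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

repeat = summary[summary["frequency"] > 0]  # Gamma-Gamma needs repeat buyers
ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(repeat["frequency"], repeat["monetary_value"])

# Combine both models into a 12-month CLV estimate with monthly discounting
clv = ggf.customer_lifetime_value(
    bgf, repeat["frequency"], repeat["recency"], repeat["T"],
    repeat["monetary_value"], time=12, discount_rate=0.01,
)
```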
Apache® Spark™ News
On-Demand Virtual Session: Customer Lifetime Value
Before you can provide personalized services and offers to your customers, you need to know who they are. In this virtual workshop, retail and media experts demonstrate how to build advanced customer lifetime value (CLV) models. From there, companies can direct the right level of investment to each customer in order to create personalized offers, retention (save) tactics, and experiences.
Simplify Data Conversion from Apache Spark to TensorFlow and PyTorch
Petastorm is a popular open-source library from Uber that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. We are excited to announce that Petastorm 0.9.0 supports the easy conversion of data from an Apache Spark DataFrame to a TensorFlow Dataset or PyTorch DataLoader. The new Spark Dataset Converter API makes it easier to do distributed model training and inference on massive data from multiple data sources. The Spark Dataset Converter API was contributed by Xiangrui Meng, Weichen Xu, and Liang Zhang (Databricks), in collaboration with Yevgeni Litvin and Travis Addair (Uber).
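As a quick, hedged illustration (the cache path and input dataset below are placeholders, and an existing SparkSession named spark is assumed), the converter wraps a Spark DataFrame and hands back framework-native inputs:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Directory where the converter materializes the DataFrame as Parquet
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

df = spark.read.parquet("/tmp/training_data")  # hypothetical source
converter = make_spark_converter(df)

# TensorFlow: iterate the same data as a tf.data.Dataset
with converter.make_tf_dataset() as tf_dataset:
    for batch in tf_dataset.take(1):
        print(batch)

# PyTorch: or as a torch.utils.data.DataLoader
with converter.make_torch_dataloader() as dataloader:
    batch = next(iter(dataloader))

converter.delete()  # remove the cached Parquet files when finished
```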
Accelerating Somatic Variant Calling with the Databricks TNSeq Pipeline
Genetic analyses are a critical tool in revolutionizing how we treat cancer. By understanding the mutations present in tumor cells, researchers can gain clues that lead to drug targets and, eventually, new therapies. At the same time, genetic characterization of individual tumors enables physicians to tailor treatments to individual patients, improving outcomes while reducing side effects.
Data Teams Unite! Countdown to Spark + AI Summit
When we made the decision to transform Summit into a completely virtual event, we wanted to make sure that we could offer you as much of the in-person experience as possible in the new online format.
Modernizing Risk Management Part 2: Aggregations, Backtesting at Scale and Introducing Alternative Data
Understanding and mitigating risk is at the forefront of any financial services institution's agenda. However, as discussed in the first blog of this two-part series, banks today are still struggling to keep up with the emerging risks and threats facing their business. Plagued by the limitations of on-premises infrastructure and legacy technologies, banks until recently have not had the tools to effectively build a modern risk management practice. Luckily, a better alternative exists today, based on open-source technologies and powered by cloud-native infrastructure. This modern risk management framework enables intraday views, on-demand aggregations, and the ability to future-proof and scale risk management. In this two-part blog series, we demonstrate how to modernize traditional value-at-risk calculation through the use of Delta Lake, Apache Spark™ and MLflow in order to enable a more agile and forward-looking approach to risk management.
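To make "on-demand aggregations" concrete, here is a minimal sketch of computing 99% portfolio value-at-risk from Monte Carlo P&L simulations stored in Delta Lake; the table path and schema are assumptions for illustration, not the series' actual implementation, and an existing SparkSession named spark is assumed:

```python
from pyspark.sql import functions as F

# Assumed schema: one row per (trial_id, instrument) of simulated P&L,
# written to a Delta table by an upstream Monte Carlo job
pnl = spark.read.format("delta").load("/tmp/delta/monte_carlo_pnl")

# Aggregate simulated P&L per trial across the portfolio (or any slice of it)
portfolio = pnl.groupBy("trial_id").agg(F.sum("pnl").alias("portfolio_pnl"))

# 99% VaR is the 1st percentile of the simulated portfolio P&L distribution
var_99 = portfolio.stat.approxQuantile("portfolio_pnl", [0.01], 0.001)[0]
print(f"99% VaR: {var_99:,.2f}")
```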
Customer Lifetime Value Part 1: Estimating Customer Lifetimes
Download the Customer Lifetimes Part 1 notebook to demo the solution covered below, and watch the on-demand virtual workshop to learn more. You can also go to Part 2 to learn how to estimate future customer spend.
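As a hedged sketch of what estimating customer lifetimes can look like (the notebook's exact approach may differ, and the input file is a placeholder), the widely used lifetimes library fits a BG/NBD model to per-customer purchase summaries:

```python
import pandas as pd
from lifetimes import BetaGeoFitter
from lifetimes.utils import summary_data_from_transaction_data

# Placeholder transaction log: one row per purchase with customer id and date
transactions = pd.read_csv("transactions.csv")  # hypothetical file
summary = summary_data_from_transaction_data(
    transactions, "customer_id", "transaction_date"
)

# BG/NBD models purchase frequency and dropout from frequency/recency/age
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Probability that each customer is still "alive" at the end of the data
summary["p_alive"] = bgf.conditional_probability_alive(
    summary["frequency"], summary["recency"], summary["T"]
)
```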
Monitor Your Databricks Workspace with Audit Logs
Cloud computing has fundamentally changed how companies operate: users are no longer subject to the restrictions of on-premises hardware deployments, such as the physical limits of resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services, however, come challenges in monitoring how your users utilize these readily available resources. Failure to do so could result in problematic and costly anti-patterns, both with core cloud provider resources and with a PaaS like Databricks. Databricks is cloud-native by design and thus tightly coupled with public cloud providers such as Microsoft Azure and Amazon Web Services, taking full advantage of this new paradigm, and its audit logs capability provides administrators a centralized way to understand and govern activity happening on the platform. Administrators can use Databricks audit logs to monitor patterns like the number of clusters or jobs created in a given day, the users who performed those actions, and any users who were denied authorization into the workspace.
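Here is a minimal sketch of the kind of query this enables, assuming the audit logs have been delivered as JSON to the path below and follow the documented schema (serviceName, actionName, a timestamp in epoch milliseconds, response.statusCode, and userIdentity.email); the delivery location is a placeholder:

```python
from pyspark.sql import functions as F

logs = spark.read.json("dbfs:/mnt/audit-logs/")  # hypothetical delivery location

# Users denied authorization into the workspace, per day
denied_logins = (
    logs.withColumn("date", F.to_date(F.from_unixtime(F.col("timestamp") / 1000)))
        .filter(F.col("serviceName") == "accounts")
        .filter(F.col("response.statusCode") == 401)
        .groupBy("date", "userIdentity.email")
        .count()
)
display(denied_logins)  # Databricks notebook helper
```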
Vectorized R I/O in Upcoming Apache Spark 3.0
R is one of the most popular programming languages in data science. Dedicated to statistical analysis, it offers a number of extensions, such as RStudio addins and a wealth of R packages, for data processing and machine learning tasks, and it makes it easy for data scientists to visualize their data sets.
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
This is a joint engineering effort between the Databricks Apache Spark engineering team — Wenchen Fan, Herman van Hovell and MaryAnn Xue — and the Intel engineering team — Ke Jia, Haifeng Chen and Carson Wang.
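Adaptive query execution re-optimizes query plans at runtime using statistics observed during execution; in Spark 3.0 it is switched on with a single configuration flag, as in this minimal sketch (assuming an existing SparkSession named spark):

```python
# Spark 3.0: enable adaptive query execution so joins and shuffle partitions
# can be re-planned at runtime based on observed statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Sub-feature: coalesce small shuffle partitions after each stage completes
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```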