Apache® Spark™ News

Advanced Analytics with HyperLogLog Functions in Apache Spark

Pre-aggregation is a common technique in the high-performance analytics toolbox. For example, 10 billion rows of website visitation data per hour may be reducible to 10 million rows of visit counts, aggregated by the superset of dimensions used in common queries: a 1,000x reduction in data processing volume, with a corresponding decrease in processing costs and in the time to see the result of any query. Further improvements could come from computing higher-level aggregates, e.g., by day in the time dimension or by site as opposed to by URL.
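
To make the idea concrete, here is a minimal PySpark sketch of such a pre-aggregation using only built-in functions; the table name website_visits and its columns (site, url, visit_date, visitor_id) are hypothetical, chosen for illustration. Note that Spark's built-in approx_count_distinct (itself HyperLogLog++ based) returns a final estimate rather than a mergeable sketch, so the distinct-visitor column below cannot be rolled up further; closing that re-aggregation gap is what dedicated HyperLogLog sketch functions are for.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("preagg-sketch").getOrCreate()

# Hypothetical raw event table: one row per page view.
visits = spark.table("website_visits")  # assumed columns: site, url, visit_date, visitor_id

# Pre-aggregate: collapse raw events to one row per (site, url, visit_date).
preagg = (
    visits
    .groupBy("site", "url", "visit_date")
    .agg(
        F.count("*").alias("visit_count"),  # additive: safe to roll up again later
        F.approx_count_distinct("visitor_id").alias("approx_visitors"),  # final estimate: NOT mergeable
    )
)

# Higher-level rollup computed from the small pre-aggregate instead of the raw
# data: by site and day, dropping the URL dimension. Only the additive
# visit_count can be correctly re-aggregated this way.
by_site_day = (
    preagg
    .groupBy("site", "visit_date")
    .agg(F.sum("visit_count").alias("visit_count"))
)
```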

Spark + AI Summit 2019 Product Announcements and Recap. Watch the keynote recordings today!

Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ community, brought nearly 5,000 registered data scientists, engineers, and business leaders to San Francisco’s Moscone Center to find out what’s coming next. Watch the keynote recordings today and learn more about the latest product announcements for Apache Spark, MLflow, and our newest open source addition, Delta Lake!

Detecting Financial Fraud at Scale with Decision Trees and MLflow on Databricks

Detecting fraudulent patterns at scale is a challenge, no matter the use case. The massive amounts of data to sift through, the complexity of constantly evolving techniques, and the very small number of actual examples of fraudulent behavior make the task comparable to finding a needle in a haystack without knowing what the needle looks like. In the world of finance, added concerns about security and the importance of explaining how fraudulent behavior was identified further increase the complexity of the task.

Understanding Dynamic Time Warping

This blog is part 1 of our two-part series, Using Dynamic Time Warping and MLflow to Detect Sales Trends. For part 2, see Using Dynamic Time Warping and MLflow to Detect Sales Trends.

A Guide to Healthcare and Life Sciences Talks at Spark + AI Summit 2019

Data and AI are ushering in a new era of precision medicine. The scale of the cloud, combined with advancements in machine learning, is enabling healthcare and life sciences organizations to use their mountains of data—such as electronic health records, genomics, real-world evidence, claims, and more—to drive innovation across the entire ecosystem, from accelerating drug discovery to preventing chronic disease.

A Guide to Data Engineering Talks at Spark + AI Summit 2019

Big data practitioners grapple with data quality issues and data pipeline complexities; they are the bane of their existence. Whether you are charged with advanced analytics, developing new machine learning models, providing operational reporting, or managing the data infrastructure, concern with data quality is a common theme. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.