Apache® Spark™ News

Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions

In the post Using AutoML Toolkit to Automate Loan Default Predictions, we showed how the Databricks Labs AutoML Toolkit simplified Machine Learning model feature engineering and model building optimization (MBO). It also improved the area under the curve (AUC) from 0.6732 (handmade XGBoost model) to 0.723 (AutoML XGBoost model). With AutoML Toolkit Release 0.6.1, we have upgraded to MLflow version 1.3.0 and introduced a new Pipeline API that simplifies feature generation and inference.

Scaling Hyperopt to Tune Machine Learning Models in Python

Hyperopt is an open-source hyperparameter tuning library for Python. With 445,000+ PyPI downloads each month and 3,800+ stars on GitHub as of October 2019, it has strong adoption and community support. For data scientists, Hyperopt provides a general API for searching over hyperparameters and model types. Hyperopt offers two tuning algorithms: random search and the Bayesian method Tree of Parzen Estimators (TPE).
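
To give a flavor of that API, here is a minimal sketch; the objective function and search space are illustrative, not taken from the post. Swapping in hyperopt's rand.suggest gives random search, and hyperopt also offers a SparkTrials drop-in for Trials to distribute evaluations across a Spark cluster.

```python
# A minimal sketch of Hyperopt's API: minimize a toy objective over one
# hyperparameter with the Tree of Parzen Estimators (TPE) algorithm.
from hyperopt import fmin, tpe, hp, Trials

def objective(x):
    # In a real tuning job this would train a model and return a validation loss.
    return (x - 3) ** 2

trials = Trials()  # records the result of every evaluation
best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),  # search space for the hyperparameter
    algo=tpe.suggest,                # use hyperopt.rand.suggest for random search
    max_evals=50,                    # number of hyperparameter settings to try
    trials=trials,
)
print(best)  # e.g. {'x': 3.01}
```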

Scaling Financial Time Series Analysis Beyond PCs and Pandas: On-Demand Webinar, Slides and FAQ Now Available!

On October 9th, 2019, we hosted a live webinar, Scaling Financial Time Series Analysis Beyond PCs and Pandas, with Junta Nakai, Industry Leader for Financial Services at Databricks, and Ricardo Portilla, Solution Architect at Databricks. The webinar showcased the content covered in the blog post Democratizing Financial Time Series Analysis with Databricks.

Democratizing Financial Time Series Analysis with Databricks

The role of data scientists, data engineers, and analysts at financial institutions includes (but is not limited to) protecting hundreds of billions of dollars' worth of assets and protecting investors from trillion-dollar impacts, say, from a flash crash. One of the biggest technical challenges underlying these problems is scaling time series manipulation. Tick data, alternative data sets such as geospatial or transactional data, and fundamental economic data are examples of the rich data sources available to financial institutions, all of which are naturally indexed by timestamp. Solving business problems in finance such as risk, fraud, and compliance ultimately rests on being able to aggregate and analyze thousands of time series in parallel. Older, RDBMS-based technologies do not easily scale when analyzing trading strategies or conducting regulatory analyses over years of historical data. Moreover, many existing time series technologies use specialized languages instead of standard SQL or Python-based APIs.
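
As a hedged illustration (not code from the blog post), the sketch below shows the kind of parallel aggregation involved: rolling hypothetical tick data, indexed by timestamp, up to one-minute bars per symbol with PySpark. The input path and column names (symbol, ts, price) are assumptions.

```python
# A hedged sketch: aggregate many time series in parallel with PySpark by
# rolling tick data up to one-minute bars per symbol.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("minute-bars").getOrCreate()

ticks = spark.read.parquet("/data/ticks")  # hypothetical tick data

minute_bars = (
    ticks
    .groupBy("symbol", F.window("ts", "1 minute").alias("bar"))  # per-symbol minute windows
    .agg(
        F.min("price").alias("low"),
        F.max("price").alias("high"),
        F.avg("price").alias("avg_price"),
        F.count("*").alias("n_ticks"),
    )
)
minute_bars.show()
```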

Diving Into Delta Lake: Schema Enforcement & Evolution

Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions – new ways of seeing things we had no conception of before. These mental models are not unlike a table’s schema, defining how we categorize and process new information.
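
As a minimal sketch of the behavior the post dives into, assuming a Spark session with the Delta Lake library available and an illustrative table path: schema enforcement rejects an append whose columns do not match the table, while schema evolution is opted into explicitly with the mergeSchema option.

```python
# A minimal sketch of Delta Lake schema enforcement and evolution. Assumes a
# Spark session with the Delta Lake library available; the path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-schema").getOrCreate()
path = "/tmp/delta/events"

# Create a small table with two columns.
spark.createDataFrame([(1, "click")], ["id", "event"]) \
    .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a DataFrame with an unexpected extra column
# is rejected with an AnalysisException.
new_rows = spark.createDataFrame([(2, "view", "mobile")], ["id", "event", "device"])
try:
    new_rows.write.format("delta").mode("append").save(path)
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Schema evolution: opt in explicitly, and the new column is added to the table.
new_rows.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```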

Engineering population scale Genome-Wide Association Studies with Apache Spark™, Delta Lake, and MLflow

The advent of genome-wide association studies (GWAS) in the late 2000s enabled scientists to begin to understand the causes of complex diseases such as diabetes and Crohn’s disease at their most fundamental level. However, academic bioinformatics tools to perform GWAS have not kept pace with the growth of genomic data, which has been doubling globally every seven months.

Guest Blog: How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas

At Virgin Hyperloop One, we work on making Hyperloop a reality, so we can move passengers and cargo at airline speeds but at a fraction of the cost of air travel. In order to build a commercially viable system, we collect and analyze a large, diverse quantity of data, including Devloop Test Track runs, numerous test rigs, and various simulation, infrastructure, and socio-economic data. Most of our scripts that handle this data are written in Python, with pandas as the main data processing tool that glues everything together. In this blog post, we want to share our experience of scaling our data analytics with Koalas, achieving massive speedups with minor code changes.
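
As a hedged sketch of what those "minor code changes" look like (the file and column names are illustrative, not from our pipeline), the same groupby aggregation can be written with pandas on a single machine and with Koalas distributed on a Spark cluster:

```python
# A hedged sketch of moving from pandas to Koalas with minimal changes.
# The file and column names are illustrative.
import pandas as pd
import databricks.koalas as ks

# pandas: runs on a single machine
pdf = pd.read_csv("runs.csv")
pandas_summary = pdf.groupby("test_rig")["max_speed"].mean()

# Koalas: same API, but the work is distributed on a Spark cluster
kdf = ks.read_csv("runs.csv")
koalas_summary = kdf.groupby("test_rig")["max_speed"].mean()

print(pandas_summary)
print(koalas_summary)
```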