  • This event has passed.

Spark London: Spark Strata Highlights!

May 25, 2017 @ 6:30 pm - 8:30 pm

A deep dive into Spark SQL’s Catalyst optimizer

Speaker: Herman van Hövell tot Westerflier (Databricks)


Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.x, from DataFrames and Datasets to streaming. At its core, Catalyst is a general library for manipulating trees.

Herman van Hövell tot Westerflier explores a modular compiler frontend for Spark, built on this library, that includes a query analyzer, an optimizer, and an execution planner. Herman offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new features such as cost-based optimization (CBO) are implemented using Catalyst. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query. At the end of the talk he will run a short hands-on lab showing how you can use Catalyst in practice.


Herman van Hövell is a Spark committer working on Spark SQL at Databricks. Previously, Herman was a consultant working for clients in banking, manufacturing, and logistics. His interests include database systems, optimization, and simulation.

Machine Learning and Deep Learning on Spark/Hadoop

Speaker: Vartika Singh (Cloudera)


Traditional machine learning and feature engineering algorithms are not efficient enough to extract the complex, nonlinear patterns that are hallmarks of big data. Deep learning, on the other hand, helps translate the scale and complexity of the data into elegant solutions. Colocating a data processing pipeline with a deep learning framework makes data exploration and the evolution of algorithms and models much simpler, and at the same time makes data governance and lineage tracking easier. In this talk we cover the features available through Spark ML and third-party libraries for feature engineering and machine learning, along with a few deep learning frameworks that can run in a Hadoop/Spark environment to take advantage of cluster resources.


Vartika Singh is a Field Data Science architect at Cloudera with over 15 years of experience in applying machine learning techniques. More recently she has been focussed on architecting big data solutions and pipelines around machine learning and related data engineering use cases.

Royal Statistical Society