Lambda Streaming, Overkill Analytics, BigDL with Apache Spark
April 20, 2017 @ 6:00 pm - 9:00 pm
For April we have three action-packed sessions! Thank you Blueprint Consulting for hosting us! Thank you MemSQL for sponsoring food/drink!
6:00-6:30 Food/Drink and socialize
6:35-7:00 Lambda architecture with Spark Streaming and Azure (Gary and Chris from Blueprint Consulting)
7:00-7:45 Overkill Analytics with Spark (Claudiu from ubix.ai)
7:45-8:30 Distributed Deep Learning at Scale on Apache Spark with BigDL (Ding Ding and Sergey from Intel)
Lambda architecture with Spark Streaming and Azure
In this session, we will present our experience using lambda architecture at Microsoft in a production environment for streaming and batch event data processing on an Azure infrastructure. Specifically:
* How we handle event data ingestion into Spark using Azure Event Hubs and Azure Stream Analytics
* Our experience in the configuration of Azure’s Apache Spark offering, HDInsight
* Our experience in using Apache HBase and Apache Phoenix as a store for Spark-processed, enriched event data
Gary Nakanelua is a Director at Blueprint Consulting helping clients successfully adopt a variety of technology solutions including the Hadoop ecosystem, machine learning and cloud computing. With over 12 years’ experience in design, engineering and IT operations, he has worked with some of the largest technology companies in the world.
Chris Carter is a Director at Blueprint focusing on the Big Data space. Chris has a background in numerous Big Data projects across both the enterprise and startup worlds. His experience spans numerous industry sectors, with a recent focus on IoT and the application of data science and real-time analytics. He is passionate about technology and data platforms in general, especially when it comes to ensuring technology delivers true value and flexibility to those making an investment in it.
Overkill Analytics with Spark
In our quest for data science automation we have learned many lessons that I am going to share in this session.
Expect fewer slides and more demos of the data science behind the Outbrain Kaggle competition and a Twitter Sentiment Analysis application, all performed from our own notebook (called DSL Workbench), which we built for exploratory data analysis. DSL is the fluent and expressive API we created to expose data and services from our data science platform.
I will compare multiple approaches for feature engineering and reduction, as well as full feature space training, employing OKA (OverKill Analytics) techniques: where spark.ml/spark.mllib could not perform on high-dimensional sparse feature spaces, we employed Spark for distributing scikit-learn, VW, TensorFlow and R packages and produced ensemble models and prediction tables that still yield highly accurate predictions. These models are then used for predicting sentiment from tweets in real time.
I will cover and show concrete examples for composite and progressive modeling, high-dimensional and sparse feature engineering, and the primitives we built for handling sparse data beyond the support in Spark or scipy.
While I’ll focus on data science at scale, I will also touch on infrastructure aspects, with tips and tricks we learned with the underlying technology stack: Scala, Python, Spark, HDFS, Cassandra, Elasticsearch, ZooKeeper, VW, etc.
Claudiu Barbura is VP of Engineering, Analytics Platform at ubix.ai, where he leads the development of the data and advanced analytics services that enable AutoCurious to automate and scale data science for mass insight consumption. He was formerly at Atigeo, where he architected the xPatterns big data platform.
Intel Tech Talk: Distributed Deep Learning at Scale on Apache Spark with BigDL
Intel recently released BigDL, an open source distributed deep learning framework for Apache Spark (https://github.com/intel-analytics/BigDL). It brings native support for deep learning functionality to Spark, provides orders of magnitude speedup over other out-of-the-box open source DL frameworks (e.g., Caffe/Torch/TensorFlow) and efficiently scales out deep learning workloads based on Spark architecture. In addition, it allows data scientists to perform distributed deep learning analysis on big data using familiar tools such as Python and notebooks.
In this talk we will give a brief introduction to BigDL, give practical examples of how Big Data users and data scientists can leverage BigDL for their deep learning analysis on large amounts of data in a distributed fashion, and provide BigDL performance data.
Through the use of traditional deep learning examples (image recognition, object detection, NLP), we will show how an existing Spark/Hadoop Big Data cluster can be used as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional machine learning, and deep learning workloads.
Ding Ding is a software engineer on Intel’s Big Data Technology Team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing on large-scale analytical applications and Spark infrastructure.
Sergey Ermolin is a Silicon Valley veteran with a passion for machine learning and artificial intelligence. His interest in neural networks goes back to 1996, when he used them to predict the aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard at its Santa Clara campus. Sergey is currently a member of the Big Data Technologies team at Intel in Santa Clara, working on Apache Spark projects. Sergey holds an MSEE from Stanford and a BS in Physics and Mechanical Engineering from Cal State University, Sacramento.