SparkHub Apache® Spark™ Developer Resources
Below are Apache Spark developer resources, including training, publications, packages, and other material.
Massive Online Courses
Visit the Databricks training page for a list of available courses.
Introduction to Apache Spark
Learn the fundamentals and architecture of Apache Spark, the leading cluster-computing framework among professionals.
Starts on June 15, 2016
Distributed Machine Learning with Apache Spark
Learn the underlying principles required to develop scalable machine learning pipelines and gain hands-on experience using Apache Spark.
Starts on July 6, 2016
Apache Spark Publications
Pick up a copy of Learning Spark for a comprehensive introduction to the Apache Spark ecosystem directly from the project founders.
An introduction to Apache Spark packaged as a video plus coding exercises: the essentials to get started running Spark apps.
In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark.
Perform real-time analytics using Spark in a fast, distributed, and scalable way.
Create machine learning systems that can scale to tackle even the largest data sets with ease and get real insights for your business with Apache Spark.
Reference Applications demonstrating Apache Spark - brought to you by Databricks.
Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and use GraphX interactively.
Spark in Action teaches you to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem followed by a taste of Spark's command line interface.
New Apache Spark Packages
Third-party packages that integrate with Apache Spark
This package contains the code for executing clustering validity indices in Spark. The package includes BD-Silhouette, BD-Dunn, Davies-Bouldin and WSSSE indices.
Massively Distributed Indexing of Time Series
XGBoost Spark package pre-built for linux64 environments
This package implements DFST (Distributed FastShapelet Transform). DFST is the first time series classification algorithm developed for distributed environments (Spark). This algorithm performs a shapelet transform on a data set, trains a Random Forest mod