Session Details

An Introduction to Using Apache Spark for Machine Learning

Date: 9:45 AM Saturday


Apache Spark is a fast, distributed processing engine written in Scala, with additional APIs for Java and Python.

The core data structure of Spark is the Resilient Distributed Dataset (RDD), which provides fast, flexible in-memory processing while still supporting massive datasets and fault tolerance by running on top of Hadoop/HDFS (and, more recently, additional scalable backends, notably Cassandra).
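The RDD API deliberately mirrors Scala's collection operations, so its flavor can be sketched on a plain local collection. This is an illustrative analogy only, not Spark itself: on a real RDD the same chain of transformations runs distributed across a cluster, and the `groupBy`/`sum` pair below stands in for Spark's `reduceByKey`.

```scala
// Word count in the functional style of the RDD API, sketched on a plain
// Scala collection. Each step has a direct RDD counterpart.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))      // analogous to RDD.flatMap
      .map(w => (w.toLowerCase, 1))  // analogous to RDD.map to key-value pairs
      .groupBy(_._1)                 // stands in for RDD.reduceByKey
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("Spark is fast", "Spark is distributed"))
    println(counts("spark")) // 2
  }
}
```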

Spark includes several modules on top of its Core: this talk will focus on Spark MLlib, a machine learning library built on Spark Core. MLlib leverages Spark's memory-resident capabilities to enable fast implementations of iterative algorithms in ways that are not possible with traditional Map/Reduce on top of Hadoop.
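As a toy illustration of why iterative algorithms benefit from memory-resident data, here is batch gradient descent fitting y ≈ w·x in plain Scala (a hypothetical sketch, not MLlib code). The key observation is that every iteration re-scans the full dataset: cheap when the data stays in memory, but a full disk-based Map/Reduce job per pass on classic Hadoop, which is exactly the overhead Spark avoids.

```scala
// Batch gradient descent for a one-parameter linear model y ≈ w * x,
// minimizing mean squared error. Each iteration reads the whole dataset.
object GradientDescentSketch {
  def fit(data: Seq[(Double, Double)], iterations: Int, lr: Double): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) {
      // gradient of mean squared error with respect to w
      val grad = data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.size
      w -= lr * grad
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) // exactly y = 2x
    println(fit(data, iterations = 200, lr = 0.05))    // converges to ≈ 2.0
  }
}
```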

This talk will briefly describe the Spark Core and MLlib architectures and then shift focus to Scala-language implementations of examples in the following areas: classification, regression, and clustering (and, if time allows, collaborative filtering and feature selection). Additionally, SparkSQL will be briefly discussed and then used to examine some of the results.
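For a flavor of the clustering material, here is a minimal one-dimensional k-means loop in plain Scala (MLlib provides a distributed KMeans; this hedged sketch only shows the iterative assign/update structure such algorithms share).

```scala
// Minimal 1-D k-means: repeatedly assign points to their nearest center,
// then move each center to the mean of its assigned points.
object KMeansSketch {
  def fit(points: Seq[Double], initial: Seq[Double], iterations: Int): Seq[Double] = {
    var centers = initial
    for (_ <- 1 to iterations) {
      // assignment step: group each point under its nearest center
      val clusters = points.groupBy(p => centers.minBy(c => math.abs(p - c)))
      // update step: move each center to the mean of its cluster
      // (a center with no assigned points is left where it is)
      centers = centers.map(c => clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(1.0, 1.2, 0.8, 10.0, 10.4, 9.6)
    println(fit(points, initial = Seq(0.0, 5.0), iterations = 10).sorted) // List(1.0, 10.0)
  }
}
```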

This talk will assume a working knowledge of Scala's functional programming methods and constructs.

The Speaker


Stephen Boesch

I am a developer focusing on scalable applications for data pipelines and machine learning on Hadoop and Spark infrastructures. My background is in Java/Oracle/ETL from 1996 until 2011, when I shifted my focus to Hadoop, Spark, and Scala. I have worked at a mix of familiar large Internet/systems companies and startups.