The People Driving Large Scale Machine Learning with Apache Spark

Home / Data Science & Machine Learning / The People Driving Large Scale Machine Learning with Apache Spark

April 26, 2015

Data Science & Machine Learning, Deep Learning, Home

The popular benchmark banded around about Apache Spark‘s machine learning library MLlib is that it is ten times faster than Hadoop based Apache Mahout as an environment for building scalable machine learning applications and a hundred times faster than Hadoop MapReduce.

It’s scalabilty is also proven in production to over 8,000 nodes with the ability to cache datasets in memory for interactive data analysis and repeated querying whilst providing support for structured and relational query processing. Installation is simple too.. install Spark.. it’s part of the engine.

Joseph Bradley works at Databricks, a company founded by the creators of Apache Spark and regularly talks about data technology in the context of Databricks, including MLlib, including the recent Spark Summit (East) in New York City.

One of the challenges presented by machine learning is being able to prepare data through a sequence of events before it can be used which is where machine learning pipelines filter unstructured data through pre-processing, feature extraction, model fitting and validation.

This pipeline creation is something that Databricks say is ‘often ignored in academia and it has received largely ad-hoc treatment in industry, where development tends to occur in manual one-off pipeline implementations.’

Bradley’s talk here gives an introduction to machine learning pipelines and how the current work being done by Databricks and UC Berkley aims to make scalable machine learning as simple as possible by using feature transformations and tokenizers to essentially bag words up which can be converted into feature vectors to help assemble pipelines.

Databricks are keen to get the benefits bestowed in using Mllib deep into the community and last month’s NYC Machine Learning Meetup also saw another speaker, Databricks Software Engineer and Spark Committer, Xiangrui Meng extend Bradley’s talk about the use and development of MLlib.

Meng reinforced the notion of extremely easy scaling for machine learning through the use of pipelines. The inspiration for MLlib and it’s simple APIs which allow users to put together and tune pipelines is open source Python machine learning library sci-kit learn, a project which sprouted from Google’s summer of Code.

MLlib is more tailored to commercial use, Meng says, “We try to have our own design and to fit the big data world, the first part is that we propose an interface for multi-model training. It’s very important if you can train a model with multiple datasets because usually if you have a big dataset you don’t care about the computation, most of the computation time is caused by communication.”

The future roadmap for MLlib is keeping Bradley and Meng very occupied with plans to include more algorithms under the pipeline API, attributes, feature names for categorical data and more feature transformers as well as pipeline persistence which means that a trained pipeline can be saved to disc without having to spend hours retraining. SparkR integration is also planned.

The evolution of MLlib is especially useful for organisations like Alpine DataLabs that need powerful components to help them deliver predictive analytics solutions at enterprise level.

Alpine Machine Learning Lead DB Tsai gave a talk recently at IoTaConf explaining how machine learning constitutes a large part of Alpine’s efforts in helping enterprise analyse unstructured data without investing risk and capital in the assembly of specialist teams. Given the scale at which Alpine operate, MLlib would be able to stretch it’s legs in production, perhaps even to over 8,000 nodes at which it has reportedly been proven.

The use of MLlib at Alpine was an obvious one as Tsai explained, “We started to work with Hadoop but we found there are lots of limitations with it.. lots of machine learning algorithms are iterative so how can Hadoop work well in this situation because there are lots of i/o overheads inside Hadoop when you read the data from HDFS. So we found Spark is the best platform for us to do machine learning because it’s a memory caching mechanism that allows us to cache the data when we see the data for the first time and the second time when we iterate through the data we just read it from memory. That’s why we put a lot of resources in Spark and work close with the community to contribute code to the Spark project.”

Tsai’s presentation walks through the Spark architecture including transformation and the fundamentals idea of how Spark works and how to cache inside Spark with a couple of example algorithms to show how to use MLlib.

2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Things Conference from DB Tsai

apache Deep Learning Hadoop Machine Learning Meetup spark

[email protected]