Cliff Click, co-founder of the open source distributed machine learning platform H2O, was one of the speakers at PyData in Dallas a couple of weeks back, where he gave a demo of H2O using IPython Notebook.
H2O uses a hand-picked suite of open source tools and libraries to make it easier to solve data challenges using machine learning. One of the features that makes H2O appealing for business is its intuitive web-based user interface, which makes it easy to set up and get started with. H2O's rapid in-memory parallel processing also makes it scalable and allows iteration and model development in real time. Click explains H2O's features:
We are building a practical machine learning tool, and what does practical mean? It means different things to different people. Fast and interactive means in memory, so we're memory speed, not disk speed; order of magnitude is seconds to low minutes to build models on hundreds of gigabytes to a terabyte.
So big data: the reason for big data is that you don't have to go through the pain of so much sampling; you can use everything you have got if your cluster is big enough. And we are actually very parsimonious with memory, so if you have a CSV on disk that's a terabyte, it will load into a terabyte of RAM just like that.
That, of course, drives the requirement that you have to be distributed, so this is distributed computing now; it's clustered stuff.
It's open source; you can see and know how it works, see what the algorithm is doing, see why it's there. If there is a flaw or some new feature you want to add, you can look at the code and see what's going on.
We have a very nice API for coding distributed computing at scale. It's based on a map/reduce-style paradigm, and I'll go as far as you want into how that works when we get there (I am going to demo stuff first), but it has nothing to do with Hadoop's MapReduce; it's kind of like the Python version, if you will. It's a very theoretically clean map and reduce, and it leads to a very simple coding style, so you can focus on writing the math, the data munging, and the code you want to write, and not on the issues of how to wire up a cluster or how to define Open MPI message passing. None of that happens; you just write straightforward, simple inline code, in Java at this point, and it will run distributed and scale out.
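The "theoretically clean" map and reduce Click describes can be sketched in a few lines of Python. This is an illustration of the paradigm only, not H2O's actual Java API: each node maps over its local chunk of the data, and the partial results are combined with an associative, commutative reduce, so the cluster can merge them in any order.

```python
from functools import reduce

def map_chunk(chunk):
    """Per-chunk work: sum and count, the ingredients of a distributed mean."""
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    """Combine two partial results; associativity means the cluster can
    merge them pairwise in any order (e.g., in a tree across nodes)."""
    return (a[0] + b[0], a[1] + b[1])

# Simulate a dataset split into chunks across four "nodes":
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

total, count = reduce(reduce_pair, map(map_chunk, chunks))
mean = total / count  # 5.5
```

Because the map sees only its own chunk and the reduce sees only two partial results, the same user code runs unchanged whether the data sits on one laptop or across a cluster.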
It's very portable, being Java on the inside for the implementation, and on the outside layer there is a REST/JSON API; anything that can do REST and JSON can talk to the system, and that's how we're going to drive it from Python.
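Driving a REST/JSON service from Python needs nothing beyond the standard library. The sketch below shows the shape of such a client; the endpoint and parameter names are hypothetical placeholders for illustration, not H2O's documented routes, and the port is an assumption.

```python
import json
from urllib.parse import urlencode

def build_request(host, port, endpoint, params):
    """Compose the URL an HTTP client would GET against the cluster node."""
    return f"http://{host}:{port}/{endpoint}?{urlencode(params)}"

# Hypothetical endpoint and parameter names, for illustration only:
url = build_request("localhost", 54321, "ImportFiles", {"path": "data.csv"})

# Any HTTP client (urllib, requests, curl) can issue the call; the server
# replies with JSON, which Python parses natively:
sample_response = '{"status": "done", "rows": 1000000}'
result = json.loads(sample_response)
```

Because the wire format is plain HTTP plus JSON, the same calls work from a shell script, a browser, or an IPython Notebook, which is how the demo drives the cluster from Python.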
Because it's a cluster, you have to get a cluster from somewhere. I'm going to run a cluster of one on my laptop, but you can run it in your private datacenter, in EC2, or in any cloud environment. Four machines in this room with multicast enabled could cluster up in seconds.