San Francisco’s Spark Meetup group is massive, and whilst going along to some of these events can be a little intimidating, especially for the more socially awkward attendees, the group really does cater to people of all levels.
The first gathering of 2015 kicked off with a review of Spark by co-organiser Patrick Wendell, who briefly ran over Spark’s role as a fast, general-purpose processing engine for large, distributed data sets (go to 6:19).
Wendell currently works for Databricks, a company founded by the original creators of Spark to help clients with cloud-based processing of big data using Spark, whilst fully maintaining an open source model in order to tap as much value from big data as possible.
Spark grew extremely quickly in 2014: more than 3,500 patches were merged, as opposed to just over 1,000 in 2013, and contributor numbers jumped from 137 to 417 over the same period.
The talk drills into the developments of 2014 (go to 10:55) that made Spark production ready and kept existing users upgrading from previous versions, building trust in the tool. These include stabilising the API, improving and expanding the UI, and handling data sets that are much larger than memory.
Wendell explains some of the features and benefits which the Spark team is looking at for 2015 (go to 13:08):
- SchemaRDD as a common interchange format
- Data-frame style APIs
- Easy use for everyone from developers to data scientists
- Extensibility and pluggable APIs, such as:
  - Data source API (SQL)
  - Pipelines API (MLlib)
  - Receiver API (Streaming)
- Spark Packages