Making Big Data Analytics Interactive and Real-Time
Speaker: Matei Zaharia, UC Berkeley
Date: Tuesday, October 30, 2012
Time: 4:00PM to 5:00PM
Refreshments: 3:45PM
Location: 32-G882 (Hewlett Room)
Host: Hari Balakrishnan, CSAIL
Contact: Sheila Marian, x3-1996, sheila@csail.mit.edu
Large-scale data processing platforms like MapReduce and Hadoop have become common tools for tackling growing data volumes in both industry and research. However, as organizations start using "big data" for more applications, the demands on distributed processing have also grown. In particular, users want to run (1) more interactive ad-hoc queries than is possible with today's batch systems, (2) more complex applications than the single-pass MapReduce model supports, and (3) real-time analytics that incorporate new data within seconds. Meeting these performance goals while preserving the scalability and fault tolerance of MapReduce is challenging.
In this talk, we present Spark, a new system that tackles these challenges by providing efficient primitives to do data-intensive computation in memory. Spark can outperform Hadoop by up to 30x in interactive queries and multi-pass machine learning and graph algorithms, while giving the same fault tolerance guarantees. The key challenge we address is how to provide fault tolerance for in-memory state efficiently; whereas previous approaches used costly replication or checkpointing, we have designed an abstraction called "resilient distributed datasets" (RDDs) that can recover data without replication, by remembering the operations needed to recompute it. We have also generalized RDDs to support large-scale stream processing through a model called "discretized streams". The resulting system, Spark Streaming, can process tens of millions of records per second on 100 nodes at sub-second latency, and significantly outperforms existing systems.
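The recovery idea behind RDDs described above (remember the operations, not a replica of the data) can be illustrated with a small toy model. This is only a sketch of the general lineage technique in plain Python, not Spark's API: the class `ToyRDD` and its methods `map`, `filter`, `collect` and `lose_cache` are hypothetical names made up for this example, and the base data is assumed to live in reliable storage.

```python
class ToyRDD:
    """Toy dataset that remembers how it was derived (its lineage).

    Hypothetical illustration only; this is not Spark's RDD class.
    """

    def __init__(self, source=None, parent=None, op=None):
        self._source = source  # base data, assumed to be in reliable storage
        self._parent = parent  # dataset this one was derived from, if any
        self._op = op          # operation that rebuilds this dataset from its parent
        self._cache = None     # in-memory copy, which may be lost at any time

    def map(self, f):
        # Record the operation; nothing is computed until collect() is called.
        return ToyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # If the in-memory copy is missing, recompute it by replaying lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._op(self._parent.collect())
        return list(self._cache)

    def lose_cache(self):
        # Simulate losing the in-memory state, e.g. when a node fails.
        self._cache = None


base = ToyRDD(source=range(10))
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_cache()        # the in-memory result is gone
after = even_squares.collect()   # rebuilt from the recorded operations
```

In this sketch no second copy of `even_squares` is ever stored; after the simulated loss, the same result is obtained by re-running the recorded `map` and `filter` steps on the parent data.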
Finally, a key benefit of RDDs and discretized streams is that they can seamlessly be combined in the same program. The ability to intermix streaming, batch and interactive queries enables rich applications that blur the line between batch and online processing, which some Spark users are already building.
Bio:
Matei Zaharia is a PhD student in the AMP Lab at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in systems, cloud computing and networking. He is also a committer on Apache Hadoop and Apache Mesos. Matei received his undergraduate degree from the University of Waterloo in Canada, and is currently supported by a Google PhD fellowship.