The cloud has become the most common way to deliver commercial software, but building cloud products differs substantially from building traditional software, and these differences have received little attention in research. I will discuss some of the resulting challenges based on my experience at Databricks, a startup that provides a data analytics platform as a service on AWS and Azure. Databricks manages millions of VMs per day to run data engineering and machine learning workloads for thousands of customers, using Apache Spark, TensorFlow, Python, and other software. Two main challenges arise in this setting: (1) building a scalable, reliable control plane that can manage millions of VMs, and (2) adapting the data processing software itself (e.g., Apache Spark) to an elastic cloud environment, for instance by autoscaling instead of assuming static clusters. These challenges are especially significant for data analytics workloads, whose users constantly push boundaries of scale (number of VMs used, data size, metadata size, number of concurrent users, etc.). I will describe some of the common challenges our services have faced and some of the main ways Databricks has extended open source analytics software for the cloud environment (e.g., implementing autoscaling for Apache Spark and designing an ACID storage layer on top of S3 in the Delta Lake project).
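To make the autoscaling idea concrete, here is a toy load-based policy sketch. It is purely illustrative and not Databricks' actual algorithm; all function and parameter names are assumptions. The idea is to size the cluster to the current task backlog, bounded by user-configured limits:

```python
def target_workers(pending_tasks, slots_per_worker, min_workers, max_workers):
    """Toy autoscaling policy (illustrative only, not Databricks' algorithm).

    Returns the desired number of workers: enough to give every pending
    task a slot, clamped to the [min_workers, max_workers] range that a
    user might configure for an elastic cluster.
    """
    # Ceiling division: workers needed so pending_tasks <= workers * slots
    needed = -(-pending_tasks // slots_per_worker)
    return max(min_workers, min(needed, max_workers))

# Example: 100 pending tasks, 8 task slots per worker, cluster bounded to 2-32
print(target_workers(100, 8, 2, 32))  # prints 13
print(target_workers(0, 8, 2, 32))    # prints 2 (scale down to the floor)
```

A real implementation must also decide *when* to remove workers, e.g., waiting out a grace period and avoiding nodes that hold cached or shuffle data, which is part of what makes autoscaling Spark nontrivial.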
Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009 and has worked broadly on other cluster computing and analytics software, including Apache Mesos, Apache Hadoop, and MLflow. Today, Matei is a PI in the Stanford DAWN Lab, doing research on infrastructure for machine learning, and continues to work on data analytics systems at Databricks. His research has been recognized with the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).