Google, Intel and Microsoft team up with CSAIL on new data-driven initiative

MIT professor Sam Madden speaks at DSAIL kick-off

Recent years have seen an explosion in the creation of machine learning models for everything from self-driving cars to social media feeds. Despite the success of these models at perception and simple prediction, they have yet to have a larger impact on traditional enterprise computing and data processing applications.

Applying machine learning inside the enterprise is the ambition behind a new initiative launched yesterday at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) with industry collaborators Google, Intel and Microsoft.

Dubbed the “Data Systems and Artificial Intelligence Lab” (DSAIL), the effort will focus on investigating how machine learning can be used to improve the performance of data-processing systems.
 
For example, enterprises often gather huge amounts of data but have trouble analyzing all of it in a timely manner. Using machine learning to improve data systems is still in its infancy; where it has been applied, it has mostly been used to tune system parameters such as cache size.

“Our ambitions are much bigger,” says MIT professor Tim Kraska, one of the co-directors of the initiative. “We want to investigate how machine learning can improve the system from the ground up by deeply embedding it into the core components of a data processing system, such as scheduling and indexing. This requires a fundamental rethinking of the entire data system architecture.”
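
To make the idea of embedding a model in an index concrete, here is a minimal sketch in Python. It is an illustration only, not DSAIL's design: the class name `LinearLearnedIndex` and its methods are hypothetical, and the single straight-line model is an assumption chosen for brevity. The model predicts where a key sits in a sorted array, and a short binary search inside the model's worst-case error window corrects the guess.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Hypothetical toy index: one linear model over a sorted key array."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Least-squares fit of position as a linear function of key.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case prediction error over the stored keys; this bounds
        # how far the search has to look around any prediction.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        # Predicted position, clamped to a valid array index.
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only inside the error window around the guess.
        i = bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([10, 20, 30, 45, 70, 100])
position = index.lookup(45)   # 3, the slot of 45 in the sorted keys
missing = index.lookup(55)    # None, because 55 was never stored
```

The closer the keys follow a straight line, the smaller the error window and the less searching remains. A real system would need a richer model and a way to handle inserts, which this sketch ignores.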

DSAIL further aims to develop new techniques that make it easier for enterprises to not just use machine learning but also build and manage machine learning models. Building a model involves more than just running the final algorithm: it is an iterative process of data integration, cleaning, data transformation, feature selection, model building and visualization.

While individual solutions exist for many of these problems, enterprises often struggle to combine those techniques efficiently. For example, a decision about how to clean the data can have profound implications for the final model's quality. Even worse, given the huge number of technical choices made across these steps, it is often very hard to determine after the fact which models and data preparation steps work best in different situations, or even to keep track of what has already been tried.
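
A small Python sketch shows both problems at once: a cleaning choice that changes model quality, and a log that remembers what was tried. Everything here is hypothetical and for illustration only. The names `Run`, `RunLog`, `CLEANERS` and `MODELS` are invented, the data is a made-up toy set, and the "models" are just the mean and the median.

```python
from dataclasses import dataclass, field
from statistics import mean, median


@dataclass
class Run:
    cleaning: str   # name of the cleaning step that was applied
    model: str      # name of the model that was fitted
    error: float    # mean absolute error on the holdout values


@dataclass
class RunLog:
    """Hypothetical record of every combination tried so far."""
    runs: list = field(default_factory=list)

    def record(self, cleaning, model, error):
        self.runs.append(Run(cleaning, model, error))

    def best(self):
        # The run with the lowest holdout error.
        return min(self.runs, key=lambda r: r.error)


# Two illustrative cleaning choices (the outlier cutoff of 50 is arbitrary).
CLEANERS = {
    "drop_missing": lambda xs: [x for x in xs if x is not None],
    "drop_missing_and_outliers":
        lambda xs: [x for x in xs if x is not None and x < 50],
}
# Two trivial "models": each predicts one constant value.
MODELS = {"mean": mean, "median": median}


def mean_abs_error(prediction, targets):
    return mean(abs(prediction - t) for t in targets)


train = [1.0, 2.0, None, 3.0, 100.0]   # made-up toy data with a gap and an outlier
holdout = [2.0, 2.5, 3.0]              # made-up toy holdout values

log = RunLog()
for cleaning_name, clean in CLEANERS.items():
    for model_name, fit in MODELS.items():
        prediction = fit(clean(train))
        log.record(cleaning_name, model_name,
                   mean_abs_error(prediction, holdout))

best = log.best()
```

On this toy data the mean is badly skewed when the outlier is kept, while the median is barely affected, so the same cleaning decision helps one model and makes little difference to the other. With the log in place, that comparison can be read back later rather than reconstructed from memory.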

“We are working to define and create a completely new kind of data stack for artificial intelligence,” says MIT professor Sam Madden (pictured above), another co-director of the initiative alongside professor Mike Stonebraker. “You could imagine a first layer that lets you access, integrate and clean the data for building machine learning models, a second workflow layer that helps you manage the models, and then a presentation layer that helps you operate and visualize the models.”

Researchers met today for an all-day summit at the Hyatt Regency in Kendall Square to form the agenda of the initiative. For their part, Google, Intel and Microsoft will work on new lines of research with small groups of CSAIL faculty, including professors John Guttag, Song Han, Stefanie Jegelka and David Sontag.

Multiple groups at CSAIL have already been developing key systems in this space. Madden’s data-discovery tool Data Civilizer, for example, allows organizations to discover related datasets across thousands of distinct business databases and files. Kraska’s work on Northstar helps inexperienced data scientists quickly build high-quality models.

DSAIL builds on the lab’s existing initiatives that focus on financial technology, cybersecurity and systems approaches to machine learning. It represents an expansion of Intel’s previous five-year collaboration with CSAIL, the Intel Science and Technology Center for Big Data (ISTC).