How better machine learning can transform data science

How better machine learning can transform data science
Bookmark and Share

When Kalyan Veeramachaneni joined the Any Scale Learning For All (ALFA) group at MIT’s CSAIL as a postdoc in 2010, he worked on large-scale machine-learning platforms that enable the construction of models from huge data sets. “The question then was how to decompose a learning algorithm and data into pieces, so each piece could be locally loaded into different machines and several models could be learnt independently,” says Veeramachaneni, currently a research scientist at ALFA.

“We then had to decompose the learning algorithm so we could parallelize the compute on each node,” says Veeramachaneni. “In this way, the data on each node could be learned by the system, and then we could combine all the solutions and models that had been independently learned.”

By 2013, once ALFA had built multiple platforms to accomplish these goals, the team started on a new problem: the growing bottleneck caused by the process of translating the raw data into the formats required by most machine-learning systems.

“Machine-learning systems usually require a covariates table in a column-wise format, as well as a response variable that we try to predict,” says Veeramachaneni. “The process to get these from raw data involves curation, syncing and linking of data, and even generating ideas for variables that we can then operationalize and form.”

Much of Veeramachaneni’s recent research has focused on how to automate this lengthy data prep process. “Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say ‘Great, but we’re still stuck with the problem of getting the raw data to a place where we can use all these tools,’” Veeramachaneni says.

Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, “so it won’t take up too much of their time,” he says. “Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?”

Read more at MIT News: