As data becomes more massive and complex, it still needs to be accessible to be of much use. Building large-scale software systems for data management is one way to efficiently transform data into a usable form. This work ranges from making conventional software systems like relational databases more efficient to building new kinds of systems, such as query processing over video, that handle data existing software cannot. By using machine learning and other sophisticated tools to improve software efficiency, it is possible to make all types of data accessible to everyone.
Samuel Madden, a Professor of Electrical Engineering and Computer Science and a principal investigator at MIT CSAIL, leads the BigData@CSAIL initiative and the Data Systems Group. He investigates systems and algorithms for data that is high-rate, massive, or complex. The goal of his research is to help people access different types of data by building interfaces that let them interact with or query that data.
As a prototypical example, a relational database system runs SQL queries over tables of data to look up facts. In business, such a query might compute the earnings over the last quarter or the number of employees in a company. Prof. Madden and his research group take the simple abstraction provided by these database systems and apply it to much more complicated types of data.
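This kind of fact lookup can be sketched with Python's built-in sqlite3 module; the table, column names, and data below are made-up illustrations, not any real company's schema:

```python
import sqlite3

# Build a tiny in-memory relational database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Eng", 120000.0), ("Grace", "Eng", 130000.0), ("Alan", "Sales", 90000.0)],
)

# A fact lookup of the kind described: how many employees does the company have?
(head_count,) = conn.execute("SELECT COUNT(*) FROM employees").fetchone()
print(head_count)  # 3
```

The point of the abstraction is that the user states *what* fact they want (a count, a sum over a quarter) and the system decides *how* to compute it.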
One such project involving more complicated data is a system that takes GPS traces captured by people's smartphones, together with satellite imagery, and generates digital maps as output. This system is especially important for less-developed regions of the world, where maps may not be high-quality or up-to-date. Combining satellite imagery with GPS traces yields more accurate digital maps because together the data captures the interconnectivity of roads at complicated interchanges, which is difficult to determine from either source alone. Projects like this one have required the researchers to apply a range of techniques from the data-processing community alongside sophisticated technologies such as machine learning.
Machine learning is especially useful for learning the critical components of complex systems. Inside a database system are various data structures that are critical to its operation. What Prof. Madden has observed in his research at CSAIL is that these internal components, traditionally hand-engineered algorithms and data structures, can actually be synthesized or tuned through machine learning, making the whole system much more efficient.
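One well-known instance of this idea is the "learned index": instead of a hand-built search structure, a simple model predicts where a key lives in a sorted array, and a bounded local search corrects the prediction. The sketch below is a minimal illustration of that concept under assumed synthetic data, not a description of any production system:

```python
# Sorted synthetic keys whose positions are roughly a linear function of value.
keys = [2 * i + (i % 3) for i in range(1000)]
n = len(keys)

# "Learn" position ≈ a * key + b by ordinary least squares over (key, index).
mean_k = sum(keys) / n
mean_i = (n - 1) / 2
cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys))
var = sum((k - mean_k) ** 2 for k in keys)
a = cov / var
b = mean_i - a * mean_k

def lookup(key, max_err=8):
    # Predict a position with the model, then scan a small window around it.
    guess = int(a * key + b)
    lo, hi = max(0, guess - max_err), min(n, guess + max_err + 1)
    for i in range(lo, hi):
        if keys[i] == key:
            return i
    return None
```

Because the model captures the data's distribution, the search window stays tiny; a traditional B-tree would pay for pointer-chasing regardless of how regular the data is.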
Innovative techniques for these types of complex data-management systems are not limited to software. In his PhD work at the University of California, Berkeley, Prof. Madden built a data-management system for networks of sensor devices. The system, called TinyDB, took a collection of tiny, wirelessly connected sensors and treated them as though they were a database system. Instead of running a query over a database asking for the salaries of employees at a company, TinyDB let you run a query over a collection of sensors to ask about a physical property, such as how temperature varies throughout a building.
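The sensors-as-a-table idea can be mimicked in a few lines of Python. The node readings below are simulated stand-ins, and the grouped average is evaluated centrally here, whereas TinyDB pushed filtering and aggregation into the sensor network itself:

```python
# One "row" per sensor node: (node_id, floor, temp_celsius). Values are made up.
readings = [
    (1, 1, 21.5), (2, 1, 22.0), (3, 2, 25.5), (4, 2, 26.0), (5, 3, 23.0),
]

# Evaluate the equivalent of "SELECT floor, AVG(temp) ... GROUP BY floor":
by_floor = {}
for node_id, floor, temp in readings:
    by_floor.setdefault(floor, []).append(temp)
avg_by_floor = {floor: sum(ts) / len(ts) for floor, ts in by_floor.items()}
print(avg_by_floor)  # {1: 21.75, 2: 25.75, 3: 23.0}
```

Treating the sensors as one virtual table means the user never writes per-node networking code; the declarative query says what to measure, not how to collect it.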
When Prof. Madden first came to CSAIL, he worked on column-oriented databases, which focus on how data is stored and represented inside a database system. Typically, database systems lay out data in tables. A table could represent, for example, all of the employees at an organization, with one row per employee and fields such as hiring date and salary. Prof. Madden found that if the data was laid out in computer memory column by column rather than row by row (e.g., all the names stored together and all the salaries stored together), queries ran more efficiently over larger amounts of data.
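The row-versus-column distinction can be sketched as follows; the employee values are invented for illustration:

```python
# Row layout: one complete tuple per employee.
rows = [
    ("Ada", "2019-03-01", 120000.0),
    ("Grace", "2020-07-15", 130000.0),
    ("Alan", "2021-01-10", 90000.0),
]

# Column layout: each attribute stored contiguously.
names, hire_dates, salaries = map(list, zip(*rows))

# An analytic query that touches only one attribute, e.g. total salary:
total_row_store = sum(r[2] for r in rows)  # must read through every full row
total_col_store = sum(salaries)            # scans only the salary column
print(total_col_store)  # 340000.0
```

Both sums are equal, but on disk the column layout lets the system read just the bytes the query needs, which is where the efficiency gain on large tables comes from.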
Prof. Madden has also made querying more efficient in a recent project on running queries over large archives of video. He developed a machine-learning algorithm that avoids the computationally intensive step of examining every frame, instead focusing on the regions of the video that are likely to satisfy the user's predicate.
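The general pattern behind such systems can be sketched as a filter cascade: a cheap check rules out most frames, and an expensive model runs only on the survivors. Both "models" and the frame data below are toy stand-ins, not Prof. Madden's actual algorithm:

```python
def cheap_filter(frame):
    # Stand-in for an inexpensive proxy, e.g. a difference detector:
    # here, does the frame contain any nonzero pixel at all?
    return any(frame)

def expensive_model(frame):
    # Stand-in for a heavyweight detector evaluating the user's predicate.
    return sum(frame) >= 3

frames = [[0, 0, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [2, 2, 0]]

expensive_calls = 0
matches = []
for i, frame in enumerate(frames):
    if not cheap_filter(frame):   # skip frames that cannot possibly match
        continue
    expensive_calls += 1
    if expensive_model(frame):
        matches.append(i)

print(matches, expensive_calls)  # [2, 4] 3
```

Here the costly model runs on 3 of 5 frames instead of all of them; over millions of frames of archived video, skipping the vast majority is what makes such queries tractable.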
Through this ongoing data-management research, Prof. Madden continues to find new ways to help people query and discover relevant data in faster, more efficient, and more user-friendly ways.
Professor Madden's research is in the area of database systems, focusing on database analytics and query processing, ranging from clouds to sensors to modern high-performance server architectures. He joined the faculty in January 2004, after receiving his Ph.D. from the University of California, Berkeley in 2003.