Data Tamer: A Next-generation Data Curation System
Data curation is the act of discovering one or more data sources of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite. There has been much research on the individual components of curation (especially data integration and deduplication). However, there has been little work on collecting all of the curation components into an integrated system. Data Tamer is such an integrated system.
In addition, most previous work does not scale to the problem sizes that we are finding in the field. For example, a web aggregator (Goby.com) requires the curation of 80,000 URLs, and a biotech company (Novartis) needs to curate 8,000 spreadsheets. At this scale, data curation cannot be a manual (human) effort, but must rely on machine learning approaches with human assistance when necessary. Moreover, the problems we are encountering in real enterprises do not necessarily follow the typical machine learning paradigm. Among other issues, all curation must be incremental, because new data sources are uncovered over time and must be curated as they arrive.
We are working on building Data Tamer into a complete end-to-end system. We are looking for MEng students in the following areas:
* machine learning algorithms to perform attribute identification, grouping of attributes into tables, transformation of incoming data, and deduplication
* data visualization, so that a human can examine a data source at will and specify manual transformations as necessary
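
To make the first area more concrete, here is a minimal sketch of incremental attribute identification with a human in the loop. It is an illustration only and not Data Tamer's actual method: plain string similarity from Python's standard `difflib` stands in for a learned matcher, and the function name `integrate_source` and the two thresholds are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_ACCEPT = 0.85   # at or above this score, map automatically
NEEDS_REVIEW = 0.60  # between the two thresholds, ask a human


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def integrate_source(global_schema: list[str], source_attrs: list[str]):
    """Match one new source's attribute names against the global schema.

    The global schema is updated in place, so each later source is
    matched against everything learned from earlier sources.
    """
    mapping = {}       # source attribute -> global attribute
    review_queue = []  # (source attribute, best candidate, score)
    for attr in source_attrs:
        # Find the most similar attribute already in the global schema.
        best, score = None, 0.0
        for candidate in global_schema:
            s = similarity(attr, candidate)
            if s > score:
                best, score = candidate, s
        if score >= AUTO_ACCEPT:
            mapping[attr] = best                      # confident match
        elif score >= NEEDS_REVIEW:
            review_queue.append((attr, best, score))  # defer to a human
        else:
            global_schema.append(attr)                # treat as new attribute
            mapping[attr] = attr
    return mapping, review_queue


if __name__ == "__main__":
    schema = ["name", "telephone"]
    print(integrate_source(schema, ["Name", "phone", "zip"]))
    print(schema)
    # A later source is matched against the schema as it now stands.
    print(integrate_source(schema, ["ZIP"]))
```

In this sketch, only the uncertain middle band reaches a person, which is one simple way to keep human effort small as the number of sources grows. A real system would compare attribute values as well as names and would feed the human decisions back into the matcher.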