As part of Data Civilizer, we are designing abstractions and building tools and systems to help people with their data-related tasks, from discovering data, to cleaning it, to transforming it. The aim is to shape the data so that it is easy to analyze, for example to fit a model or fill in a report.

Organizations face a data discovery problem when their analysts spend more time finding relevant data than analyzing it. This problem has become common because: i) data is stored across multiple storage systems, from databases to data lakes; and ii) data scientists do not operate within the limits of well-defined schemas, but instead want to find data across their organization to answer increasingly complex business questions. We have built Aurum, a system that tackles data discovery problems at large, as part of the Data Civilizer project. Aurum introduces a new discovery language, SRQL, that lets users declare their intuition of what is relevant through a set of data primitives that expose the relations of the underlying data. SRQL relies on an enterprise knowledge graph (EKG) to answer queries at human-scale latencies. Aurum is scalable: it builds the EKG in linear time, despite the cost of extracting complex relationships among thousands of data sources.
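
To make the idea of a graph that exposes relations among data sources concrete, here is a minimal toy sketch in Python. It is not Aurum's implementation and not SRQL syntax: the names `jaccard`, `build_graph`, and `neighbors`, the use of Jaccard similarity, and the sample tables are all assumptions made for illustration. Note also that this sketch compares every pair of columns and is therefore quadratic, unlike the linear-time construction described above.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(columns, threshold=0.5):
    """Build a toy relationship graph (hypothetical, for illustration only).

    `columns` maps a (table, column) pair to its list of values.
    Two columns are linked when their value sets overlap enough.
    This all-pairs comparison is quadratic in the number of columns.
    """
    graph = defaultdict(set)
    for (u, u_vals), (v, v_vals) in combinations(columns.items(), 2):
        if jaccard(u_vals, v_vals) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def neighbors(graph, node):
    """A hypothetical discovery primitive: columns related to `node`."""
    return sorted(graph.get(node, set()))


# Made-up sample data standing in for columns from different sources.
columns = {
    ("employees", "dept_id"): [1, 2, 3, 4],
    ("departments", "id"): [1, 2, 3, 4, 5],
    ("orders", "order_id"): [100, 101, 102],
}

graph = build_graph(columns)
related = neighbors(graph, ("employees", "dept_id"))
```

In this sketch, an analyst who starts from `employees.dept_id` would be pointed to `departments.id` because the two columns share most of their values, while `orders.order_id` stays unconnected. Once the graph is built, answering such a question is a lookup rather than a scan of the underlying data.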

Research Areas