Gael Varoquaux: DirtyData: statistical learning on non-curated databases
Gael Varoquaux, Inria
2022-05-26, 11:00 to 12:00 (America/New_York)
Abstract: According to industry surveys, the number one hassle for data scientists is cleaning data before analyzing it. On two specific aspects of dirty data, I will show how machine learning can readily ingest data without curation. Specifically, I will cover prediction from missing values and from non-normalized entries. The normalization problem can be tackled with character-level modeling to recover latent categories. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning, and leads to models that make no assumptions about the missingness mechanism, in particular no Missing At Random requirement.

References:
* Cerda, Patricio, Gaël Varoquaux, and Balázs Kégl. "Similarity encoding for learning with dirty categorical variables." Machine Learning 107.8 (2018): 1477-1494. https://link.springer.com/article/10.1007/s10994-018-5724-2
* Cerda, Patricio, and Gaël Varoquaux. "Encoding high-cardinality string categorical variables." IEEE Transactions on Knowledge and Data Engineering (2020). https://ieeexplore.ieee.org/abstract/document/9086128
* Cvetkov-Iliev, Alexis, Alexandre Allauzen, and Gaël Varoquaux. "Analytics on Non-Normalized Data Sources: more Learning, rather than more Cleaning." IEEE Access (2022). https://ieeexplore.ieee.org/abstract/document/9758752
* Le Morvan, Marine, Julie Josse, Thomas Moreau, Erwan Scornet, and Gaël Varoquaux. "NeuMiss networks: differentiable programming for supervised learning with missing values." Advances in Neural Information Processing Systems 33 (2020): 5980-5990. https://proceedings.neurips.cc/paper/2020/hash/42ae1544956fbe6e09242e6cd752444c-Abstract.html
* Le Morvan, Marine, Julie Josse, Erwan Scornet, and Gaël Varoquaux. "What's a good imputation to predict with missing values?" Advances in Neural Information Processing Systems 34 (2021). https://proceedings.neurips.cc/paper/2021/hash/5fe8fdc79ce292c39c5f209d734b7206-Abstract.html

Bio: Gaël Varoquaux is a research director working on data science and health at Inria (the French national research institute for computer science). His research focuses on statistical-learning tools for data science and scientific inference, with an eye on applications in health and social science. He develops tools to make machine learning easier, with statistical models suited for real-life, uncurated data, and software for data science. For example, since 2008, he has been exploring data-intensive approaches to understand brain function and mental health. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has a PhD in quantum physics and is a graduate of École Normale Supérieure, Paris.
32G-889 (Hewlett)