[NLP Seminar] The OLMo Cookbook: Open Recipes for Language Model Data Curation

Speaker

Allen Institute for AI

Host

NLP Meetings Seminar Series

Details of the pretraining corpora behind today's best-performing language models are seldom disclosed: commercial models rarely describe their data, and even open models are often released without the accompanying training data or the recipes needed to reproduce them. As a result, it is difficult to conduct and advance scientific research on language modeling, such as understanding how training data affects model capabilities, risks, and limitations. In this talk, I'll present how we approach data curation research for OLMo, our project to develop and share fully open language models. Reflecting on our journey from OLMo 1 to our latest release, OLMo 2, I'll explore how data curation practices have matured across our work and the broader open data research ecosystem. Finally, I'll examine key challenges and opportunities for open data amid a rapidly changing language model landscape.