Tristan Naumann is a Ph.D. candidate in Electrical Engineering and Computer Science at MIT working with Professor Peter Szolovits in CSAIL's Clinical Decision Making group. His research focuses on exploring relationships in complex, unstructured healthcare data using natural language processing and unsupervised learning techniques. He has organized workshops and datathon events that bring together participants from diverse backgrounds to address biomedical and clinical questions in a reliable and reproducible manner.

CliNER: Clinical Concept Extraction

Clinical concept extraction (CCE) of named entities, such as problems, tests, and treatments, aids in understanding clinical notes and provides a foundation for many downstream clinical decision-making tasks. Historically, this task has been posed as a standard named entity recognition (NER) sequence-tagging problem and solved with feature-based methods built on hand-engineered domain knowledge. Recent advances, however, have demonstrated the efficacy of LSTM-based models for NER tasks, including CCE. This work presents CliNER 2.0, a simple-to-install, open-source tool for extracting concepts from clinical text. CliNER 2.0 uses a word- and character-level LSTM model and achieves state-of-the-art performance. For ease of use, the tool also includes pre-trained models available for public use.
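To illustrate the sequence-tagging framing, the sketch below decodes BIO-style tags into concept spans. The tag names (PROBLEM, TEST) mirror the concept types mentioned above, but the decoder is a generic BIO-span reader written for illustration, not CliNER's actual implementation.

```python
# Sketch: clinical concept extraction posed as BIO sequence tagging.
# A tagger labels each token; this helper turns the parallel token/tag
# lists back into (concept_type, text) spans.

def decode_bio(tokens, tags):
    """Convert parallel token/BIO-tag lists into (concept_type, text) spans."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # begin a new concept span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continue the open span
            current.append(token)
        else:                                   # "O" tag: close any open span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Patient", "denies", "chest", "pain", ";", "ordered", "chest", "x-ray"]
tags   = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-TEST", "I-TEST"]
print(decode_bio(tokens, tags))
# [('PROBLEM', 'chest pain'), ('TEST', 'chest x-ray')]
```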


Synthetically-Identified Clinical Notes

Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as de-identification. Building tools that perform de-identification typically requires the very same data that is private, creating a chicken-and-egg problem. In this work, we generate "fake" clinical notes in which the removed PHI is replaced with realistic surrogate values (e.g., "Tim Lywood" instead of "George Beveridge") that still respect reasonable distributional semantics. We evaluate models trained on this synthetic data and show that they perform as well as models trained on the sensitive PHI-bearing notes.
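The idea of swapping removed PHI for realistic surrogates can be sketched as below. The placeholder format (`[**NAME**]`, `[**HOSPITAL**]`) and the small surrogate pools are illustrative assumptions, not the actual resynthesis procedure used in this work.

```python
# Sketch: replacing de-identification placeholders with realistic surrogates.
# Repeated mentions of the same PHI type within one note reuse the same
# surrogate, keeping the note internally consistent.
import random
import re

# Hypothetical surrogate pools; a real system would draw from much larger,
# distribution-matched name lists.
SURROGATES = {
    "NAME": ["Tim Lywood", "Ana Reyes", "Marcus Bell"],
    "HOSPITAL": ["St. Mary Medical Center", "Lakeside General"],
}

def resynthesize(note, rng):
    """Swap each [**TYPE**] placeholder for a randomly drawn surrogate."""
    chosen = {}  # per-note cache so repeated mentions stay consistent
    def repl(match):
        phi_type = match.group(1)
        if phi_type not in chosen:
            chosen[phi_type] = rng.choice(SURROGATES[phi_type])
        return chosen[phi_type]
    return re.sub(r"\[\*\*(\w+)\*\*\]", repl, note)

note = "[**NAME**] was admitted to [**HOSPITAL**]. [**NAME**] denies chest pain."
print(resynthesize(note, random.Random(0)))
```

Models can then be trained on the resynthesized notes without ever exposing the original PHI.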
