I am Tristan Naumann, now at Microsoft Research’s Healthcare NExT, where I work on problems related to clinical natural language processing and machine reading. My research focuses on exploring relationships in complex, unstructured healthcare data using natural language processing and unsupervised learning techniques. Previously, I completed a PhD in the Clinical Decision Making group at MIT CSAIL with Prof. Peter Szolovits, where my research focused on leveraging text representations for clinical predictive tasks and on combining structured and unstructured healthcare data. My work has appeared in KDD, AAAI, AMIA, JMIR, Science Translational Medicine, and Nature Translational Psychiatry.

While at MIT, I was an Instructor for HST.953 (Collaborative Data Science for Medicine) and co-authored its textbook, “Secondary Analysis of Electronic Health Records.” I served as the General Chair for the NIPS 2018 Machine Learning for Health (ML4H) workshop, and co-organized the NIPS 2017 ML4H workshop, the COLING 2016 Clinical NLP workshop, and several “datathon” events, which bring together participants to address problems of clinical interest. I also served as a mentor for the MIT Summer Research Program (MSRP) and spent time as a Software Engineering Intern at Intel Corporation. Prior to MIT, I was a Program Manager at Microsoft Corporation and an Associate Product Manager Intern at Google, and I received B.S. and M.S. degrees in computer science from Columbia University. While at Columbia University, I was an MS-TA fellow and a recipient of the Andrew P. Kosoresow Memorial Award for Outstanding Performance in TA-ing and Service.

CliNER: Clinical Concept Extraction

Clinical concept extraction (CCE) of named entities, such as problems, tests, and treatments, aids in forming an understanding of notes and provides a foundation for many downstream clinical decision-making tasks. Historically, this task has been posed as a standard named entity recognition (NER) sequence tagging problem and solved with feature-based methods using hand-engineered domain knowledge. Recent advances, however, have demonstrated the efficacy of LSTM-based models for NER tasks, including CCE. This work presents CliNER 2.0, a simple-to-install, open-source tool for extracting concepts from clinical text. CliNER 2.0 uses a word- and character-level LSTM model and achieves state-of-the-art performance. For ease of use, the tool also includes pre-trained models available for public use.
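
To make the sequence-tagging framing concrete, the following is a minimal sketch of how per-token tags can be turned back into extracted concepts. It assumes a BIO tagging scheme (B- opens a concept, I- continues it, O is outside any concept), which the abstract does not specify; the function name decode_bio and the sample sentence are hypothetical, and this is not CliNER's own code or API.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (concept text, concept type) pairs.

    Assumption: tags look like "B-problem", "I-problem", or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    def flush() -> None:
        # Emit the concept collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new concept starts; close the previous one first.
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open concept.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open concept
            # (such a stray I- token is dropped in this sketch).
            flush()
            current_tokens = []
            current_label = None
    flush()
    return spans


# Hypothetical example sentence with hand-written tags.
tokens = ["Patient", "denies", "chest", "pain", ";", "started", "aspirin"]
tags = ["O", "O", "B-problem", "I-problem", "O", "O", "B-treatment"]
print(decode_bio(tokens, tags))
# [('chest pain', 'problem'), ('aspirin', 'treatment')]
```

In a tagger of the kind described above, the tags would come from the model's per-token predictions rather than being written by hand.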


Synthetically-Identified Clinical Notes

Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as de-identification. To build tools that perform de-identification, one typically needs the very same data that is private, which creates a chicken-and-egg problem. In this work, we generate "fake" clinical notes in which the de-identified information is replaced with real-seeming values (e.g. "Tim Lywood" instead of "George Beveridge") that still respect reasonable distributional semantics. We evaluate models trained on this synthetic data and show that they perform just as well as models trained on the sensitive PHI-bearing notes.
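
The replacement step can be illustrated with a minimal sketch. It assumes de-identified notes mark removed PHI with typed placeholders of the form [**NAME**], and it draws surrogates uniformly from small hand-written lists; the placeholder format, the SURROGATES table, and the fill_placeholders function are all hypothetical. The work described above selects values that respect distributional semantics, which this simple random choice does not attempt.

```python
import random
import re

# Hypothetical surrogate values for each placeholder type.
SURROGATES = {
    "NAME": ["Tim Lywood", "Ana Reyes", "Priya Nair"],
    "HOSPITAL": ["Lakeside General", "Northfield Clinic"],
    "DATE": ["March 3", "October 17"],
}

# Assumed placeholder format: [**TYPE**]
PLACEHOLDER = re.compile(r"\[\*\*(\w+)\*\*\]")


def fill_placeholders(note: str, rng: random.Random) -> str:
    """Replace each typed placeholder with a randomly chosen surrogate."""

    def substitute(match: re.Match) -> str:
        choices = SURROGATES.get(match.group(1))
        # Leave placeholders of unknown type untouched.
        return rng.choice(choices) if choices else match.group(0)

    return PLACEHOLDER.sub(substitute, note)


note = "[**NAME**] was seen at [**HOSPITAL**] on [**DATE**] for chest pain."
print(fill_placeholders(note, random.Random(0)))
```

Passing a seeded random.Random keeps the generated note reproducible, which helps when comparing models trained on different synthetic corpora.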
