In the last 10 years, it’s become far more common for physicians to keep records electronically. Those records could contain a wealth of medically useful data: hidden correlations between symptoms, treatments and outcomes, for instance, or indications that patients are promising candidates for trials of new drugs.
Much of that data, however, is buried in physicians’ freeform notes. One of the difficulties in extracting data from unstructured text is what computer scientists call word-sense disambiguation. In a physician’s notes, the word “discharge,” for instance, could refer to a bodily secretion — but it could also refer to release from a hospital. The ability to infer words’ intended meanings makes it much easier for computers to find useful patterns in mountains of data.
At the American Medical Informatics Association’s (AMIA) annual symposium next week, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) will present a new system for disambiguating the senses of words used in doctors’ clinical notes.
On average, the system is 75 percent accurate in disambiguating words with two senses, a marked improvement over previous methods. But more important, says MIT postdoc Anna Rumshisky, it represents a new approach to word disambiguation that could lead to much more accurate systems while drastically reducing the amount of human effort required to develop them.
Indeed, Rumshisky says, the paper that was initially accepted to the AMIA symposium described a system that used a more conventional approach to word disambiguation, with an average accuracy of only about 63 percent. “In our opinion, that wasn’t enough to actually be usable,” Rumshisky says. “So what we tried instead was something that’s been tried before in the general domain but never in the biomedical or clinical domains.”
In particular, Rumshisky explains, she and her co-authors adapted algorithms from a research area known as topic modeling, which seeks to automatically identify the topics of documents by inferring relationships among prominently featured words. Her co-authors are graduate student Rachel Chasin, whose master’s thesis is the basis for the new paper; Peter Szolovits, an MIT professor of computer science and engineering and health science and technology; and research affiliate Özlem Uzuner, who earned her PhD at MIT and is now an assistant professor at the University at Albany.
“The twist on it that we’re trying to transpose from the general domain is to treat occurrences of a target word as documents and to treat senses as hidden topics that we’re trying to infer,” Rumshisky says.
Where an ordinary topic-modeling algorithm will search through huge bodies of text to identify clusters of words that tend to occur in close proximity to each other, Rumshisky and her colleagues’ algorithm identifies correlations not only between words but between words and other textual “features” — such as the words’ syntactic roles. If the word “discharge” is preceded by an adjective, for instance, it’s much more likely to refer to a bodily secretion than to an administrative event.
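To make the idea of treating each occurrence of a target word as its own small “document” more concrete, here is a minimal sketch in Python. It is an illustration only, not the researchers’ code: the function name, the window size, and the tiny adjective list (a crude stand-in for a real part-of-speech tagger) are all assumptions made for this example.

```python
from typing import List, Set


def occurrence_documents(tokens: List[str], target: str,
                         adjectives: Set[str], window: int = 3) -> List[List[str]]:
    """Build one bag of features (a pseudo-document) per occurrence of `target`."""
    docs = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Lexical features: the neighbouring words inside the window.
        feats = ["w=" + tokens[j] for j in range(lo, hi) if j != i]
        # Crude syntactic feature: is the preceding token a known adjective?
        if i > 0 and tokens[i - 1] in adjectives:
            feats.append("prev_pos=ADJ")
        docs.append(feats)
    return docs


# Hypothetical adjective list; a real system would use a part-of-speech tagger.
ADJECTIVES = {"purulent", "yellow", "bloody"}

# Invented example text, not taken from any real clinical note.
note = "patient has purulent discharge from wound . plan discharge home tomorrow".split()
docs = occurrence_documents(note, "discharge", ADJECTIVES)
for d in docs:
    print(d)
```

Each printed list is the pseudo-document for one occurrence of “discharge”. Only the first carries the `prev_pos=ADJ` feature, which is the kind of cue that, according to the researchers, points toward the bodily-secretion sense.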
Ordinarily, topic-modeling algorithms assign different weights to different topics: A single news article, for instance, might be 50 percent about politics, 30 percent about the economy, and 20 percent about foreign affairs. Similarly, the MIT researchers’ new algorithm assigns different weights to the different possible meanings of ambiguous words.
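The weighting of senses can be sketched with a standard topic model. The Python example below uses plain latent Dirichlet allocation with collapsed Gibbs sampling, treating each occurrence’s bag of features as a document and each sense as a hidden topic. The article does not say which inference procedure the researchers use, so this is a generic stand-in; the function name, the hyperparameters `alpha` and `beta`, the iteration count, and the toy feature bags are all assumptions.

```python
import random
from typing import List


def infer_sense_weights(docs: List[List[str]], num_senses: int = 2,
                        alpha: float = 0.5, beta: float = 0.1,
                        iters: int = 200, seed: int = 0) -> List[List[float]]:
    """Return, for each pseudo-document, a weight for every candidate sense."""
    rng = random.Random(seed)
    vocab = sorted({f for d in docs for f in d})
    idx = {f: i for i, f in enumerate(vocab)}
    V = len(vocab)

    # Count tables used by the sampler.
    doc_sense = [[0] * num_senses for _ in docs]
    sense_feat = [[0] * V for _ in range(num_senses)]
    sense_total = [0] * num_senses

    # Start from a random sense assignment for every feature token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for f in doc:
            z = rng.randrange(num_senses)
            zs.append(z)
            doc_sense[d][z] += 1
            sense_feat[z][idx[f]] += 1
            sense_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, f in enumerate(doc):
                w, z = idx[f], assign[d][n]
                # Remove the current assignment from the counts.
                doc_sense[d][z] -= 1
                sense_feat[z][w] -= 1
                sense_total[z] -= 1
                # Resample the sense in proportion to how well it fits
                # both this occurrence and this feature.
                probs = [
                    (doc_sense[d][k] + alpha)
                    * (sense_feat[k][w] + beta) / (sense_total[k] + V * beta)
                    for k in range(num_senses)
                ]
                z = rng.choices(range(num_senses), weights=probs)[0]
                assign[d][n] = z
                doc_sense[d][z] += 1
                sense_feat[z][w] += 1
                sense_total[z] += 1

    # Smoothed per-occurrence sense proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + num_senses * alpha) for c in counts]
        for doc, counts in zip(docs, doc_sense)
    ]


# Toy feature bags for four occurrences of "discharge" (invented data).
toy_docs = [
    ["w=purulent", "w=wound", "prev_pos=ADJ"],
    ["w=yellow", "w=wound", "prev_pos=ADJ"],
    ["w=home", "w=plan", "w=tomorrow"],
    ["w=home", "w=summary", "w=plan"],
]
weights = infer_sense_weights(toy_docs)
for w in weights:
    print([round(x, 2) for x in w])
```

Because the model is unsupervised, the two senses come out as unlabeled clusters; which one means “secretion” and which means “release from hospital” still has to be read off from the features that dominate each cluster. On data this small the weights are also noisy, so the example shows the mechanics rather than meaningful accuracy.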
One advantage of topic-modeling algorithms is that they’re “unsupervised”: They can be deployed on huge bodies of text without human oversight. As a consequence, the researchers can keep revising their algorithm so that it incorporates more features, then set it loose on unannotated medical papers to draw its own inferences. And the more features it incorporates, the more accurate it should be, Rumshisky says.
Among the features that the researchers plan to incorporate into the algorithm are listings in a huge thesaurus of medical terms, compiled by the National Institutes of Health, called the Unified Medical Language System (UMLS). Indeed, word associations in the UMLS were the basis of the researchers’ original algorithm — the one that achieved 63 percent accuracy. There, the problem was that the length and structure of the paths from one word to another in the UMLS didn’t always correspond to the semantic difference between the words. But the new system intrinsically identifies only those correspondences that recur with enough frequency that they’re likely to be useful.
“The parts of the [UMLS] that are relevant for distinguishing the senses would basically float to the top by themselves,” Rumshisky says. “It kind of gives you, for free, this association, if it’s valid. If it’s not valid, it just won’t matter.”
The researchers are also experimenting with two further sources of evidence: additional syntactic and semantic features that could help with word disambiguation, and word associations established by the NIH’s Medical Subject Headings paper-classification scheme. “It’s still not perfect, because we haven’t integrated all the linguistic features that we want to,” Rumshisky says. “But my hunch is that this is the way to go.”
“About 80 percent of clinical information is buried in clinical notes,” says Hongfang Liu, an associate professor of medical informatics at the Mayo Clinic. “A lot of words or phrases are ambiguous there. So in order to get the correct interpretation, you need to go through the word-disambiguation phase.”
Liu says that while some computational linguists have applied topic-modeling algorithms to the problem of word-sense disambiguation, “My feeling is that they work on kind of toy problems. And here, I think, it can actually be used in production-scale natural-language-processing systems.”