Guest post by Dina Demner-Fushman, MD, PhD, staff scientist at NLM.
“So, what do you do for a living?”
It’s a natural dinner party question, but my answer can prompt that glazed-over look we all dread.
I am a computational linguist, also known (arguably) as a specialist in natural language processing (NLP), and I work at the National Library of Medicine.
If I strike the right tone of excitement and intrigue, I might buy myself a few minutes to explain.
My work combines computer science and linguistics, and since I focus on biomedical and clinical texts, it also requires adding some biological, medical, and clinical know-how to the mix.
I work specifically in biomedical natural language processing (BioNLP). The definition of BioNLP has varied over the years, with the spotlight shifting from one task to another—from text mining to literature-based discovery to pharmacovigilance, for example—but the core purpose has remained essentially unchanged: training computers to automatically understand natural language to speed discovery, whether in service of research, patient care, or public health.
The field has been around for a while. In 1969, NIH researchers Pratt and Pacak described the early hope for what we now call BioNLP in the paper “Automated processing of medical English,” which they presented at a computational linguistics conference:
The development of a methodology for machine encoding of diagnostic statements into a file, and the capability to retrieve information meaningfully from [a] data file with a high degree of accuracy and completeness, is the first phase towards the objective of processing general medical text.
NLM became involved in the field shortly thereafter, first with the Unified Medical Language System (UMLS) and later with tools to support text processing, such as MetaMap and TextTool, all of which we’ve improved and refined over the years. The more recent Indexing Initiative combines these tools with other machine learning methods to automatically apply MeSH terms to PubMed journal articles. (A human checks the computer’s work, revising as needed.)
These and NLM’s other NLP developments help improve the Library’s services, but they are also freely shared with the world, broadening our impact and, more importantly, helping to handle the global proliferation of scientific and clinical text.
It’s that last piece that makes NLP so hot right now.
NLP, we’re finding, can take in large numbers of documents and locate relevant content, summarize text, apply appropriate descriptors, and even answer questions.
It’s every librarian’s—and every geek’s—dream.
But how can we use it?
Imagine, for example, the ever-expanding volume of health information around patients’ adverse reactions to medications. At least four different—and prolific—content streams feed into that pool of information:
- the reactions reported in the literature, frequently in pre-market research (e.g., in the results of clinical trials);
- the labeled reactions, i.e., the reactions described in the official drug labels provided by manufacturers;
- the reactions noted in electronic health records and clinical progress notes; and
- the reactions described by patients in social media.
NLM’s work in NLP—and its funding of extramural research in NLP—is helping develop approaches and resources for extracting and synthesizing adverse drug reactions from all four streams, giving a more complete picture of how people across the spectrum are responding to medications.
It’s a challenging task. Researchers must reconcile the different vocabularies and language structures of each stream to extract the information, but I predict that NLP, and my fellow computational linguists, will prove up to it.
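To make the vocabulary problem concrete, here is a toy sketch in Python of how reactions phrased differently in each stream might be mapped to one preferred term and then merged. Everything in it is an assumption made for illustration: the `SYNONYMS` table, the example sentences, and the function names are invented, and real systems rely on far richer resources (such as the UMLS) and methods than simple phrase matching.

```python
import re
from collections import defaultdict

# Hypothetical toy vocabulary (an assumption for illustration only):
# surface phrases from different kinds of text mapped to one preferred term.
SYNONYMS = {
    "nausea": "nausea",
    "felt sick to my stomach": "nausea",
    "queasy": "nausea",
    "cephalalgia": "headache",
    "headache": "headache",
    "pounding head": "headache",
}


def find_reactions(text):
    """Return the preferred terms whose surface phrases occur in the text."""
    lowered = text.lower()
    found = set()
    for phrase, preferred in SYNONYMS.items():
        # Whole-phrase match so that short phrases do not fire inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(preferred)
    return found


def merge_streams(streams):
    """Map each preferred reaction term to the sorted list of streams mentioning it."""
    merged = defaultdict(set)
    for source, texts in streams.items():
        for text in texts:
            for reaction in find_reactions(text):
                merged[reaction].add(source)
    return {reaction: sorted(sources) for reaction, sources in merged.items()}


# Invented example sentences, one per content stream.
streams = {
    "literature": ["Nausea was reported by some trial participants."],
    "drug_label": ["Adverse reactions include cephalalgia."],
    "clinical_notes": ["Pt c/o headache after second dose."],
    "social_media": ["Took it last night and felt sick to my stomach all day."],
}

result = merge_streams(streams)
for reaction, sources in sorted(result.items()):
    print(reaction, sources)
```

Even in this tiny example, the drug label's "cephalalgia" and the clinical note's "headache" land under the same term, which is the kind of reconciliation across streams that the real research has to achieve at scale.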
Now imagine parents seeking health information regarding their sick child.
NLP can answer their question, first by understanding key elements in the incoming question and then by providing a response, either by drawing upon a database of known answers (e.g., FAQs maintained by the NIH institutes) or by summarizing relevant PubMed or MedlinePlus articles. Such quick access to accurate and trustworthy health information has the potential to save time and to save lives.
We’re not fully there yet, but as our research continues, we get closer.
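The two steps described above (identify the key elements of the question, then look for a known answer) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `FAQS` entries, the `STOPWORDS` list, the `min_score` threshold, and the simple word-overlap score are all invented for the example and do not represent how NLM's question-answering systems actually work.

```python
import re

# Hypothetical stopword list and FAQ entries (assumptions for illustration only).
STOPWORDS = {
    "a", "an", "the", "my", "i", "is", "in", "do", "has", "have",
    "how", "what", "when", "should", "to", "of", "for",
}

FAQS = [
    ("What should I do when my child has a fever?",
     "See the FAQ entry on fever in children."),
    ("How is an ear infection in a child treated?",
     "See the FAQ entry on ear infections."),
]


def key_terms(text):
    """Step 1: reduce a question to its key elements (lowercased non-stopwords)."""
    return {word for word in re.findall(r"[a-z]+", text.lower())
            if word not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_answer(question, faqs, min_score=0.3):
    """Step 2: return the answer of the closest FAQ, or None if nothing is close."""
    terms = key_terms(question)
    best_question, answer = max(
        faqs, key=lambda qa: jaccard(terms, key_terms(qa[0]))
    )
    if jaccard(terms, key_terms(best_question)) < min_score:
        return None  # a fuller system might fall back to summarizing articles here
    return answer


print(best_answer("My child has a fever, what should I do?", FAQS))
print(best_answer("How do I renew my passport?", FAQS))
```

Returning `None` when no FAQ is close enough marks the point where a fuller system would switch to the second strategy mentioned above, summarizing relevant articles.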
Maybe it’s time I reconsider how I answer that perennial dinner party question: “I’m a computational linguist, and I help improve health.”
Dina Demner-Fushman, MD, PhD, is a staff scientist in NLM’s Lister Hill National Center for Biomedical Communications. She leads research in information retrieval and natural language processing focused on clinical decision-making, answering clinical and consumer health questions, and extracting information from clinical text.