Making Connections and Enabling Discoverability – Celebrating 30 Years of UMLS

Guest post by NLM staff: David Anderson, UMLS Production Coordinator; Liz Amos, Special Assistant to the Chief Health Data Standards Officer; Anna Ripple, Information Research Specialist; and Patrick McLaughlin, Head, Terminology QA & User Services Unit.

Shortly after Donald A.B. Lindberg, MD, was sworn in as NLM Director in 1984, he asked, “What is NLM, as a government agency, uniquely positioned to do?” Through conversations with experts, Dr. Lindberg identified a looming question in the field of bioinformatics: How can machines act as if they understand biomedical meaning? At the time, the information necessary to answer this question was distributed across a variety of resources, and very few publicly available tools for processing biomedical text had been developed. NLM had experience with terminology development and maintenance (MeSH, the Medical Subject Headings), coordinating distributed systems (DOCLINE), and distributing and providing access to large datasets (MEDLINE) in an era when this was a challenge.

As a national library, NLM was deeply interested in providing good answers to biomedical questions. For these reasons, NLM was uniquely positioned to develop a system — the Unified Medical Language System (UMLS) — that could lay the groundwork for machines to act as if they understand biomedical meaning. This year marks the 30th anniversary of the release of the first edition of the UMLS in November 1990.

Achieving the Unified Medical Language System

The result of a large-scale, NLM-led research and development project, the UMLS began with the audacious goal of helping computer systems behave as if they understand the meaning of the language of biomedicine and health. The UMLS was expected to facilitate the development of systems that could retrieve, integrate, and aggregate conceptually related information from disparate electronic sources, such as literature databases, clinical records, and databanks, despite differences in the vocabularies and coding systems used within them and in the terminology employed by users.

Betsy Humphreys (left) and Dr. Lindberg (right) tout the release of the Unified Medical Language System in 1990.

Under the direction of Dr. Donald Lindberg; Betsy Humphreys, then Deputy Associate Director for Library Operations; and a multidisciplinary, international team from academia and the private sector, the UMLS evolved into an essential tool for enabling interoperability, natural language processing, information retrieval, machine learning, and other data science use cases.

UMLS Knowledge Sources

Central to the UMLS model is the grouping of synonymous names into UMLS concepts and the assignment of broad categories (semantic types) to those concepts. Since the first release in 1990, NLM has continued to expand and update the UMLS Knowledge Sources based on feedback from testing and use.

The UMLS Metathesaurus was the first biomedical terminology resource organized by concept, and its development had a significant impact on subsequent medical informatics theory and practice. The broad terminology coverage, synonymy, and semantic categorization in the UMLS, in combination with its lexical tools, enable its primary use cases:

  • identifying meaning in text,
  • mapping between vocabularies, and
  • improving information retrieval.
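To make the concept model concrete, here is a minimal sketch, in Python with invented example data, of how synonymous names from different source vocabularies can be grouped under one concept identifier and tagged with a broad semantic type. The identifier, source names, and semantic type below are illustrative stand-ins, not actual Metathesaurus content.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A UMLS-style concept: one identifier, many synonymous names
    contributed by different source vocabularies, and one or more
    broad semantic types."""
    cui: str                                      # concept unique identifier
    synonyms: dict = field(default_factory=dict)  # source vocabulary -> name
    semantic_types: list = field(default_factory=list)

# Illustrative entry only -- not real Metathesaurus data.
heart_attack = Concept(
    cui="C0000000",
    synonyms={
        "MSH": "Myocardial Infarction",
        "SNOMEDCT_US": "Myocardial infarction",
        "consumer": "heart attack",
    },
    semantic_types=["Disease or Syndrome"],
)

def same_meaning(concept, a, b):
    """Two strings 'mean the same' if both appear among the concept's
    synonyms, regardless of which vocabulary supplied them."""
    names = {name.lower() for name in concept.synonyms.values()}
    return a.lower() in names and b.lower() in names

print(same_meaning(heart_attack, "heart attack", "Myocardial Infarction"))  # True
```

The real Metathesaurus carries far richer data for each concept, but the core idea is the same: one concept, many names, broad semantic categories.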

The increase in UMLS use over the past decade reflects broad developments in health policy, including the designation of SNOMED CT, LOINC, and RxNorm (three component vocabularies included in the UMLS Metathesaurus) as U.S. national standards for clinical data in quality-improvement payment programs such as CMS’s Promoting Interoperability Programs (previously known as Meaningful Use). Many UMLS source vocabularies are also referenced in the United States Core Data for Interoperability (USCDI). Researchers continue to rely on the UMLS as a knowledge base for natural language processing and data mining, and the UMLS community of users has developed several tools that enhance and expand the capabilities of the UMLS.

Celebrating 30 Years

Thirty years after the initial release of the UMLS Knowledge Sources, the UMLS resources continue to be of benefit to millions of people worldwide. The UMLS is used in NLM flagship applications such as PubMed and ClinicalTrials.gov. Additionally, some researchers and system developers use the UMLS to build or enhance electronic resources, clinical data warehouses, components of electronic health record systems, natural language processing pipelines, and test collections. UMLS resources are being used primarily as intended, to facilitate the interpretation of biomedical meaning in disparate electronic information and data in many different computer systems serving scientists, health professionals, and the public.

The Journal of the American Medical Informatics Association is commemorating the 30th UMLS anniversary with a special focus issue dedicated to the memory of Dr. Lindberg (1933–2019) that also includes information on current research and applications, broader impacts, and future directions of the UMLS.

Upon her retirement from NLM in 2017, Betsy Humphreys remarked that “systems that get used, get better.” As the UMLS enters its fourth decade, a review of UMLS production methods and priorities is underway, guided by the same ambitious goals with which the project started: trailblazing into the future to improve biomedical information storage, processing, and retrieval.

As we reflect on this important milestone, we want to thank stakeholders, like you, who have provided feedback over the years to help us make the UMLS leaner, stronger, and more useful.

Top row: David Anderson, UMLS Production Coordinator and Liz Amos, Special Assistant to the Chief Health Data Standards Officer

Bottom Row: Anna Ripple, Information Research Specialist and Patrick McLaughlin, Head, Terminology QA & User Services Unit

Asking the right questions and receiving the most useful answers

Guest post by Lawrence M. Fagan, Associate Director (retired), Biomedical Informatics Training Program, Stanford University.

As online resources proliferate, it becomes harder to figure out which resources—and which parts of those resources—will best answer patients’ questions about their medical care. Patients have access to multiple websites that summarize information about a particular disease, myriad patient communities, and many online research databases.

This resource overload problem isn’t new, however. As more and more data became available in the intensive care units of the 1980s, it grew increasingly difficult to determine which measurements were most important to track for optimal care.

In short, more information isn’t necessarily the best solution when it comes to answering patient questions.

After a career in informatics, I now moderate an online community for patients with a particular subtype of lymphoma. Many questions that arise in the group can easily be answered by reviewing existing online content, and health librarians are excellent guides to the right resources and articles. However, some queries are less straightforward, such as: “What is the one thing you wished you knew before the procedure?” Rather than asking for a recitation of the steps in the procedure, this question asks what was unexpected, or what step would have benefited from more patient or caregiver preparation. Questions like these are hard to organize, store, and retrieve from patient-oriented databases.

Sometimes the community of patients can recognize patterns that escape the notice of medical providers. For example, a lymphoma patient may complain of repeated sinus infections. Patients often turn to their primary care provider to treat sinus infections, and those visits may lead to antibiotic prescriptions. In this scenario, group members have pointed out the potential link between treatment with the drug rituximab and a decrease in the body’s immunoglobulin levels, a connection that suggests treating chronic sinus infections, in this specific context, with immunoglobulin replacement therapy rather than antibiotics.

Specialized online communities can also provide help with detailed care issues, including the treatment of side effects for uncommonly used drugs with which local healthcare providers might not be familiar.

Online communities can also point patients to research databases that can help answer their questions. ClinicalTrials.gov helps locate experimental treatments for specific medical conditions, and some community discussions about trials go beyond what’s included in the ClinicalTrials.gov database. For instance, group members may discuss the optimal order in which to pursue clinical trials in a specific medical area, based on an analysis of the trials’ inclusion criteria. In addition, there are ancillary questions about trial logistics that aren’t found in the database, such as, “I live in the San Francisco area—is it feasible to participate in Trial X at City of Hope in Southern California?” Setting up comprehensive links between clinical databases and discussions in patient communities would help patients find answers to their questions more efficiently.

The answers to these specialized questions are often found in the archives of online communities or in the memories of group participants. Yet it is not easy to find the right community for a particular medical problem, and to my knowledge there is no central repository of links to online communities. Moreover, while many community links can be found in MedlinePlus, static links to community websites often become stale as sites migrate over time; some of the ACOR cancer communities, for example, have moved to SmartPatients.com. As patients find a community of interest, it is important that they determine whether the conversations are ongoing and whether the participants are knowledgeable and supportive. The Mayo Clinic offers a short discussion of the pros and cons of support groups.

Researchers have examined patient and clinician information needs for more than a quarter century. The resulting models, however, have only rarely been incorporated into information retrieval systems. One successful example (aimed at providers) is the use of “clinical queries” in PubMed, designed for searching the scientific literature. This brings us to a critical question: What would it take to reengineer patient-oriented retrieval systems so that focused queries like these drive most patient sites?
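As a concrete illustration of the provider-facing example, here is a minimal sketch, in Python, that calls NCBI’s public E-utilities to run a PubMed search narrowed with a Clinical Queries-style therapy filter. It assumes network access and the third-party requests package, and the filter term follows PubMed’s published Clinical Queries filter syntax, which is worth verifying against current documentation.

```python
import requests  # third-party; pip install requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def clinical_query(condition, retmax=5):
    """Search PubMed for therapy-focused studies about a condition,
    narrowing results with a Clinical Queries-style filter term."""
    term = f"{condition} AND Therapy/Narrow[filter]"
    resp = requests.get(
        ESEARCH,
        params={"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]  # PubMed IDs (PMIDs)

print(clinical_query("follicular lymphoma"))
```

A patient-oriented analogue would need retrieval targets and filters designed around the questions patients actually ask, which is precisely the reengineering challenge.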

For now, we have communities of patients and dedicated professionals who are ready and willing to help point to the most useful answers.

Please note: The mention of any commercial or trade name is for information and does not imply endorsement on the part of the author or the National Library of Medicine.

Many thanks to Dave deBronkart, Janet Freeman-Daily, Robin Martinez, Tracie Tavel, and Roni Zeiger who reviewed earlier versions of this blog post.

Outdoor portrait of Lawrence M. Fagan.

Lawrence Fagan, MD, PhD, retired in 2012 from his role as Associate Director of the Stanford University Biomedical Informatics Training Program. He is a Fellow of the American College of Medical Informatics. His current interests are in patient engagement, precision health, and preventing medical errors.

Training for Lifelong Learning

To say biomedical informatics is a rapidly changing field might be an understatement. Or a truism. Probably both.

Given its interdisciplinary nature and the myriad ways each of those disciplines is changing, it’s no wonder. From advances in molecular biology to the gigantic leaps we’re making in artificial intelligence and pattern recognition, the fields that feed into biomedical informatics are speeding forward, so we shouldn’t be surprised they’re driving biomedical informatics forward as well.

Dr. George Hripcsak’s post from last week made this point in the context of biomedical informatics training. Our trainees must be prepared to master what will likely be a never-ending series of new topics and skills, and our training programs must evolve to keep up with them. And while we can’t anticipate every twist or turn, we can prepare our trainees for the road ahead by giving them the skills to navigate change.

NLM is trying to do that.

NLM supports university-based training in biomedical informatics and data science at 16 institutions around the country. That translates into over 200 trainees supported annually.

While the university programs share common elements, in the end each is unique. They vary in focus, with some emphasizing the informatics of biological phenomena and others addressing clinical informatics. They also require different levels of coursework. But in general, both pre- and postdoctoral trainees in these programs attend classes, participate in research projects, and are mentored to become independent researchers, earning a PhD or a master’s degree upon completion.

Annually, the predoctoral students, postdoctoral fellows, and the faculty from the 16 university programs NLM supports get together for a two-day meeting. It’s both an honored tradition and a much-valued component of the training process—kind of a networking event crossed with an extended family reunion. In a good way.

The meeting gives trainees the opportunity to develop career-shaping networks, learn about different concentrations in biomedical informatics, and, perhaps most importantly, present posters and podium talks that both hone their scientific communications skills and promote their research. Meanwhile, the training directors and faculty get together to share best practices, discuss curriculum, and offer NLM guidance regarding future training directions and support.

This year’s training meeting—hosted just last week by Vanderbilt University—emerged for the first time from the trainees and fellows themselves. That is, the Vanderbilt students planned the meeting (with a bit of guidance from their faculty). This shift put the meeting’s structure and content in the hands of those most likely to benefit from them—and most likely to know what they and their colleagues need to hear.

The outcome exceeded expectations.

The opening student-only social event kicked things off, and the pace never relented. In a good way.

Podium presentations of completed research joined poster presentations of works in progress, 3-3 lightning talks (three slides, three minutes), and small group “birds of a feather” discussions around themes such as interoperability, user experience, and curation.

Regardless of what was happening though, conversations abounded. The social mixing that sometimes took a full day to occur was evident in the first few hours, making those rooms loud! In a good way.

Clearly, peer-directed learning involves a lot of conversation.

When I had the chance to address the group, I pointed out how all that conversation paralleled the careers that lie before them. That is, in such a rapidly changing field, never-ending curiosity and unrelenting inquiry are absolutely essential. Trainees and fellows must be prepared for an ever-changing world and embrace the idea that their current training programs are launch pads, not tool belts. Content mastery will get them only so far.

To respect the public investment in their careers, they must always learn, always question, always engage.

They can’t rush it. Or consider it done. Careers take a lifetime.

And NLM is committed to preparing them for that lifetime of contribution and discovery. After all, those working in a field that is ever-changing must be ever-changing themselves. In a good way.

The Evolution of Data Science Training in Biomedical Informatics

Guest post by Dr. George Hripcsak, Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus.

Biomedical informatics is an exciting field that addresses information in biomedicine. At over half a century, it is older than many realize. Looking back, I am struck that in one sense, its areas of interest have remained stable. As a trainee in the 1980s, I published on artificial neural networks, clinical information systems, and clinical information standards. In 2018, I published on deep learning (neural networks), electronic health records (clinical information systems), and terminology standards. I believe this stability reflects the maturity of the field and the difficult problems we have taken on.

On the other hand, we have made enormous progress. In the 1980s we dreamed of adopting electronic health records and the widespread use of decision support fueled by computational techniques. Nowadays we celebrate and bemoan the widespread adoption of electronic health records, although we still look forward to more widespread decision support.

Data science has filled the media lately, and it has been part of biomedical informatics throughout its life. Progress here has been especially notable.

Take the Observational Health Data Sciences and Informatics (OHDSI) project as an example: a billion patient records from about 400 million unique patients, with 200 researchers from 25 countries. This scale would not have been possible in the 1980s. A combination of improved health record adoption, improved clinical data standards, more computing power and data storage, advanced data science methods (regularized regression, Bayesian approaches), and advanced communications has made it possible. For example, you can now look up any side effect of any drug on the world market, review a 17,000-hypothesis study (publication forthcoming) comparing the side effects caused by different treatments for depression, and study how three chronic diseases are actually treated around the world.

How we teach data science in biomedical informatics has also evolved. Take as an example Columbia University’s Department of Biomedical Informatics training program, which has been funded by the National Library of Medicine for about three decades. Under its founding chair, Paul Clayton, the program initially focused on clinical information systems, and while researchers individually worked on what today would be called data science, the curriculum emphasized techniques related to those systems. For the first decade, our data science methods were largely drawn from computer science and statistics courses, with the department focusing on the application of those techniques. During that time, I filled a gap in my own data science knowledge by obtaining a master’s degree in biostatistics.

In the second decade, as presented well by Ted Shortliffe and Stephen Johnson in the 2002 IMIA Yearbook of Medical Informatics, the department shifted to take on a greater responsibility for teaching its own methods, including data science. Our core courses focused on data representation, information systems, formal models, information presentation, decision making, evaluation, and specialization in application tracks. The Methods in Medical Informatics course focused mainly on how to represent knowledge (using Sowa’s 1999 Knowledge Representation textbook), but it also included numeric data science components like Bayesian inference, Markov models, and machine learning algorithms, with the choice between symbolic and statistical approaches to solving problems as a recurring theme. We also relied on computer science and statistics faculty to teach data management, software engineering, and basic statistics.

In the most recent decade, the department expanded its internal focus on data science and made it more explicit, with the content from the original methods course split among three courses: computational methods, symbolic methods, and research methods. The computational methods course covered the numerical methods commonly associated with data science, and the symbolic methods course included the representational structures that support the data.

This expansion into data science continued four years ago when Noemie Elhadad created a data science track (with supplemental funding from the National Library of Medicine) that encouraged interested students to dive more deeply into data science through additional departmental and external courses. At present, all students get a foundation in data science through the computational methods class and required seminars, and those with additional interest can engage as deeply as any computer science or statistics trainee.

We encourage our students not just to apply data science methods but to develop new methods, including supplying the theoretical foundation for the work. While this may not be for every informatics trainee, we believe that our field must be as rigorous as the methodological fields we pull from. Examples include work on deep hierarchical families by Ranganath, Blei, and colleagues, and remaking survival analysis with Perotte and Elhadad.

To survive, a department must look forward. Our department invested heavily in data science and in electronic health record research in 2007. A decade later, what is on the horizon?

I believe informatics will come full circle, returning at least in part to its physiological modeling origins that predated our department. As we reach the limits of what our noisy and sparse data can provide for deep learning, we will learn to exploit pre-existing biomedical knowledge in different forms of mechanistic models. I believe these hybrid empirical-mechanistic methods can produce patient-specific recommendations and realize the dream of precision medicine. And we have begun to teach our trainees how to do it.

Formal headshot of Dr. Hripcsak.

George Hripcsak, MD, MS, is Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus. He has more than 25 years of experience in biomedical informatics, with a special interest in the clinical information stored in electronic health records and the development of next-generation health record systems. He is an elected member of the Institute of Medicine and an elected fellow of the American College of Medical Informatics and the New York Academy of Medicine. He has published more than 250 papers and previously chaired NLM’s Biomedical Library and Informatics Review Committee.

The Rise of Computational Linguistics Geeks

Guest post by Dina Demner-Fushman, MD, PhD, staff scientist at NLM.

“So, what do you do for a living?”

It’s a natural dinner party question, but my answer can prompt that glazed-over look we all dread.

I am a computational linguist, also known (arguably) as a specialist in natural language processing (NLP), and I work at the National Library of Medicine.

If I strike the right tone of excitement and intrigue, I might buy myself a few minutes to explain.

My work combines computer science and linguistics, and since I focus on biomedical and clinical texts, it also requires adding some biological, medical, and clinical know-how to the mix.

I work specifically in biomedical natural language processing (BioNLP). The definition of BioNLP has varied over the years, with the spotlight shifting from one task to another—from text mining to literature-based discovery to pharmacovigilance, for example—but the core purpose has remained essentially unchanged: training computers to automatically understand natural language to speed discovery, whether in service of research, patient care, or public health.

The field has been around for a while. In 1969 NIH researchers Pratt and Pacak described the early hope for what we now call BioNLP in the paper, “Automated processing of medical English,” which they presented at a computational linguistics conference:

The development of a methodology for machine encoding of diagnostic statements into a file, and the capability to retrieve information meaningfully from [a] data file with a high degree of accuracy and completeness, is the first phase towards the objective of processing general medical text.

NLM became involved in the field shortly thereafter, first with the Unified Medical Language System (UMLS) and later with tools to support text processing, such as MetaMap and TextTool, all of which we’ve improved and refined over the years. The more recent Indexing Initiative combines these tools with other machine learning methods to automatically apply MeSH terms to PubMed journal articles. (A human checks the computer’s work, revising as needed.)
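MetaMap itself is a sophisticated system, but a toy sketch can convey the basic idea behind mapping free text to controlled concepts: scan the text for the longest phrases that match entries in a synonym table and emit the corresponding identifiers. Everything below, including the mini-lexicon and its identifiers, is invented for illustration and is not MetaMap’s actual algorithm.

```python
# Toy dictionary-based concept mapping -- a drastic simplification of
# what MetaMap does. The lexicon and identifiers are examples only.
LEXICON = {
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "aspirin": "C0004057",
}

def map_concepts(text):
    """Greedily match the longest lexicon phrases, left to right."""
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        for length in range(min(4, len(words) - i), 0, -1):  # longest first
            phrase = " ".join(words[i:i + length])
            if phrase in LEXICON:
                found.append((phrase, LEXICON[phrase]))
                i += length
                break
        else:
            i += 1
    return found

print(map_concepts("patient given aspirin after a heart attack"))
# [('aspirin', 'C0004057'), ('heart attack', 'C0027051')]
```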

These and NLM’s other NLP developments help improve the Library’s services, but they are also freely shared with the world, broadening our impact and, more importantly, helping to handle the global proliferation of scientific and clinical text.

It’s that last piece that makes NLP so hot right now.

NLP, we’re finding, can take in large numbers of documents and locate relevant content, summarize text, apply appropriate descriptors, and even answer questions.

It’s every librarian’s—and every geek’s—dream.

But how can we use it?

Imagine, for example, the ever-expanding volume of health information around patients’ adverse reactions to medications. At least four different—and prolific—content streams feed into that pool of information:

  • the reactions reported in the literature, frequently in pre-market research (e.g., in the results of clinical trials);
  • the labeled reactions, i.e., the reactions described in the official drug labels provided by manufacturers;
  • the reactions noted in electronic health records and clinical progress notes; and
  • the reactions described by patients in social media.

NLM’s work in NLP—and its funding of extramural research in NLP—is helping develop approaches and resources for extracting and synthesizing adverse drug reactions from all four streams, giving a more complete picture of how people across the spectrum are responding to medications.
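To give a flavor of what synthesizing across those streams might involve, the toy sketch below normalizes adverse-reaction mentions from the four streams against a tiny shared vocabulary and tallies how often each normalized reaction appears. All of the mentions and the normalization table are invented.

```python
# Toy synthesis across the four adverse-reaction streams.
# All mentions and the normalization table are invented.
from collections import Counter

NORMALIZE = {
    "nausea": "nausea",
    "feeling sick": "nausea",   # social media phrasing
    "n/v": "nausea",            # clinical shorthand
    "emesis": "vomiting",
    "vomiting": "vomiting",
}

streams = {
    "literature":  ["nausea", "vomiting"],
    "drug_labels": ["nausea"],
    "ehr_notes":   ["n/v", "emesis"],
    "social":      ["feeling sick"],
}

counts = Counter(
    NORMALIZE[mention]
    for mentions in streams.values()
    for mention in mentions
    if mention in NORMALIZE
)
print(counts)  # Counter({'nausea': 4, 'vomiting': 2})
```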

It’s a challenging task. Researchers must address different vocabularies and language structures to extract the information, but NLP, and my fellow computational linguists, will, I predict, prove up to it.

Now imagine parents seeking health information regarding their sick child.

NLP can answer their questions, first by understanding the key elements of an incoming question and then by providing a response, either by drawing upon a database of known answers (e.g., FAQs maintained by the NIH institutes) or by summarizing relevant PubMed or MedlinePlus articles. Such quick access to accurate and trustworthy health information has the potential to save time and to save lives.
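A minimal sketch of that first, database-backed path: represent a small set of FAQ entries and the incoming question as TF-IDF vectors and return the stored answer whose question is most similar. The FAQ entries here are invented, scikit-learn is assumed to be installed, and a production system would add genuine question understanding, ranking, and safety checks.

```python
# Minimal FAQ-style question answering: match the incoming question
# to the closest stored question by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [
    ("What causes a barking cough in a child?",
     "A barking cough in young children is often caused by croup..."),
    ("How should I treat my child's fever?",
     "Most childhood fevers can be managed with rest and fluids..."),
]

questions = [q for q, _ in faq]
vectorizer = TfidfVectorizer().fit(questions)
faq_vectors = vectorizer.transform(questions)

def answer(user_question):
    """Return the stored answer whose question best matches the input."""
    scores = cosine_similarity(vectorizer.transform([user_question]), faq_vectors)
    return faq[scores.argmax()][1]

print(answer("My toddler has a barking cough -- what could it be?"))
```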

We’re not fully there yet, but as our research continues, we get closer.

Maybe it’s time I reconsider how I answer that perennial dinner party question: “I’m a computational linguist, and I help improve health.”

Headshot of Dr. Demner-Fushman.

Dina Demner-Fushman, MD, PhD, is a staff scientist in NLM’s Lister Hill National Center for Biomedical Communications. She leads research in information retrieval and natural language processing focused on clinical decision-making, answering clinical and consumer health questions, and extracting information from clinical text.