Training for Lifelong Learning

To say biomedical informatics is a rapidly changing field might be an understatement. Or a truism. Probably both.

Given its interdisciplinary nature and the myriad ways each of those disciplines is changing, it’s no wonder. From advances in molecular biology to the gigantic leaps we’re making in artificial intelligence and pattern recognition, the fields that feed into biomedical informatics are speeding forward, so we shouldn’t be surprised they’re driving biomedical informatics forward as well.

Dr. George Hripcsak’s post from last week made this point in the context of biomedical informatics training. Our trainees must be prepared to master what will likely be a never-ending series of new topics and skills, and our training programs must evolve to keep up with them. And while we can’t anticipate every twist or turn, we can prepare our trainees for the road ahead by giving them the skills to navigate change.

NLM is trying to do that.

NLM supports university-based training in biomedical informatics and data science at 16 institutions around the country. That translates into over 200 trainees supported annually.

While the university programs share common elements, in the end each is unique.  They vary in focus, with some emphasizing the informatics related to biological phenomena and others addressing clinical informatics. They also require different levels of course work. But in general, both pre- and postdoctoral trainees in these programs attend classes, participate in research projects, and are mentored to become independent researchers, earning a PhD or a Master’s degree upon completion.

Annually, the predoctoral students, postdoctoral fellows, and the faculty from the 16 university programs NLM supports get together for a two-day meeting. It’s both an honored tradition and a much-valued component of the training process—kind of a networking event crossed with an extended family reunion. In a good way.

The meeting gives trainees the opportunity to develop career-shaping networks, learn about different concentrations in biomedical informatics, and, perhaps most importantly, present posters and podium talks that both hone their scientific communications skills and promote their research. Meanwhile, the training directors and faculty get together to share best practices, discuss curriculum, and offer NLM guidance regarding future training directions and support.

This year’s training meeting—hosted just last week by Vanderbilt University—emerged for the first time from the trainees and fellows themselves.  That is, the Vanderbilt students planned the meeting (with a bit of guidance from their faculty). This shift put the meeting’s structure and content in the hands of those most likely to benefit from them—but also most likely to know what they and their colleagues need to hear.

The outcome exceeded expectations.

The opening student-only social event kicked things off, and the pace never relented. In a good way.

Podium presentations of completed research joined poster presentations of works in progress, 3-3 lightning talks (three slides, three minutes), and small group “birds of a feather” discussions around themes such as interoperability, user experience, and curation.

Regardless of what was happening though, conversations abounded. The social mixing that sometimes took a full day to occur was evident in the first few hours, making those rooms loud! In a good way.

Clearly, peer-directed learning involves a lot of conversation.

When I had the chance to address the group, I pointed out how all that conversation paralleled the careers that lie before them. That is, in such a rapidly changing field, never-ending curiosity and unrelenting inquiry are absolutely essential. Trainees and fellows must be prepared for an ever-changing world and embrace the idea that their current training programs are launch pads, not tool belts. Content mastery will get them only so far.

To respect the public investment in their careers, they must always learn, always question, always engage.

They can’t rush it. Or consider it done. Careers take a lifetime.

And NLM is committed to preparing them for that lifetime of contribution and discovery. After all, those working in a field that is ever-changing must be ever-changing themselves. In a good way.

The Evolution of Data Science Training in Biomedical Informatics

Guest post by Dr. George Hripcsak, Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus.

Biomedical informatics is an exciting field that addresses information in biomedicine. At over half a century, it is older than many realize. Looking back, I am struck that in one sense, its areas of interest have remained stable. As a trainee in the 1980s, I published on artificial neural networks, clinical information systems, and clinical information standards. In 2018, I published on deep learning (neural networks), electronic health records (clinical information systems), and terminology standards. I believe this stability reflects the maturity of the field and the difficult problems we have taken on.

On the other hand, we have made enormous progress. In the 1980s we dreamed of adopting electronic health records and the widespread use of decision support fueled by computational techniques. Nowadays we celebrate and bemoan the widespread adoption of electronic health records, although we still look forward to more widespread decision support.

Data science has filled the media lately, and it has been part of biomedical informatics throughout its life. Progress here has been especially notable.

Take the Observational Health Data Sciences and Informatics (OHDSI) project as an example: a billion patient records from about 400 million unique patients, with 200 researchers from 25 countries. This scale would not have been possible in the 1980s. A combination of improved health record adoption, improved clinical data standards, more computing power and data storage, advanced data science methods (regularized regression, Bayesian approaches), and advanced communications has made it possible. For example, you can now look up any side effect of any drug on the world market, review a 17,000-hypothesis study (publication forthcoming) comparing the side effects caused by different treatments for depression, and study how three chronic diseases are actually treated around the world.

How we teach data science in biomedical informatics has also evolved. Take as an example Columbia University’s Department of Biomedical Informatics training program, which has been funded by the National Library of Medicine for about three decades. It initially focused on clinical information systems under its founding chair, Paul Clayton, and while researchers individually worked on what today would be called data science, the curriculum concentrated on techniques related to clinical information systems. For the first decade, our data science methods were largely pulled in from computer science and statistics courses, with the department focusing on the application of those techniques. During that time, I filled a gap in my own data science knowledge by obtaining a master’s degree in biostatistics.

In the second decade, as presented well by Ted Shortliffe and Stephen Johnson in the 2002 IMIA Yearbook of Medical Informatics, the department shifted to take on a greater responsibility for teaching its own methods, including data science. Our core courses focused on data representation, information systems, formal models, information presentation, decision making, evaluation, and specialization in application tracks. The Methods in Medical Informatics course focused mainly on how to represent knowledge (using Sowa’s 1999 Knowledge Representation textbook), but it also included numeric data science components like Bayesian inference, Markov models, and machine learning algorithms, with the choice between symbolic and statistical approaches to solving problems as a recurring theme. We also relied on computer science and statistics faculty to teach data management, software engineering, and basic statistics.

In the most recent decade, the department expanded its internal focus on data science and made it more explicit, with the content from the original methods course split among three courses: computational methods, symbolic methods, and research methods. The computational methods course covered the numerical methods commonly associated with data science, and the symbolic methods course included the representational structures that support the data.

This expansion into data science continued four years ago when Noemie Elhadad created a data science track  (with supplemental funding from the National Library of Medicine) that encouraged interested students to dive more deeply into data science through additional departmental and external courses. At present, all students get a foundation in data science through the computational methods class and required seminars, and those with additional interest can engage as deeply as any computer science or statistics trainee.

We encourage our students not just to apply data science methods but to develop new methods, including supplying the theoretical foundation for the work. While this may not be for every informatics trainee, we believe that our field must be as rigorous as the methodological fields we pull from. Examples include work on deep hierarchical families by Ranganath, Blei, and colleagues, and remaking survival analysis with Perotte and Elhadad.

To survive, a department must look forward. Our department invested heavily in data science and in electronic health record research in 2007. A decade later, what is on the horizon?

I believe informatics will come full circle, returning at least in part to its physiological modeling origins that predated our department. As we reach the limits of what our noisy and sparse data can provide for deep learning, we will learn to exploit pre-existing biomedical knowledge in different forms of mechanistic models. I believe these hybrid empirical-mechanistic methods can produce patient-specific recommendations and realize the dream of precision medicine. And we have begun to teach our trainees how to do it.

George Hripcsak, MD, MS, is Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus. He has more than 25 years of experience in biomedical informatics with a special interest in the clinical information stored in electronic health records and the development of next-generation health record systems. He is an elected member of the Institute of Medicine and an elected fellow of the American College of Medical Informatics and the New York Academy of Medicine. He has published more than 250 papers and previously chaired NLM’s Biomedical Library and Informatics Review Committee.

Reflections on the Work of the Research Data Alliance

The Research Data Alliance (RDA) is a community-driven, interdisciplinary, international organization dedicated to collaboratively building the social and technical infrastructure necessary for wide-scale data sharing and advancing open science initiatives. Just short of five years old, this group gathers twice a year at plenary meetings, the most recent just last week.

These are no big-lecture, hallway-conversation meetings. As I discovered in Berlin last week, they are working meetings, in the best sense of the phrase—where the work involves creating and validating the mechanisms and standards for data sharing. That work is done by volunteers from across disciplines—over 7,000 people engaged in small work groups, local activities, and conference-based sessions. These volunteers deliberate and construct standards for data sharing, and then establish strategies for testing and endorsing these standards and gaining community consensus and adoption—including partnering with notable standard-setting bodies such as ISO or IEEE.

Much of the work focuses on making data and data repositories FAIR—Findable, Accessible, Interoperable, and Reusable—which is something I’ve talked a lot about in this blog.

But RDA espouses a broader vision than the approach NLM has taken so far with data. Where we provide public access to full-text articles, some of which link to associated data, RDA advocates for putting all research-generated data in domain-specific, well-curated repositories.

To achieve that vision, RDA members are working to develop the following three key elements:

  • a schema to link data to articles,
  • a mechanism for citing data extracts, and
  • a way to recognize high-quality data repositories.

Right now, a single publisher may have 50 or 60 different ways of linking articles to data. That means that the estimated 25,000 publishers and 5,000 repositories that manage data have potentially millions of ways of accomplishing this task. Instituting a standardized schema to link data to articles would bring significant order and discoverability to this overwhelming diversity. That consistency would yield immediate benefits, chief among them making data findable and the links interoperable.

Efficient data citations will also be a boon to findability. RDA is working on developing dynamic data citations, which would provide persistent identifiers tying data extracts to their repositories and tracking different versions of the data. Machine-created and machine-readable, data citations would enhance rigor and reproducibility in research by ensuring the data generated in support of key findings remains accessible.
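To make this concrete, here is a minimal sketch, in Python, of the kind of information a dynamic data citation might carry: a persistent identifier for the dataset, the repository holding it, the version and subset analyzed, and the article that depends on it. The field names and identifiers below are invented for illustration and do not reflect an actual RDA specification.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical illustration of a dynamic data citation record. The field
# names are invented for this sketch; they are not an RDA standard.
@dataclass(frozen=True)
class DataCitation:
    dataset_pid: str    # persistent identifier for the dataset (e.g., a DOI)
    repository: str     # repository responsible for curating the data
    version: str        # version of the dataset that was analyzed
    subset_query: str   # query that reproduces the exact data extract
    article_doi: str    # DOI of the article reporting the findings
    accessed: date      # date the extract was retrieved

    def render(self) -> str:
        """Produce a human-readable citation string for the extract."""
        return (f"{self.dataset_pid} (version {self.version}), {self.repository}, "
                f"subset: {self.subset_query!r}, accessed {self.accessed.isoformat()}, "
                f"cited by {self.article_doi}")

if __name__ == "__main__":
    citation = DataCitation(
        dataset_pid="doi:10.0000/example-dataset",   # placeholder identifier
        repository="Example Domain Repository",
        version="2.1",
        subset_query="diagnosis = 'hypertension' AND year >= 2015",
        article_doi="doi:10.0000/example-article",
        accessed=date(2018, 3, 26),
    )
    print(citation.render())
```

Because the record carries both the version and the subset query, a machine could later resolve the citation back to exactly the data that supported the published finding.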

But linking to and tracking data won’t get us far if the data itself is untrustworthy.

To address that, RDA encourages well-curated repositories, but what exactly does that mean?

Certification provides one way of acknowledging the quality of a repository. RDA doesn’t sponsor a certification mechanism, but it recognizes several, including the CoreTrustSeal program.  (For more on data certification, see “A Primer on the Certifications of a Trusted Digital Repository,” by Dawei Lin from the NIH National Institute of Allergy and Infectious Diseases.)

But why does all this matter to NIH and to NLM specifically?

I came to the RDA meeting to explore complementary approaches to what NLM is already doing to curate and assign metadata to data. I was especially looking for guidance on how to handle new data types such as images and environmental exposures.

I got some of that, but I also learned that NLM has much to contribute to RDA’s work. Particularly given our expertise in clinical terminologies and literature languages, we add rich depth to the ways data and other resources can be characterized.

In addition, I learned that we at NLM and NIH face many of the same challenges as our global partners: efficiently managing legacy data while not constraining the future to the problems of the past; fostering the adoption of common approaches and standards when the benefit to the larger scientific community may be greater than the value to the individual investigator; coordinating a voluntary, community-led process that has mission-critical consequences; and creating a permanent home and support organization for the wide range of standards actually needed for data-driven discovery.

Finally, I learned that people participate in the work of RDA because it both draws on their expertise and advances their own scholarly efforts. In other words, it’s mutually beneficial. But after my time with the group last week, I suspect we all get more than we give. For NLM anyway—as we begin to implement our new strategic plan—RDA’s goal of creating a global data ecosystem of best practices, standards, and interoperable data infrastructures is encouraging and something to look forward to.

Models: The Third Leg in Data-Driven Discovery

Considering a library of models

George Box, a famous statistician, once remarked, “All models are wrong, but some are useful.”

As representations or approximations of real-world phenomena, models, when done well, can be very useful. In fact, they serve as the third leg of the stool that is data-driven discovery, joining the published literature and its underlying data to give investigators the materials necessary to explore important dynamics in health and biomedicine.

By isolating and replicating key aspects within complex phenomena, models help us better understand what’s going on and how the pieces or processes fit together.

Because of the complexity within biomedicine, health care research must employ different kinds of models, depending on what’s being studied.

Regardless of the type used, however, models take time to build, because the model builder must first understand which elements of the phenomenon need to be represented. Only then can she select the appropriate modeling tools and build the model.

Tracking and storing models can help with that.

Not only would tracking models enable re-use—saving valuable time and money—but doing so would enhance the rigor and reproducibility of the research itself by giving scientists the ability to see and test the methodology behind the data.

Enter libraries.

As we’ve done for the literature, libraries can help document and preserve models and make them discoverable.

The first step in that is identifying and collecting useful models.

Second, we’d have to apply metadata to describe the models. Among the essential elements to include in such descriptions might be model type, purpose, key underlying assumptions, referent scale, and indicators of how and when the model was used.

[Screen capture: the DOI and RRIDs highlighted in a current PubMed record.]

We’d then need to apply one or more unique identifiers to help with curation. Currently, two different schemas provide principled ways to identify models: the Digital Object Identifier (DOI) and the Research Resource Identifier (RRID). The former provides a persistent, unique code to track an item or entity at an overarching level (e.g., an article or book). The latter documents the main resources used to produce the scientific findings in that article or book (e.g., antibodies, model organisms, computational models).

Just as clicking on an author’s name in PubMed can bring up all the articles he or she has written, these interoperable identifiers, once assigned to research models, make it possible to connect the studies employing those models.  Effectively, these identifiers can tie together the three components that underpin data-driven discovery—the literature, the supporting data, and the analytical tools—thus enhancing discoverability and streamlining scientific communication.
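As a rough sketch of how such a record might look, the Python snippet below pairs the metadata elements suggested above with a DOI and an RRID and shows how a library of models could gather the studies that use a given model. Every identifier and field name here is a hypothetical placeholder, not an existing catalog entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of an entry in a "library of models". The fields mirror
# the metadata elements suggested above; the identifiers are placeholders,
# not real DOIs or RRIDs.
@dataclass
class ModelRecord:
    rrid: str               # Research Resource Identifier assigned to the model
    doi: str                # DOI of the model's primary description
    model_type: str         # e.g., "Markov", "agent-based", "deep learning"
    purpose: str            # what the model is meant to represent
    assumptions: List[str]  # key underlying assumptions
    referent_scale: str     # e.g., "molecular", "patient", "population"
    used_in: List[str] = field(default_factory=list)  # DOIs of studies using it

def studies_using(library: Dict[str, ModelRecord], rrid: str) -> List[str]:
    """Return the article DOIs linked to a model, much as clicking an author
    name in PubMed gathers that author's papers."""
    record = library.get(rrid)
    return record.used_in if record else []

if __name__ == "__main__":
    library = {
        "RRID:EXAMPLE_000001": ModelRecord(
            rrid="RRID:EXAMPLE_000001",
            doi="doi:10.0000/example-model",
            model_type="Markov",
            purpose="disease progression in chronic kidney disease",
            assumptions=["time-homogeneous transition probabilities"],
            referent_scale="patient",
            used_in=["doi:10.0000/study-a", "doi:10.0000/study-b"],
        )
    }
    print(studies_using(library, "RRID:EXAMPLE_000001"))
```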

NLM’s long-standing role in collecting, organizing, and making available the biomedical literature positions us well to take on the task of tracking research models, but is that something we should do?

If so, what might that library of models look like? What else should it include? And how useful would this library of models be to you?

Photo credit (stool, top): Doug Belshaw [Flickr (CC BY 2.0) | erased text from original]

The Rise of Computational Linguistics Geeks

Guest post by Dina Demner-Fushman, MD, PhD, staff scientist at NLM.

“So, what do you do for a living?”

It’s a natural dinner party question, but my answer can prompt that glazed-over look we all dread.

I am a computational linguist, also known (arguably) as a specialist in natural language processing (NLP), and I work at the National Library of Medicine.

If I strike the right tone of excitement and intrigue, I might buy myself a few minutes to explain.

My work combines computer science and linguistics, and since I focus on biomedical and clinical texts, it also requires adding some biological, medical, and clinical know-how to the mix.

I work specifically in biomedical natural language processing (BioNLP). The definition of BioNLP has varied over the years, with the spotlight shifting from one task to another—from text mining to literature-based discovery to pharmacovigilance, for example—but the core purpose has remained essentially unchanged: training computers to automatically understand natural language to speed discovery, whether in service of research, patient care, or public health.

The field has been around for a while. In 1969 NIH researchers Pratt and Pacak described the early hope for what we now call BioNLP in the paper, “Automated processing of medical English,” which they presented at a computational linguistics conference:

The development of a methodology for machine encoding of diagnostic statements into a file, and the capability to retrieve information meaningfully from [a] data file with a high degree of accuracy and completeness, is the first phase towards the objective of processing general medical text.

NLM became involved in the field shortly thereafter, first with the Unified Medical Language System (UMLS) and later with tools to support text processing, such as MetaMap and TextTool, all of which we’ve improved and refined over the years. The more recent Indexing Initiative combines these tools with other machine learning methods to automatically apply MeSH terms to PubMed journal articles. (A human checks the computer’s work, revising as needed.)
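For readers curious what automated indexing looks like in its simplest form, here is a deliberately naive Python sketch that suggests index terms by keyword matching and leaves the final call to a human reviewer. It is not how MetaMap or the Indexing Initiative work internally, and the trigger phrases are a tiny hand-picked sample chosen only for illustration.

```python
from typing import List

# Toy illustration only: suggest MeSH-style index terms for an abstract by
# simple keyword matching. Real indexing combines linguistic analysis and
# machine learning; this trigger-phrase table is a tiny invented sample.
CANDIDATE_TERMS = {
    "Neural Networks, Computer": ["neural network", "deep learning"],
    "Electronic Health Records": ["electronic health record", "ehr"],
    "Drug-Related Side Effects and Adverse Reactions": ["adverse reaction", "side effect"],
}

def suggest_terms(abstract: str) -> List[str]:
    """Return candidate index terms whose trigger phrases appear in the text."""
    text = abstract.lower()
    return [term for term, triggers in CANDIDATE_TERMS.items()
            if any(trigger in text for trigger in triggers)]

if __name__ == "__main__":
    abstract = ("We trained a deep learning model on electronic health record "
                "data to flag likely adverse reactions to common drugs.")
    suggestions = suggest_terms(abstract)
    # In practice, a human indexer reviews and revises these suggestions.
    print(suggestions)
```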

These and NLM’s other NLP developments help improve the Library’s services, but they are also freely shared with the world, broadening our impact and, more importantly, helping to handle the global proliferation of scientific and clinical text.

It’s that last piece that makes NLP so hot right now.

NLP, we’re finding, can take in large numbers of documents and locate relevant content, summarize text, apply appropriate descriptors, and even answer questions.

It’s every librarian’s—and every geek’s—dream.

But how can we use it?

Imagine, for example, the ever-expanding volume of health information around patients’ adverse reactions to medications. At least four different—and prolific—content streams feed into that pool of information:

  • the reactions reported in the literature, frequently in pre-market research (e.g., in the results of clinical trials);
  • the labeled reactions, i.e., the reactions described in the official drug labels provided by manufacturers;
  • the reactions noted in electronic health records and clinical progress notes; and
  • the reactions described by patients in social media.

NLM’s work in NLP—and its funding of extramural research in NLP—is helping develop approaches and resources for extracting and synthesizing adverse drug reactions from all four streams, giving a more complete picture of how people across the spectrum are responding to medications.

It’s a challenging task. Researchers must address different vocabularies and language structures to extract the information, but NLP, and my fellow computational linguists, will, I predict, prove up to it.
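As a toy illustration of the extraction step, the Python sketch below pulls drug-reaction pairs out of free text by simple sentence-level co-occurrence. Real systems lean on curated vocabularies such as the UMLS and on statistical models tuned to each content stream; the drug and reaction lists here are small invented samples.

```python
import re
from typing import List, Tuple

# Toy illustration only: a crude co-occurrence extractor for drug/reaction
# pairs in free text. The drug and reaction lists are invented samples, not
# a real vocabulary, and production BioNLP pipelines are far more elaborate.
DRUGS = ("sertraline", "lisinopril", "metformin")
REACTIONS = ("nausea", "dizziness", "rash", "headache")

def extract_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (drug, reaction) pairs that co-occur within a sentence."""
    pairs = []
    for sentence in re.split(r"[.!?]", text.lower()):
        found_drugs = [d for d in DRUGS if d in sentence]
        found_reactions = [r for r in REACTIONS if r in sentence]
        pairs.extend((d, r) for d in found_drugs for r in found_reactions)
    return pairs

if __name__ == "__main__":
    clinical_note = ("Patient reports nausea and intermittent dizziness "
                     "since starting sertraline three weeks ago.")
    print(extract_pairs(clinical_note))
```

Each of the four content streams would need its own version of this step, since a drug label, a progress note, and a tweet describe the same reaction in very different language.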

Now imagine parents seeking health information regarding their sick child.

NLP can answer their question, first by understanding key elements in the incoming question and then by providing a response, either by drawing upon a database of known answers (e.g., FAQs maintained by the NIH institutes) or by summarizing relevant PubMed or MedlinePlus articles. Such quick access to accurate and trustworthy health information has the potential to save time and to save lives.
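Here is a deliberately simplified Python sketch of that first step: matching an incoming question against a small store of known answers by word overlap. The questions and answers are invented placeholders, and production systems use far richer language understanding and curated sources.

```python
from typing import Dict

# Toy illustration only: match a consumer-health question to a small store of
# known answers by counting shared words. The entries below are invented
# placeholders, not content from an actual FAQ.
FAQ: Dict[str, str] = {
    "what are the symptoms of flu in children":
        "Typical flu symptoms in children include fever, cough, and fatigue.",
    "how is strep throat treated":
        "Strep throat is usually treated with antibiotics prescribed by a clinician.",
}

def best_answer(question: str) -> str:
    """Return the stored answer whose question shares the most words with the query."""
    asked = set(question.lower().split())
    score, answer = max((len(asked & set(q.split())), a) for q, a in FAQ.items())
    return answer if score > 0 else "No stored answer matches this question."

if __name__ == "__main__":
    print(best_answer("My child has a fever. What are flu symptoms in kids?"))
```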

We’re not fully there yet, but as our research continues, we get closer.

Maybe it’s time I reconsider how I answer that perennial dinner party question: “I’m a computational linguist, and I help improve health.”

Dina Demner-Fushman, MD, PhD, is a staff scientist in NLM’s Lister Hill National Center for Biomedical Communications. She leads research in information retrieval and natural language processing focused on clinical decision-making, answering clinical and consumer health questions, and extracting information from clinical text.