Dr. Isaac Kohane: Making Our Data Work for Us!

Last weekend, Isaac Kohane, MD, PhD, FACMI, Marion V. Nelson Professor of Biomedical Informatics and Chair of the Department of Biomedical Informatics at Harvard Medical School, received the 2020 Morris F. Collen Award of Excellence at the AMIA 2020 Virtual Annual Symposium. This award – the highest honor in informatics – is bestowed upon an individual whose personal commitment and dedication to medical informatics have made a lasting impression on the field.

Throughout his career, Dr. Kohane has worked to extract meaning from large sets of clinical and genomic data to improve health care. His efforts mining medical data have contributed to the identification of harmful side effects associated with drug therapy, recognition of early warning signs of domestic abuse, and detection of variations and patterns among people with conditions such as autism.

As the lead investigator of the i2b2 (Informatics for Integrating Biology & the Bedside) project, a National Institutes of Health-funded National Center for Biomedical Computing initiative, Dr. Kohane’s work has led to the creation of a comprehensive software and methodological framework to enable clinical researchers to accelerate the translation of genomic and “traditional” clinical findings into novel diagnostics, prognostics, and therapeutics.

Dr. Kohane is a visionary with a motto: Make Our Data Work for Us! Please join me in congratulating Dr. Kohane, recipient of the 2020 Morris F. Collen Award of Excellence.

Hear more from Dr. Kohane in this video.

Video transcript (below)

The vision that has driven my research agenda is that we were not doing our patients any favors by not embracing information technology to accelerate our ability both to discover new findings in medicine and to improve the way we deliver medicine.

What does “make our data work for us” mean? It means let’s not use it only for the reason most of it is accumulated at present, which is to satisfy administrative or reimbursement processes. Let’s use it to improve health care.

Using just our claims data, we can actually predict – better than genetic tests – recurrence rates for autism. It’s the ability to show, with these same data, that generic forms of drugs used to prevent premature birth are just as effective as the brand-name versions that cost 40 times as much. It’s, as we’ve seen most recently, the ability to pull together data around pandemics within weeks, if, and only if, we understand the data that’s spun off by our health care systems in the course of care.

And finally, as exemplified by work on FHIR, which was funded by the Office of the National Coordinator and then the National Library of Medicine, the ability to flow the data directly to the patient – to finally give patients access to their data in a computable format and enable decision support for the patient without going through the long loop of the health care system.

Because the NIH and NLM have invested in working on real-world sized experiments in biomedical informatics, in supporting the education of the individuals who drive those projects, and in supporting the public standards that are necessary for these projects to work and to scale, they’ve established an ecosystem that is now able to deliver true value to decision makers, to clinicians, and now to patients, as we’re seeing with SMART on FHIR implementations on smartphones.
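To make “data in a computable format” concrete, here is a minimal sketch of what an app retrieving a patient’s record via SMART on FHIR works with: a FHIR Patient resource as JSON. The sample resource and the helper function below are illustrative assumptions, not drawn from any particular implementation.

```python
# Minimal sketch: extracting fields from a FHIR R4 Patient resource,
# the kind of JSON a SMART on FHIR app receives. Sample data is
# illustrative only.

sample_patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "1980-04-02",
}

def display_name(patient: dict) -> str:
    """Build a human-readable name from the first HumanName entry."""
    name = patient.get("name", [{}])[0]
    given = " ".join(name.get("given", []))
    return f"{given} {name.get('family', '')}".strip()

print(display_name(sample_patient))   # → Jane Doe
print(sample_patient["birthDate"])    # → 1980-04-02
```

Because the resource is structured and coded rather than free text, a patient-facing app can compute on it directly, which is the “decision support without the long loop” idea in the transcript.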

So, for those of you — the biomedical informaticians of the future who are clinicians — I strongly recommend that you don’t wait for someone else to fix the system. You have the most powerful tools to affect medicine, information processing tools. So, don’t wait to get old. Don’t wait to be recognized. You have the tools. Get in there, help change medicine. We all depend on you!

Making Connections and Enabling Discoverability – Celebrating 30 Years of UMLS

Guest post by NLM staff: David Anderson, UMLS Production Coordinator; Liz Amos, Special Assistant to the Chief Health Data Standards Officer; Anna Ripple, Information Research Specialist; and Patrick McLaughlin, Head, Terminology QA & User Services Unit.

Shortly after Donald A.B. Lindberg, MD was sworn in as NLM Director in 1984, he asked “What is NLM, as a government agency, uniquely positioned to do?” Through conversations with experts, Dr. Lindberg identified a looming question in the field of bioinformatics — How can machines act as if they understand biomedical meaning? At the time, the information necessary to answer this question was distributed across a variety of resources. Very few publicly available tools for processing biomedical text had been developed. NLM had experience with terminology development and maintenance (MeSH – Medical Subject Headings), coordinating distributed systems (DOCLINE), and distributing and providing access to large datasets (MEDLINE) in an era when this was a challenge.

As a national library, NLM was deeply interested in providing good answers to biomedical questions. For these reasons, NLM was uniquely positioned to develop a system — the Unified Medical Language System (UMLS) — that could lay the groundwork for machines to act as if they understand biomedical meaning. This year marks the 30th anniversary of the release of the first edition of the UMLS in November 1990.

Achieving the Unified Medical Language System

The result of a large-scale, NLM-led research and development project, the UMLS began with the audacious goal of helping computer systems behave as if they understand the meaning of the language of biomedicine and health. The UMLS was expected to facilitate the development of systems that could retrieve, integrate, and aggregate conceptually related information from disparate electronic sources such as literature databases, clinical records, and databanks despite differences in the vocabularies and coding systems used within them, and in the terminology employed by users.

Betsy Humphreys (left) and Dr. Lindberg (right) tout the release of the Unified Medical Language System in 1990.

Under the direction of Dr. Donald Lindberg; Betsy Humphreys, then Deputy Associate Director for Library Operations; and a multidisciplinary, international team from academia and the private sector, the UMLS evolved into an essential tool for enabling interoperability, natural language processing, information retrieval, machine learning, and other data science use cases.

UMLS Knowledge Sources

Central to the UMLS model is the grouping of synonymous names into UMLS concepts and the assignment of broad categories (semantic types) to all those concepts. Since its first release in 1990, NLM has continued to expand and update the UMLS Knowledge Sources based on feedback from testing and use.

The UMLS Metathesaurus was the first biomedical terminology resource organized by concept, and its development had a significant impact on subsequent medical informatics theory and practice. The broad terminology coverage, synonymy, and semantic categorization in the UMLS, in combination with its lexical tools, enable its primary use cases:

  • identifying meaning in text,
  • mapping between vocabularies, and
  • improving information retrieval.
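A toy sketch can illustrate the concept model described above: synonymous names from different source vocabularies group under one concept identifier (CUI) carrying a semantic type, which supports all three use cases. The CUI and the exact groupings below are fabricated for illustration; this is not the Metathesaurus data format.

```python
# Toy model of UMLS-style concept grouping: synonymous names from
# different vocabularies map to one concept (CUI) with a semantic
# type. The CUI below is fabricated for illustration.

concepts = {
    "C0000001": {
        "semantic_type": "Disease or Syndrome",
        "names": {
            ("SNOMED CT", "22298006"): "Myocardial infarction",
            ("MeSH", "D009203"): "Myocardial Infarction",
            ("Consumer", "heart attack"): "Heart attack",
        },
    }
}

# Index every surface name back to its CUI -> "identifying meaning in text"
name_to_cui = {
    name.lower(): cui
    for cui, c in concepts.items()
    for name in c["names"].values()
}

def map_code(cui: str, target_vocab: str):
    """"Mapping between vocabularies": find a concept's code in a target vocabulary."""
    for (vocab, code) in concepts[cui]["names"]:
        if vocab == target_vocab:
            return code
    return None

cui = name_to_cui["heart attack"]
print(cui, map_code(cui, "MeSH"))   # → C0000001 D009203
```

Query expansion for information retrieval falls out of the same structure: any one name retrieves the concept, and the concept yields every synonym to search on.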

The growth in UMLS use over the past decade reflects broad developments in health policy, including the designation of SNOMED CT, LOINC, and RxNorm (three component vocabularies included in the UMLS Metathesaurus) as U.S. national standards for clinical data in quality improvement and payment programs such as CMS’s Promoting Interoperability Programs (previously known as Meaningful Use). Many UMLS source vocabularies are also referenced in the United States Core Data for Interoperability (USCDI). Researchers continue to rely on the UMLS as a knowledge base for natural language processing and data mining. The UMLS community of users has developed several tools that enhance and expand the capabilities of the UMLS.

Celebrating 30 Years

Thirty years after the initial release of the UMLS Knowledge Sources, the UMLS resources continue to be of benefit to millions of people worldwide. The UMLS is used in NLM flagship applications such as PubMed and ClinicalTrials.gov. Additionally, some researchers and system developers use the UMLS to build or enhance electronic resources, clinical data warehouses, components of electronic health record systems, natural language processing pipelines, and test collections. UMLS resources are being used primarily as intended, to facilitate the interpretation of biomedical meaning in disparate electronic information and data in many different computer systems serving scientists, health professionals, and the public.

The Journal of the American Medical Informatics Association is commemorating the 30th UMLS anniversary with a special focus issue dedicated to the memory of Dr. Lindberg (1933–2019) that also includes information on current research and applications, broader impacts, and future directions of the UMLS.

Upon her retirement from NLM in 2017, Betsy Humphreys remarked that “systems that get used, get better.” As the UMLS enters its fourth decade, a review of UMLS production methods and priorities is underway, guided by the same ambitious goals with which it started: trailblazing into the future to improve biomedical information storage, processing, and retrieval.

As we reflect on this important milestone, we want to thank stakeholders, like you, who have provided feedback over the years to help us make the UMLS leaner, stronger, and more useful.

Top row: David Anderson, UMLS Production Coordinator and Liz Amos, Special Assistant to the Chief Health Data Standards Officer

Bottom Row: Anna Ripple, Information Research Specialist and Patrick McLaughlin, Head, Terminology QA & User Services Unit

Fostering a Culture of Scientific Data Stewardship

Guest post by Jerry Sheehan, Deputy Director, National Library of Medicine.

Making research data broadly findable, accessible, interoperable, and reusable is essential to advancing science and accelerating its translation into knowledge and innovation. The global response to COVID-19 highlights the importance and benefits of sharing research data more openly.

The National Institutes of Health (NIH) has long championed policies that make the results of research available to the public. Last week, NIH released the NIH Policy for Data Management and Sharing (DMS Policy) to promote the management and sharing of scientific data generated from NIH-funded or conducted research. This policy replaces the 2003 NIH Data Sharing Policy.

The DMS policy was informed by public feedback and requires NIH-funded researchers to plan for the management and sharing of scientific data. It also makes clear that data sharing is a fundamental part of the research process.

Data sharing benefits the scientific community and the public.

For the scientific community, data sharing enables researchers to validate scientific results, increasing transparency and accountability. Data sharing also strengthens collaborations that allow for richer analyses. Strong data-sharing practices facilitate the reuse of hard-to-generate data, such as those acquired during complex experiments or once-in-a-lifetime events like natural disasters or pandemics.

For the public, sound data-sharing practices demonstrate good stewardship of taxpayer funds. Clear, well-written data sharing and management plans promote transparency and accountability to society. They also expand opportunities for data to be accessed and reused by clinicians, students, educators, and innovators in health care and other sectors of the economy.

As an organization dedicated to improving access to data and information to advance biomedical sciences and public health, NLM plays a key role in implementing the new policy and supporting researchers in meeting its requirements. NLM maintains a number of data repositories, such as the Sequence Read Archive and ClinicalTrials.gov, that curate, preserve, and provide access to research data. NLM also maintains a longer list of NIH-supported data repositories that accept different types of data (e.g., genomic, imaging) from different research domains (e.g., cancer, neuroscience, behavioral sciences). Where appropriate domain-specific repositories do not exist, NLM has made clear how researchers can include small datasets (<2GB) with articles deposited in NLM’s PubMed Central (PMC) under the NIH Public Access Policy.

NLM also works with the broader library community to support improved data management and sharing. Supplemental information issued with the new policy makes it clear that research budgets can include costs of data management and sharing, such as those for data curation, formatting data to accepted standards, attaching metadata to foster discoverability, and preparing data for storage in a repository. These are the kinds of services increasingly provided by libraries and librarians in universities and academic medical centers across the country. NLM, through the Network of the National Library of Medicine, offers training in data management and data literacy to health science, public, and other librarians to expand capacity for these important services.

NIH’s DMS Policy applies to all research, funded or conducted in whole or in part by NIH, that results in the generation of scientific data. This includes research funded or conducted by extramural grants, contracts, intramural research projects, or other funding agreements. The DMS Policy does not apply to research and other activities that do not generate scientific data, including training, infrastructure development, and non-research activities.

NIH will continue to engage the research community to support the change and implementation of this new policy, which will go into effect in January 2023. NLM will continue to work within NIH and across the library and information science communities to develop innovative ways to support the policy and advance the effective stewardship of research data. Let us know how else we can support this important policy advance.

Read more about this major policy release in the NIH’s Under the Poliscope blog.

As NLM Deputy Director, Jerry Sheehan shares responsibility with the Director for overall program development, program evaluation, policy formulation, direction and coordination of all Library activities. He has made major contributions to the development and implementation of NIH, HHS, and U.S. government-wide policy related to open science, public access to government-funded information, clinical trials registration, and electronic health records.

Introducing the NIH Guide Notice Encouraging Researchers to Adopt U.S. Core Data for Interoperability Standard

Recently, NIH issued a guide notice (NOT-OD-20-146) encouraging NIH-supported clinical programs and researchers to adopt and use the standardized set of healthcare data classes, data elements, and associated vocabulary standards in the U.S. Core Data for Interoperability (USCDI) standard. This standard will make it easier to exchange health information for research and clinical care, and is required under the Office of the National Coordinator for Health Information Technology (ONC) Cures Act Final Rule to support seamless and secure access, exchange, and use of electronic health information.

USCDI standardizes the health data classes and data elements that make sharing health information across the country interoperable, expands on the data long required to be supported by certified EHRs, and incorporates established health data standards.

NLM is proud to support USCDI through continued efforts to establish and maintain clinical terminology standards within the Department of Health and Human Services.

Standardized health data classes and elements enable collaboration, make it easier to aggregate research data, and enhance the discoverability of groundbreaking research. USCDI adoption will allow care delivery and research organizations to use the same coding systems for key data elements that are part of the USCDI data classes.
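As a hedged illustration of “using the same coding systems for key data elements,” here is a sketch of a laboratory result in the shape USCDI-aligned exchange commonly takes: a FHIR Observation coded with LOINC. The numeric values are made up; 718-7 is the LOINC code for hemoglobin mass/volume in blood.

```python
# Illustrative sketch: a lab result coded with LOINC, the kind of
# standardized data element USCDI specifies. Values are made up.

observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "718-7",
            "display": "Hemoglobin [Mass/volume] in Blood",
        }]
    },
    "valueQuantity": {"value": 13.5, "unit": "g/dL"},
}

def primary_code(obs: dict) -> str:
    """Return the first coding's system|code pair, a common aggregation key."""
    c = obs["code"]["coding"][0]
    return f'{c["system"]}|{c["code"]}'

print(primary_code(observation))   # → http://loinc.org|718-7
```

Because every site emits the same system|code pair for the same measurement, research datasets drawn from different organizations can be aggregated on that key without per-site translation.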

I encourage you to read more about the new guide notice in a joint post developed in collaboration with my NIH and ONC colleagues titled: “Leveraging Standardized Clinical Data to Advance Discovery.” And I ask you to consider, what could this notice mean for you? 

Some Insights on the Roles and Uses of Generalist Repositories

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, NIH

Data repositories are a useful way for researchers to both share data and make their data more findable, accessible, interoperable, and reusable (that is, aligned with the FAIR Data Principles).

Generalist repositories can house a vast array of data. This kind of repository does not restrict data by type, format, content, or topic. NIH has been exploring the roles and uses of generalist repositories in our data repository landscape through three activities, which I describe below, garnering valuable insights over the last year.

A pilot project with a generalist repository

NIH Figshare archive

Last September, I introduced Musings readers to the one-year Figshare pilot project, which was recently completed. Information about the NIH Figshare instance — and the outcomes of the project — is available on the Office of Data Science Strategy’s website. This project gave us an opportunity to uncover how NIH-funded researchers might utilize a generalist repository’s existing features. It also allowed us to test some specific options, such as a direct link to grant information, expert guidance, and metadata improvements.

There are three key takeaways from the project:

  • Generalist repositories are growing. More researchers are depositing data in, and more publications are linking to, generalist repositories.
  • Researchers need more education and guidance on where to publish data and how to effectively describe datasets using detailed metadata.
  • Better metadata enables greater discoverability. Expert metadata review proved to be one of the most impactful and unique features of the pilot instance, as measured by two key metrics: compared with data uploaded to the main Figshare repository by NIH-funded investigators, files in the NIH Figshare instance had more descriptive titles (roughly twice as long) and metadata descriptions more than three times longer.

The NIH Figshare instance is now an archive, but the data are still discoverable and reusable. Although this specific pilot has concluded, we encourage NIH-funded researchers to use a generalist repository that meets the White House Office of Science and Technology Policy criteria when a domain-specific or institutional repository is not available.

A community workshop on the role of generalist repositories

In February, the Office of Data Science Strategy hosted the NIH Workshop on the Role of Generalist and Institutional Repositories to Enhance Data Discoverability and Reuse, bringing together representatives of generalist and institutional repositories for a day and a half of rich discussion. The conversations centered around the concept of “coopetition,” the importance of people in the broader data ecosystem, and the importance of code. A full workshop summary is available, and our co-chairs and the workshop’s participating generalist repositories recently published a generalist repository comparison chart as one of the outcomes of this event.

We plan to keep engaging with this community to better enable coopetition among repositories while working collaboratively with repositories to ensure that researchers can share data effectively.

An independent assessment of the generalist repository landscape

We completed an independent assessment to understand the generalist repository landscape, discover where we were in tune with the community, and identify our blind spots. Key findings include the following:

  • There is a clear need for the services that generalist repositories provide.
  • Many researchers currently view generalist repository platforms as a place to deposit their own data, rather than a place to find and reuse other people’s data.
  • Repositories and researchers alike are looking to NIH to define its data sharing requirements, so each group knows what is expected of them.
  • The current lack of recognition and rewards for data sharing helps reinforce the focus on publications as the key metric of scientific output and therefore may be a disincentive to data sharing.

The pilot, workshop, and assessment provided us with a deeper understanding of the repository landscape.

We are committed to advancing progress in this important area of the data ecosystem of which we are all a part. We are currently developing ways to continue fostering coopetition among generalist repositories; strategies for increasing engagement with researchers, institutional repositories, and data librarians; and opportunities to better educate the biomedical research community on the value of effective data management and sharing.

The Office of Data Science Strategy will announce specific next steps in the near future. In the meantime, we invite you to share your ideas with us at datascience@nih.gov.

Dr. Gregurick leads the implementation of the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that make up NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.