How NIH Is Using Artificial Intelligence To Improve Operations

Artificial intelligence (AI) is everywhere, from the online marketplace to the laboratory! When you read an article or shop online, the experience is probably supported by AI. And scientists are applying AI methods to find indications of disease, to design experiments, and to make discovery processes more efficient.

The National Institutes of Health (NIH) has been using AI to improve science and health, too, but it’s also using AI in other ways.

Earlier this fall, the White House Office of Science and Technology Policy hosted a summit to highlight ways that the Federal Government uses AI to achieve its mission and improve services to the American people. I was proud to represent NIH and provide examples of how AI is being used to make NIH more effective and efficient in its work.

For example, each year NIH faces the challenge of assigning the more than 80,000 grant applications it receives to the proper review group.

Here’s how the process works now: Applications that address specific funding opportunity announcements are assigned directly by division directors. Then the Integrated Review Groups (clusters of study sections organized around general scientific areas) route the applications to the correct division or scientific branch. Applications without an identified liaison are handled by a triage officer. This process takes several weeks and may pass an application through multiple rounds of staff review.

Staff at NIH’s National Institute of General Medical Sciences (NIGMS) creatively addressed this challenge by developing and deploying natural language processing and machine learning to automate the process for their Institute. This approach uses a machine learning algorithm, trained on historical data, to find a relationship between the text (title, abstract, and specific aims) and the scientific research area of an application. The trained algorithm can then determine the most likely scientific area of a new application and automatically assign it a program officer who is a subject matter expert in that area.
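
To make the approach concrete, here is a minimal sketch of this kind of text classifier, written with scikit-learn. The training examples, labels, and confidence handling are illustrative assumptions, not NIGMS’s actual pipeline or data.

```python
# Minimal sketch of referral by text classification (illustrative only,
# not NIGMS's actual pipeline): learn a mapping from application text
# (title + abstract + specific aims) to a scientific research area.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Historical applications, each labeled with the area it was referred to.
texts = [
    "Structural basis of ribosome assembly in yeast ...",
    "Machine learning models of enzyme catalysis ...",
]
areas = ["biophysics", "computational biology"]

model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression())
model.fit(texts, areas)

# Predict the most likely area for a new application, with a confidence
# score that a real system could use to flag cases for human review.
new_app = "Single-molecule imaging of protein folding dynamics ..."
print(model.predict([new_app])[0])
print(model.predict_proba([new_app]).max())
```

A production system would train on many thousands of historical applications and could route low-confidence predictions back to a human referral officer.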

The new process works impressively well, with 92% of applications referred to the correct scientific division and 84% assigned to the correct program officer, matching the accuracy rate routinely achieved by manual referrals. This change has resulted in substantial time savings, reducing the process from two to three weeks to less than one day. The new approach ensures the efficient and consistent referral of grant applications and liberates program officers from the labor-intensive and monotonous manual referral process, allowing them to focus on higher-value work. It even allows for related institutional knowledge to be retained after staff departures. NIGMS is currently working with the NIH electronic Research Administration (eRA) to incorporate the process into the enterprise database for NIH-wide use.

Now for a second example that’s more pertinent to NLM.

Our PubMed repository receives over 1.2 million new citations each year, and over 2.3 million people conduct about 2.5 million searches using PubMed every day. An average query returns hundreds to thousands of results, presented in reverse chronological order by the date each record was added. Yet our internal process monitoring shows that 80% of people using PubMed do not go beyond the first page of results, a behavior also seen in general web searches. This means that even if a more relevant citation is on page 4 or page 18, the user may never know.

Zhiyong Lu, PhD, and his team from NLM’s National Center for Biotechnology Information applied machine learning strategies to improve the way PubMed presents search results. Their goals were twofold: to increase the effectiveness of PubMed searches by helping users efficiently find the most relevant and highest-quality information, and to improve usability and the user experience by focusing on users’ literature search behaviors and needs. Their approach is called the Best Match algorithm, and the technical details can be found in a paper by Fiorini N, Canese K, Starchenko G, et al. (PLoS Biol. 2018).

The Best Match algorithm works like this: In preparation for querying, every article in the PubMed repository is tagged with key information and metadata, including the publication date and an indicator of how often the article has been returned and accessed by previous searches; these signals feed a machine learning approach called Learning-to-Rank (L2R). When a user enters a query phrase in the search box on the PubMed website, the phrase is translated into PubMed query syntax and the search is launched. In a traditional search, the results are selected by keyword matching and presented in reverse chronological order. With Best Match, the top 500 results, retrieved by a classic term-weighting algorithm, are re-sorted by the L2R model according to dozens of features, including an article’s past usage, publication date, relevance score, and article type. At the top of the page, the search results are clearly marked as sorted by “Best Match.”
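
In outline, the two-stage ranking might be sketched as follows. This is a simplified illustration of the idea, not PubMed’s production model: the features and weights below are invented, whereas the real L2R model learns its weights from aggregate user behavior.

```python
# Simplified two-stage ranking: a classic term-weighting retrieval pass,
# then a feature-based re-sort of the top 500 candidates. The features and
# hand-set weights are invented for illustration; PubMed's L2R model
# learns its weights from aggregate user behavior.
from dataclasses import dataclass

@dataclass
class Candidate:
    pmid: str
    term_score: float   # relevance score from first-pass term weighting
    year: int           # publication date feature
    past_usage: int     # how often prior searches returned/accessed it

def l2r_score(c: Candidate) -> float:
    recency = max(0, 10 - (2020 - c.year))  # newer articles score higher
    return 0.6 * c.term_score + 0.3 * c.past_usage + 0.1 * recency

def best_match(candidates: list[Candidate], k: int = 500) -> list[Candidate]:
    # Stage 1: keep the top-k results by the term-weighting score.
    top = sorted(candidates, key=lambda c: c.term_score, reverse=True)[:k]
    # Stage 2: re-sort those candidates with the ranking function.
    return sorted(top, key=l2r_score, reverse=True)
```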

Figure: Preparing, matching, ranking, and refining articles in NLM’s PubMed (illustration by Donald Bliss of NLM). Articles are indexed before any user searches; (1) queries are translated into PubMed syntax; (2) initial matching hits are presented in reverse chronological order; (3) results are re-sorted according to the L2R algorithm to present the Best Match; and (4) the L2R algorithm is updated based on users’ top choices.

This new approach will become the core of a new implementation of PubMed, due out by the spring of 2020.

In addition to the examples I described above, NIH is exploring other ways to use AI. For example, AI can help determine whether the themes of research projects align with the stated priorities of a specific Institute, and it can provide a powerful tool for accelerating government business practices. Because these approaches are so new, NIH takes multiple steps to validate their results and ensure that unanticipated problems do not arise.

In future posts, I look forward to sharing more about how NIH is improving operations through innovative analytics.

Addressing Social Determinants of Health with FHIR Technology

Guest post by Clem McDonald, MD, Chief Health Data Standards Officer at NLM; Jessica Tenenbaum, PhD, Chief Data Officer for North Carolina’s Department of Health and Human Services; and Liz Amos, MLIS, Special Assistant to the Chief Health Data Standards Officer at NLM.

We all know that whether you get an annual flu shot or smoke affects your health. But nonmedical social and economic factors also strongly influence health. For example, individuals will struggle to control their diabetes if they can’t afford healthy food or are sleeping on the street. Healthy People 2020 describes such circumstances as social determinants of health (SDOH). As our health system shifts to value-based payment models, health care systems are prioritizing outcomes, such as the level of glucose control, rather than how much care is delivered (e.g., the number of visits or tests). To achieve better health outcomes, leading organizations are working to identify and address SDOH needs as well as medical needs.

The North Carolina Department of Health and Human Services (NCDHHS) Healthy Opportunities program identifies four priority domains of nonmedical needs that can be detected through the answers to screening questions. Screening for needs in these domains will become standard operating procedure for all Medicaid beneficiaries as the state transitions its Medicaid program from fee-for-service to managed care. Health care providers will be able to refer individuals to community resources such as food pantries, homeless shelters, transportation services, interpersonal violence counselors, and other services that can address some of these nonmedical needs, and those organizations can then be reimbursed for approved services under Medicaid. A computer-based “closed-loop” referral system will enable the collection of information from social service organizations about the services provided, allowing NCDHHS to facilitate reimbursement, monitor the program, and assess its effectiveness. Electronic systems like the one being used in North Carolina are essential to capturing answers to the SDOH screening questions, triaging individuals to appropriate community resources for intervention, and tracking the effects of those interventions. North Carolina is building a “learning” Department of Health and Human Services, similar to a learning health system, in which data collected through the services provided inform future policy decisions.

The SDOH needs being addressed in North Carolina exist across the country, so there is considerable interest in developing standards-based systems that can capture SDOH data anywhere in the United States without separate development efforts at each site. A powerful mechanism called Fast Healthcare Interoperability Resources®, or FHIR®, has emerged to enable standardization across a broad spectrum of health care processes. Developed by Health Level Seven International, FHIR is a modern, web-based technology for exchanging health care data that has strong and growing support from stakeholders across health care, including major electronic health record vendors; the tech industry, including Apple, Microsoft, Google, and Amazon; and federal agencies such as NIH, the Office of the National Coordinator for Health Information Technology, the Centers for Medicare and Medicaid Services, the Food and Drug Administration, and the Agency for Healthcare Research and Quality. NCDHHS is exploring the use of a FHIR-based data-capture tool for collecting SDOH information about nonmedical health needs and delivering the survey results to health care providers who can address the needs identified.

Created in the spirit of collaboration, NLM’s FHIR questionnaire app — an open-source tool that can be used, modified, or incorporated into existing tools by anyone — instantly converts a questionnaire that follows FHIR’s technical specifications into a live web form. It leverages the FHIR standard to collect questionnaire data, and generating a different form is just a matter of feeding the tool a different set of questions. FHIR forms can implement skip logic, the nesting of repeated groups of questions, calculations, validation checks, the repopulation of questions with answers from the individual’s FHIR medical record, and more. Of course, the same tool can also implement many other kinds of forms for capturing health care data, such as surveys that measure patient-reported outcomes. You can search more than 2,000 available questionnaires in NLM’s FHIR questionnaire demo app. Other NLM-developed, open-source FHIR-based tools for managing health care data are available here.
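
To give a sense of what the app consumes, here is a minimal, hypothetical FHIR Questionnaire with one skip-logic rule, expressed as a Python dict for readability; it is not the NCDHHS screening instrument.

```python
# A minimal, hypothetical FHIR Questionnaire with one skip-logic rule
# (enableWhen); the NLM app renders resources like this as live web forms.
# This is not the NCDHHS screening instrument.
questionnaire = {
    "resourceType": "Questionnaire",
    "status": "draft",
    "title": "Food insecurity screener (illustrative only)",
    "item": [
        {
            "linkId": "1",
            "text": "In the past 12 months, did you worry that your food would run out?",
            "type": "boolean",
        },
        {
            "linkId": "1.1",
            "text": "How often did this happen?",
            "type": "choice",
            # Skip logic: shown only when item 1 is answered "yes".
            "enableWhen": [
                {"question": "1", "operator": "=", "answerBoolean": True}
            ],
            "answerOption": [
                {"valueString": "Often"},
                {"valueString": "Sometimes"},
            ],
        },
    ],
}
```

Fed to the questionnaire app, a resource like this would render as a two-question form in which the follow-up question appears only after a “yes” answer.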

NLM and NCDHHS have worked together to develop an open-source, FHIR-based implementation of North Carolina’s Healthy Opportunities screening questions (see figure 1). Anyone with a FHIR-ready server will be able to download the form, enter data, and then route those data to the appropriate health information technology system.

Let’s get to work screening patients broadly while minimizing clinical documentation burdens through the use of standardized application programming interfaces!

 

Figure 1: North Carolina Department of Health and Human Services (NCDHHS)’s Social Determinants of Health (SDOH) Screening Form as a live FHIR Questionnaire demo.


Clem McDonald, MD

Clem McDonald, MD, is the Chief Health Data Standards Officer at NLM. In this role, he coordinates standards efforts across NLM and NIH, including the FHIR interoperability standard and vocabularies specific to clinical care (LOINC, SNOMED CT, and RxNorm). Dr. McDonald developed one of the nation’s first electronic medical record systems and the first community-wide clinical data repository, the Indiana Network for Patient Care. Dr. McDonald previously served 12 years as Director of the Lister Hill National Center for Biomedical Communications and as scientific director of its intramural research program.

Jessica Tenenbaum, PhD

Jessica Tenenbaum, PhD, is the Chief Data Officer for North Carolina’s Department of Health and Human Services. In this role, Dr. Tenenbaum is responsible for the development and oversight of departmental data governance and strategy to enable data-driven policy for improving the health and well-being of North Carolinians. Dr. Tenenbaum is also an Assistant Professor in Duke University’s Department of Biostatistics and Bioinformatics. Dr. Tenenbaum is a member of the Board of Directors for the American Medical Informatics Association and serves on the Board of Scientific Counselors for NLM.

Liz Amos, MLIS

Liz Amos, MLIS, is Special Assistant to the Chief Health Data Standards Officer at NLM. She is a graduate of the University of Tulsa and the University of Oklahoma.

Enhancing Data Sharing, One Dataset at a Time

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, National Institutes of Health

Figure: Vision of the NIH Strategic Plan for Data Science, with findable, accessible, interoperable, and reusable (FAIR) data at its center.

The National Institutes of Health (NIH) has an ambitious vision for a modernized, integrated biomedical data ecosystem. How we plan to achieve this vision is outlined in the NIH Strategic Plan for Data Science, and the long-term goal is to have NIH-funded data be findable, accessible, interoperable, and reusable (FAIR). To support this goal, we have made enhancing data access and sharing a central theme throughout the strategic plan.

While the topic of data sharing itself merits greater discussion, in this post I’m going to focus on one primary method for sharing data: domain-specific and generalist repositories.

The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.

NIH’s PubMed Central is a solution for storing and sharing datasets directly associated with publications and publication-related supplemental materials (up to 2 GB in size). At the other end of the spectrum, “big” datasets comprising petabytes of data are now starting to be shared through cloud service providers (CSPs), including through the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. These are still the early days of data sharing through CSPs, and we anticipate that this will be an active area of research.

There are, however, instances in which researchers are unable to find a domain-specific repository applicable to their research project. In these cases, a generalist repository that accepts data regardless of data type or discipline may be a good fit. Biomedical researchers already share data, software code, and other digital research products via many generalist repositories hosted by various institutions—often in collaboration with a library—and recommended by journals, publishers, or funders. While NIH does not have a recommended generalist repository, we are exploring the roles and uses of generalist repositories in our data repository landscape.

Figure: The NIH Figshare homepage (https://nih.figshare.com).

For example, as part of our exploratory strategy, NIH recently launched an NIH Figshare instance, a short-term pilot project with the generalist repository Figshare. This pilot provides NIH-funded researchers with a generalist repository option for up to 100 GB of data per user. The NIH Figshare instance complies with FAIR principles; supports a wide range of data and file types; captures customized metadata; and provides persistent unique identifiers with the ability to track attention, use, and reuse.

NIH Figshare is just one part of our approach to understanding the role of generalist repositories in making biomedical research data more discoverable. We recognize that making data more FAIR is no small task and certainly not one that we can accomplish on our own. Through this pilot project, and other related projects associated with implementing NIH’s strategy for data science, we look forward to working with the biomedical community—researchers, librarians, publishers, and institutions, as well as other funders and stakeholders—to understand the evolving data repository ecosystem and how to best enable useful and usable data sharing.

Together we can strengthen our data repository ecosystem and, ultimately, accelerate data-driven research and discovery. We invite you to join our efforts by sending your ideas and needs to datascience@nih.gov.

Susan Gregurick, PhD

Dr. Gregurick leads the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that comprise NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.

Accelerating Innovation in Science

Fast.

Healthcare.

Interoperability.

Resource.

Word about HL7 FHIR—pronounced “fire”—is spreading quickly across the National Institutes of Health (NIH) and scientific community, and for good reason.

The FHIR format is a global industry standard for exchanging health care data between institutions. Most electronic health record systems in hospitals and physicians’ offices already use FHIR to send and receive critical information for patient care and to support billing for patient services. With proper oversight and human subjects protections, clinical research and scientific advancement can benefit from FHIR as well. These data are becoming increasingly important for biomedical research, including predictive phenotyping and the conduct of clinical trials.
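
To give a flavor of that exchange, here is a minimal sketch of reading data through a FHIR server’s REST API. The base URL and patient ID are placeholders, and a production request would carry authentication (for example, OAuth 2.0 via SMART on FHIR).

```python
# Sketch of reading data over a FHIR REST API. The base URL and patient ID
# are placeholders; real servers require authorization (e.g., OAuth 2.0).
import requests

base = "https://fhir.example.org/r4"  # hypothetical FHIR R4 endpoint
headers = {"Accept": "application/fhir+json"}

# Fetch a single Patient resource by its logical ID.
patient = requests.get(f"{base}/Patient/12345", headers=headers).json()

# Search for that patient's serum glucose observations (LOINC 2345-7).
bundle = requests.get(
    f"{base}/Observation",
    params={"patient": "12345", "code": "http://loinc.org|2345-7"},
    headers=headers,
).json()

for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs.get("effectiveDateTime"), obs["valueQuantity"]["value"])
```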

NIH is taking steps to promote the use of FHIR in its funded clinical research to facilitate data access and promote interoperability of research data while protecting patient privacy and ensuring consistency with informed consent.

This afternoon, at the Blue Button 2.0 Developer Conference held at the White House Eisenhower Executive Office Building in Washington, DC, Clem McDonald, MD, Chief Health Data Standards Officer at the National Library of Medicine (NLM) at NIH, announced two notices issued today by NIH regarding FHIR.

The Guide Notice on FHIR encourages NIH-funded investigators to explore the use of FHIR to capture, integrate, and exchange clinical data for research purposes and to enhance their capabilities to share research data. The Notice aims to make all NIH-funded researchers aware of the emerging ability to extract data from electronic health records using FHIR and to encourage them to use FHIR-compatible formats when sharing data, consistent with privacy restrictions and informed consent. Following this notice, NIH will soon solicit input from the scientific community and other stakeholders about the tools that might be needed to support the use of FHIR in biomedical research, as well as the implementation challenges and opportunities they foresee in using FHIR.

To complement the research component, NIH also posted a Notice of Special Interest to inform the small business innovation research and small business technology transfer communities of NIH’s interest in supporting applications that use FHIR in the development of health information technology products and services. NIH is interested in the implementation of the FHIR standard in health IT applications, such as the integration of patient- and population-level data from electronic health records systems, access to and management of electronic health information, the development of clinical decision support systems, the enhancement of recruitment into clinical trials, and improving privacy and security for electronic health information.

These efforts will help implement the NIH Strategic Plan for Data Science and build on activities already under way in individual Institutes and Centers within NIH. The NLM’s new strategic plan positions the NLM as a platform for data-powered health, which will lead to the development of new analytics and novel visualization approaches to accelerate discovery from data.

As we promote the use of FHIR by our funded researchers and potential developers, institutes and centers across NIH—including the NLM, the National Human Genome Research Institute, and the National Center for Advancing Translational Sciences—are taking the lead in creating and using FHIR APIs in various research domains. Ongoing and emerging efforts aim to improve the retrieval of genomic and phenotypic data from the NIH Database of Genotypes and Phenotypes (dbGaP), integrate data from genetic test results into electronic medical records, and prototype the infrastructure needed to query clinical data from partner organizations. Based on what we learn from our engagement with the scientific community, we anticipate supporting the development of other tools and resources that can help scientists make better use of FHIR to enhance their research endeavors.

NIH is in a key position to contribute to the development of the research capabilities of FHIR, and we come ready to combine technology, scientific research, and data to make breakthrough discoveries that improve health.

We’re interested in your experiences and guidance. Come along with us, and let us know how NIH can help you and your researchers make the best use of FHIR.

Data Discovery at NLM

Guest post by David Hale, Information Technology Specialist at NLM.

Did you know that each day more than four million people use NLM resources and that every hour a petabyte of data moves in or out of our computing systems?

Those mammoth numbers indicate to me how essential NLM’s array of information products and services are to scientific progress. But as we gain more experience with providing information, particularly clinical, biologic, and genetic datasets, we’re finding that how we share data is as critical as the data itself.

To fuel the insights and solutions needed to improve public health, we must ensure data flow freely to the researchers, industry innovators, patient communities, and citizen scientists who can bring new lenses to these rich repositories of knowledge.

One way we’re opening doors to our data is through an open data portal called Data Discovery. While agencies like the Centers for Disease Control and Prevention and the Centers for Medicare and Medicaid Services already use the same platform successfully, NLM is the first of NIH’s Institutes and Centers to adopt it. Our first datasets are already available, including content from such diverse resources as the Dietary Supplement Label Database, Pillbox, ToxMap, Disaster Lit, and HealthReach.

Why did NLM take this step? While many of our data resources have long been publicly available online, housing them within Data Discovery offers unconstrained access and delivers key benefits:

  • Powerful data exploration tools—By presenting a dataset in a spreadsheet-like view, the Data Discovery platform offers the freedom to filter and interact with the data in novel ways.
  • Intuitive data visualizations—A picture is worth a thousand words, and nowhere is that truer than in using data visualizations to bring new perspectives to scientific questions.
  • Open data APIs—Open data alone isn’t enough to fuel a new generation of insights. Open APIs are critical to making the data understandable, accessible, and actionable, based on the unique needs of the user or audience.

What does this mean in practice?

Let’s look at the Office of Dietary Supplements’ (ODS) Dietary Supplement Label Database (DSLD) to illustrate the potential of leveraging Data Discovery.

More than half of all Americans take at least one dietary supplement a day. Reliable information about those supplements is critical to their appropriate use, making DSLD a timely and important dataset to make available in an open data platform. Through Data Discovery, researchers, academics, health care providers, and the public will be able to explore and derive insights from the labels of more than 85,000 dietary supplement products currently or formerly sold in the US.
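
To illustrate the API side, here is a hedged sketch of pulling DSLD-style records from an open data endpoint. It assumes a Socrata-style SODA interface of the kind such open data platforms expose; the dataset identifier and field names below are hypothetical, not the real DSLD schema.

```python
# Hedged sketch of querying an open data (SODA-style) endpoint such as
# those behind Data Discovery. The dataset identifier "abcd-1234" and the
# field names are hypothetical, not the real DSLD schema.
import requests

url = "https://datadiscovery.nlm.nih.gov/resource/abcd-1234.json"

# SoQL parameters let clients filter, project, and page server-side.
rows = requests.get(url, params={
    "$select": "product_name, ingredient",
    "$where": "ingredient like '%Vitamin D%'",
    "$limit": 25,
}).json()

for row in rows:
    print(row["product_name"])
```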

Developers and technologists who support research, health, and medical organizations require APIs that are modern, interoperable, and standards-compliant. Data Discovery provides a powerful solution to these needs, supporting NLM’s role as a platform for biomedical discovery and data-powered health.

Beyond fueling scientific discovery, open access to data holds another benefit for advancing public health: contributing to the professional development of data and informatics specialists. An increasingly important part of the health care workforce, informaticists help researchers extract the most meaningful insights from data, driving new developments in the lab and better management of patients and populations.

I invite you to explore the new Data Discovery portal. It’s an exciting step forward in achieving key aspects of the NLM Strategic Plan—to advocate for open science, further democratize access to data, and support the training and development of the data science workforce.

Photo credit: Jacie Lee Almira Photography

David Hale is an Information Technology Specialist at the National Library of Medicine. In addition to leading Data Discovery, David is also project lead for NLM’s Pillbox, a drug identification, reference, and image resource. He received his Bachelor of Science in Physical Science from the University of Maryland.