NIH Strategically, and Ethically, Building a Bridge to AI (Bridge2AI)

This piece was authored by staff across NIH who serve on the working group for the NIH Common Fund’s Bridge2AI program, a new trans-NIH effort to harness the power of AI to propel biomedical and behavioral research forward.

The evolving field of Artificial Intelligence (AI) has the potential to revolutionize scientific discovery from bench to bedside. The understanding of human health and disease has vastly expanded as a result of research supported by the National Institutes of Health (NIH) and others. Every discovery and advance in contemporary medicine comes with a deluge of data. These large quantities of data, however, still result in restricted, incomplete views into the natural processes underlying human health and disease. These complex processes occur across the “health-disease” spectrum over temporal scales – sub-seconds to years – and biological scales – atomic, molecular, cellular, organ systems, individual to population. AI provides the computational and analytical tools that have the potential to connect the dots across these scales to drive discovery and clinical utility from all of the available evidence.

A new NIH Common Fund program, Bridge to Artificial Intelligence (Bridge2AI), will tap into the power of AI to lead the way toward insights that can ultimately inform clinical decisions and individualize care. AI, which encompasses many methods, including modern machine learning (ML), offers potential solutions to many challenges in biomedical and behavioral research.

AI emerged in the 1960s and has evolved substantially in the past two decades in terms of its utility for biomedical research. The impact of AI for biomedical and behavioral research and clinical care derives from its ability to use computer algorithms to quickly find connections from within large data sets and predict future outcomes. AI is already used to improve diagnostic accuracy, increase efficiency in workflow and clinical operations, and facilitate disease and therapeutic monitoring, to name a few applications. To date, the FDA has approved more than 100 AI-based medical products.

AI-assisted learning and discovery is only as good as the data used to train it. 

The use of AI/ML modeling in biomedical and behavioral research is limited by the availability of well-defined data to “train” AI algorithms to learn how to recognize patterns within the data. Existing biomedical and behavioral data sets rarely include all necessary information because they are collected from relatively small samples and lack the diversity of the U.S. population. Data from a variety of sources are necessary to characterize human health, including -omics, imaging, behavioral, and clinical indicators, as well as electronic health records, wearable sensors, and population health summaries. The data generation process itself involves human assumptions, inferences, and biases that must be considered in developing ethical principles surrounding data collection and use. Standardizing collection processes is challenging and requires new approaches and methods. Comprehensive, systematically generated, and carefully collected data are critical to building AI models that provide actionable information and predictive power. Data generation remains among the greatest challenges that must be resolved for AI to have a real-world impact on medicine.

Bridge2AI is a bold new initiative at the National Institutes of Health designed to propel research forward by accelerating AI/ML solutions to complex biomedical and behavioral health challenges whose resolution lies far beyond human intuition. Bridge2AI will support the generation of new biomedically relevant data sets amenable to AI/ML analysis at scale; development of standards across multiple data sources and types; production of tools to accelerate the creation of FAIR (Findable, Accessible, Interoperable, Reusable) AI/ML-ready data; design of skills and workforce development materials and activities; and promotion of a culture of diversity and ethical inquiry throughout the data generation process.

Bridge2AI plans to support several Data Generation Projects and an Integration, Dissemination and Evaluation (BRIDGE) Center to develop best practices for the use of AI/ML in biomedical and behavioral research. For additional information, see NOT-OD-21-021 and NOT-OD-21-022. Keep up with the latest news by visiting the Bridge2AI website regularly and subscribing to the Bridge2AI listserv.

Top Row (left to right):
Patricia Flatley Brennan, RN, PhD, Director, National Library of Medicine
Michael F. Chiang, MD, Director, National Eye Institute
Eric Green, MD, PhD, Director, National Human Genome Research Institute
Bottom Row (left to right):
Helene Langevin, MD, Director, National Center for Complementary and Integrative Health
Bruce J. Tromberg, PhD, Director, National Institute of Biomedical Imaging and Bioengineering

Walking in Each Other’s Shoes: Fostering Interdisciplinary Research Collaboration

Guest post by Teresa Przytycka, PhD, Senior Investigator, Computational and Systems Biology section of the Computational Biology Branch at the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health

If you pose the question “What is the difference between computational biology and bioinformatics?” you will get many contradictory answers. Terminology aside, most researchers would agree that the space between traditional biology and traditional computer science is wide enough to accommodate many different models of collaboration between these two groups of researchers.

Bioinformatics analysis, which involves the analysis of biological data such as DNA, RNA, and protein sequences, has become a standard step required after many types of now routine experiments. For example, after performing an experiment measuring gene expression in different conditions, bioinformatics analysis is likely to be used to compare gene expression between these conditions. In this setting, while experimental and computational components are necessary for the success of the project, only limited interaction between the experimentalist (the user of the tool) and the computational expert (the producer of the tool) is required.

However, given the richness of biomedical data and the complexity of the relations between various biomolecular entities, such as genes and proteins, researchers can be challenged to ask questions that cannot be answered through traditional means. In such cases, the user-producer model of collaboration is increasingly replaced by a different model of collaboration where biologists and computer scientists work side-by-side to both formulate and answer questions. Such collaborations across disciplines can introduce new perspectives and approaches to spur innovation and open the door to addressing new challenging questions.

Recognizing the need to foster interdisciplinary science, NIH formed an interdisciplinary committee to explore the development of a systems biology center at NIH in 2008. The driving idea behind such a center was to create a space where people of different backgrounds can mix, exchange ideas, and through these exchanges come to solutions to open biomedical questions enabled by interdisciplinary approaches. While this effort did not result in the establishment of a physical center, per se, it did give rise to a network of interdisciplinary collaborations. Interestingly, two of the collaborations NLM started at the time still continue today:  the collaboration with the Center for Cancer Research at the National Cancer Institute focusing on conformational dynamics of DNA structure, and the collaboration with the Laboratory of Cellular and Developmental Biology’s Developmental Genomics Section at the National Institute of Diabetes and Digestive and Kidney Diseases studying various aspects of gene regulation.  

What made these collaborations successful and long-lasting?  

Perhaps most importantly, the process of bringing experimental and computational groups together calls for a willingness to learn each other’s languages, thought processes, and cultures — or the proverbial walking in someone else’s shoes. 

Over the years, we learned to think together, work together, and publish many papers together. For example, in one of our joint projects, our experimental partners collected data that helped the computational group construct the gene regulatory network for a fly that can later be utilized by other researchers studying this model organism. In addition to the joy of advancing discovery together, these collaborations opened doors to foster synergies among young computational and experimental researchers from the collaborating groups. These interactions have enriched their NIH experience in important ways, giving them skills that they are likely to find very useful in the future — whether they go on to build their own research groups or follow other career choices.

I am confident that NLM’s support for interdisciplinary research through scientific collaborations will continue to spur innovation and discovery. It will also help to train a generation of researchers who can seamlessly work with people of different scientific backgrounds.

What do you value in your collaborations?

Teresa M. Przytycka, PhD, leads the Algorithmic Methods in Computational and Systems Biology section at the National Center for Biotechnology Information. Dr. Przytycka is particularly interested in the dynamical properties of biological systems, including spatial, temporal and contextual variations, and exploring how such variations impact gene expression, the functioning of biological pathways, and the phenotype of the organism.

NLM Announces New Annual Lecture on Science, Technology, and Society

Guest post by Maryam Zaringhalam, PhD, National Library of Medicine Data Science and Open Science Officer and Mike Huerta, PhD, director of the Office of Strategic Initiatives and associate director of the National Library of Medicine.

In October 2019, NLM invited award-winning science journalist Angela Saini to discuss her research on how bias and prejudice have crept into science. Her lecture examined how racist and sexist ideas have permeated science over its history, and how science, in turn, has been contorted to justify and perpetuate pseudoscientific myths of innate inferiority. Saini’s work and insights sparked a crucial conversation within NLM about our role and responsibility as the world’s largest biomedical library and a leader in data science research, situated within the nation’s premier medical research agency, to question how systemic biases affect our work and determine how we can correct them.

As advancing equity and rooting out structural discrimination in science and technology have become an increasingly urgent federal priority, NLM will build on this discussion, in part, by announcing the launch of an annual NLM Science, Technology, and Society Lecture on March 1, 2021.

Situated at the nexus of the NIH-supported research community and the public, NLM plays a vital role not only in advancing cutting-edge research, but also in acting as a steward of biomedical information in service of society. As leaders in facilitating and shaping the future of biomedical data science, we must understand the implications of our work for society as a whole. We must, for instance, question how biases may creep into algorithms that connect research results with the public and think through the ethical ramifications of emerging technologies that might reinforce and amplify those biases. As a national library, we serve as curators of the history of biomedical science, which must reflect both the great achievements made possible by research and the injustices committed within the scientific community. And as an institution with more than 8,000 points of presence through our Network of the National Library of Medicine, we have the means to fulfill our responsibility to meet the needs and understand the concerns of the communities we serve.

With these responsibilities and NLM’s unique role and capabilities in mind, the NLM Science, Technology, and Society Lecture aims to raise awareness around the societal and ethical implications of the conduct of biomedical research and the use of advanced technologies, while seeding conversations across the Library, NIH, and the broader biomedical research community. NLM sees such considerations as fundamental to advancing biomedical discovery and human health for the benefit of all.

Dr. Kate Crawford is the inaugural Visiting Chair of AI and Justice at the École Normale Supérieure, as well as a Senior Principal Researcher at Microsoft Research, and the cofounder of the AI Now Institute at New York University.

Each spring, we plan to invite a leading voice working at the intersection of biomedicine, data science, ethics, and justice to present their research and how it relates to the mission and vision of NLM, as well as NIH more broadly. This year, we are pleased to host Dr. Kate Crawford, a leading scholar of science, technology, and society, with over 20 years of experience studying large scale data systems and artificial intelligence (AI) in the wider contexts of history, politics, labor, and the environment. Her lecture, “Atlas of AI: Mapping the social and economic forces behind AI”, will explore how machine learning systems can reproduce and intensify forms of structural bias and discrimination and offer new paths for thinking through the research ethics and policy implications of the turn to machine learning.

As the interests, priorities, and concerns of our society continue to evolve, particularly in response to emerging technologies and shifting national conversations, we hope this annual lecture, alongside established lecture series such as NLM History Talks, will provide an invaluable perspective on the societal implications of our work and further establish NLM’s leadership as a trusted partner in health.

Dr. Zaringhalam is a member of the Office of Strategic Initiatives and is responsible for monitoring and coordinating data science and open science activities and development across NLM, NIH, and beyond. She completed her PhD in molecular biology at Rockefeller University in 2017 before joining NLM as an AAAS Science and Technology Policy Fellow.

Dr. Huerta leads NLM in identifying, implementing, and assessing strategic directions of NLM, including at the intersection of data science and open science. In his 30 years at NIH, he has led many trans-NIH research initiatives and helped establish neuroinformatics as a field. Dr. Huerta joined NIH’s National Institute of Mental Health in 1991, before moving to NLM in 2011.

A Journey to Spur Innovation and Discovery

Guest post by Valerie Schneider, PhD, staff scientist at the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

It’s been said that nature is the best teacher. When it comes to understanding human biology and improving health, examples abound of the advances that have been made from the study of a diverse set of non-human organisms. Over the last two centuries, the study of nematode worms has taught us about longevity and mRNA (the biological molecule that is the basis for several COVID-19 vaccines), common fungi about cell division and cancer, and fruit flies about many things, from the role of chromosomes in heredity to our circadian rhythms. The ability to create targeted alterations in the genomes of model organisms has been transformative for studies to establish the function of specific genes in the etiology of human disease.

The modern era of genomic biology, in which genome sequencing and assembly are accessible to more researchers than ever before, provides data from an even greater range of organisms from which we might learn. Today, we rely not only on primate models, but on a whole host of species: for example, swine to understand organ transplantation, songbirds to understand vocalization and learning, and bats and pangolins to teach us about the evolution of the SARS-CoV-2 virus and how to fight its spread.

These rapidly growing collections of sequence and other data on species across the tree of life offer enormous promise for discoveries that have the potential to improve human health. To better enable such discoveries, with the support of NIH, NLM is planning a major modernization of its resources and their underlying infrastructure.

This modernization will support the needs of users engaged in data search and retrieval, gene annotation, evaluation of sequence quality, and comparative analyses. The new infrastructure, user interfaces, and tools should result in an improved experience for researchers doing a wide range of work, and also facilitate better data submissions.

This revamping aligns with NIH’s Strategic Plan for Data Science, which provides a roadmap for modernizing the NIH-funded biomedical data science ecosystem, as well as NLM’s Strategic Plan, which furthers NLM’s commitment to provide data and information to accelerate biomedical discovery and improve health. NLM and NIH are committed to providing researchers with modern, stable, and cloud-oriented technologies that support research needs.

Over the last few years, NLM has demonstrated this commitment by re-designing several flagship products, including the PubMed database for searching published biomedical literature, the ClinicalTrials.gov database of information on privately and publicly funded clinical trials, and the Basic Local Alignment Search Tool (BLAST) for finding regions of similarity between biological sequences. As part of NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM also made the data from its massive (36 petabyte) Sequence Read Archive (SRA) available on two commercial cloud platforms, facilitating large-scale computational research that would otherwise be difficult for many researchers. Revamping these resources has positioned them to support both the current and future needs of NLM’s diverse audience of researchers, clinicians, data scientists, educators and others.

Importantly, this current initiative to modernize NLM products, tools, and services, and concurrently develop content, will include extensive engagement with the research community, just as we’ve done with previous re-design efforts. NLM is committed to offering interfaces accessible to both novices and experts. Additionally, NLM believes the next generation of its data resources requires an infrastructure that supports an ongoing, dynamic exchange of content, including contributions of metadata and gene functional information from knowledge builders in the community to complement and enhance NIH-provided content.

Community engagement will also ensure that externally sourced content is provided in ways that maintain the high value and trustworthiness of the datasets. Additionally, data connections that make the content of this new resource accessible to external knowledgebases containing other datatypes, such as images, will further promote integrative data analyses that support scientific discovery.

Many opportunities exist to streamline processes, look across resources, and gain insights that will provide new ways of learning. Through NLM’s continued commitment to modernization initiatives, we are ready to again improve the user experience for accessing, analyzing and visualizing sequence data and related information. Nature continues to be our best teacher — and we are now poised to learn from her in an exciting new classroom.

We invite you to come on this journey with us.

Valerie Schneider, PhD, is the deputy director of Sequence Offerings and the head of the Sequence Plus program. In these roles, she coordinates efforts associated with the curation, enhancement, and organization of sequence data, as well as oversees tools and resources that enable the public to access, analyze, and visualize biomedical data. She also manages NCBI’s involvement in the Genome Reference Consortium, the international collaboration tasked with maintaining the value of the human reference genome assembly.

Health Data Standards: A Common Language to Support Research and Health Care

Guest post by Dianne Babski, Associate Director for Library Operations and Robin Taylor, MLIS, National Library of Medicine

Every day we benefit from data standards, and every day most of us don’t even notice it! Did you wear a seatbelt today? Take a precise dose of medicine? Send an email? Plug a laptop into an outlet? These are examples of activities that are made possible through data standards. At NLM, we think a lot about data standards, particularly health data standards.

NLM partners with organizations such as the Office of the National Coordinator for Health Information Technology (ONC) to promote health data standards for data captured in electronic health records (EHRs), clinical research, and other health information systems. With a focus on how health data are collected, stored, described, and retrieved, health data standards make up the backbone of interoperability. Interoperability is the ability to connect and seamlessly share data between computerized systems, and it allows information to be exchanged with other applications and databases.

Let’s look at a current example where health data standards, a common data language, have had a real impact. When SARS-CoV-2, and the disease it causes, COVID-19, emerged in late 2019, researchers around the world began planning studies to figure out how to combat this global pandemic. Research questions, such as, “What date did the patient first display COVID-19 symptoms?” arose continuously. It sounds like a simple question, but there are so many ways to ask the question, and even more possible responses. If researchers apply health data standards in their investigations — if they ask questions and collect responses in a standardized way — the data they collect can be combined and compared with data from other COVID-19 studies and EHRs. This enables reuse of data across multiple sources, which increases statistical power and accelerates our understanding of this disease.  

For more than 20 years, NLM has served as the central coordinating body for clinical terminology standards nationally. Our long-standing efforts to establish common health terminology supported the COVID-19 response by allowing access to near-real time clinical information to guide the diagnosis, treatment, and prevention of this disease.

NLM supports multiple vocabulary standards and mappings, like RxNorm, SNOMED CT, and the UMLS, as well as terminology tools like AccessGUDID, DailyMed, MedlinePlus Connect, MetaMap, the Value Set Authority Center (VSAC), and the NIH CDE Repository, a database that provides access to structured human and machine-readable definitions of common data elements, more commonly referred to as CDEs.

CDEs are one type of health data standard that can help researchers normalize data across studies. CDEs are standardized, precisely defined questions that are paired with a set of specific allowable responses, then used systematically across different sites, studies, or clinical trials to ensure consistent data collection.
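To make the idea concrete, here is a minimal sketch of a CDE as a question paired with a fixed set of allowable responses. It is an illustration only, not the NIH CDE Repository's actual data model: the class name, the identifier "EXAMPLE-0001", the question wording, and the permissible values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """Hypothetical, simplified CDE: one precisely worded question plus its allowable responses."""
    identifier: str
    question: str
    permissible_values: frozenset

    def validate(self, response: str) -> str:
        """Return the response unchanged if it is allowed; otherwise raise ValueError."""
        if response not in self.permissible_values:
            raise ValueError(
                f"{response!r} is not an allowable response for {self.identifier}"
            )
        return response


# Illustrative element; the identifier, wording, and values are invented for this sketch.
SYMPTOM_STATUS = CommonDataElement(
    identifier="EXAMPLE-0001",
    question="Has the patient displayed COVID-19 symptoms?",
    permissible_values=frozenset({"Yes", "No", "Unknown"}),
)


def pool_responses(cde, *studies):
    """Validate every response against the same CDE, then combine the studies into one list."""
    return [cde.validate(response) for study in studies for response in study]


# Two hypothetical sites that collected answers using the same element.
site_a = ["Yes", "No", "Yes"]
site_b = ["Unknown", "Yes"]
pooled = pool_responses(SYMPTOM_STATUS, site_a, site_b)
```

Because both sites answered the same question with the same allowable responses, their data can be pooled without any recoding. A free-text answer such as "maybe" would be rejected at collection time rather than discovered later during analysis.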

CDEs are in use across NIH, to varying degrees. Some NIH Institutes and Centers have had mature CDE programs for years; others are just beginning to develop them. NLM has been involved with CDEs since 2012 and plays a key role in encouraging CDE adoption across NIH by:

  • Hosting the NIH CDE Task Force (CDETF), a trans-NIH community of practice.
  • Forming a CDE Governance Committee that reports to the CDETF. The committee’s primary charge is to decide whether common data elements submitted to them by NIH recognized bodies (NIH Institutes, offices, etc.) meet criteria that merit their recommendation for use in NIH-funded research.
  • Maintaining the NIH CDE Repository, a central access point to data elements that have been recommended or required by NIH Institutes and Centers for use in research and for other purposes. In 2020, we completed a usability study of the NIH CDE Repository and have been implementing enhancements based on the recommendations.

This year, while continuing to enhance the usability of the NIH CDE Repository, we will also engage with users through a CDE awareness and training campaign.

Ms. Babski is responsible for overall management of one of NLM’s largest divisions with more than 450 staff who provide health information services to a global audience of health care professionals, researchers, administrators, students, historians, patients, and the public.

Robin Taylor, MLIS, joined NLM in 2016. Since 2018, she has been the lead for the NIH Common Data Elements Repository.