Traveling a Bridge2AI in a Quest for High-Quality, FAIR Data Sets

This blog was authored by NIH staff who serve on the Bridge to Artificial Intelligence (Bridge2AI) Working Group.

In April 2021, we introduced NIH Common Fund’s Bridge to Artificial Intelligence (Bridge2AI) program to tap the potential of artificial intelligence (AI) for revolutionizing biomedical discovery, increasing our understanding of human health, and improving the practice of medicine. In the past year, Bridge2AI researchers have been creating guidance and standards for the development of ethically sourced, state-of-the-art, AI-ready data sets to help solve some of the most pressing challenges in human health such as uncovering how genetic, behavioral, and environmental factors influence health and wellness. The program will also support the training required to enable the broader biomedical and behavioral research community to leverage AI technologies.

The NIH initiative will support diverse teams and tools to ensure that data sets adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. Beyond ensuring compliance with FAIR principles, Bridge2AI will develop and disseminate best practices that promote a culture of diversity and continuous ethical inquiry into how data are collected.

The Bridge2AI program will support innovative data-generation projects nationwide to collect complex AI-ready data in four biomedical areas:

Clinical Care Informatics—Intensive care units treat patients with urgent medical conditions such as sepsis and cardiac arrest. This data generation project will collect, integrate, annotate, and share high-resolution physiological data from adult and pediatric critical care patients from 14 health systems that can then be used by AI technologies to identify approaches to improve recovery from acute illness.

Functional Genomics—Within each cell in the human body lies a wealth of information about health, disease, and the impact of environmental factors. This project will generate richly detailed proteomic, genomic, and cellular imaging data to help predict disease mechanisms and associated gene pathways and networks for a variety of health outcomes.

Precision Public Health—The human voice is as unique as a fingerprint and has been found to contain acoustic signatures of human health and disease. This project will collect large-scale multimodal data sets containing voice, genomic, and clinical data, which AI technologies can use to help improve screening for and the diagnosis and treatment of a variety of developmental, neurological, and mental health conditions.

Return to Health—Much can be learned by uncovering how individuals move from a less healthy to a healthier state, a process called salutogenesis. This project will collect data from a diverse population with varying stages of type 2 diabetes to help improve our understanding of chronic disease progression and recovery. To learn more about Bridge2AI and salutogenesis, please view Bridging Our Way to Health Restoration by Helene M. Langevin, MD, director of the National Center for Complementary and Integrative Health.

To support these data generation projects, the Bridge2AI program includes a BRIDGE Center with a range of expertise to support interdisciplinary team science. The center will facilitate development of cross-cutting products such as standards harmonization, ethical AI best practices, and workforce development opportunities for the research community.

One of the goals of Bridge2AI is to foster a culture that will identify, assess, and address ethical issues as an integral part of creating AI-ready data sets. Ethical considerations include informed consent; data privacy; bias in data and its impact on the fairness and trustworthiness of AI applications; equity and justice; and inclusion and transparency in design.

Every component of the Bridge2AI program includes a plan for incorporating diverse perspectives at every step. The BRIDGE Center will serve as a hub for supporting ethical and trustworthy AI development across Bridge2AI with the goal of providing tools, best practices, and resources to address cross-cutting biomedical challenges.

Learn more about Bridge2AI in the press release and video. Find the latest news by visiting the Bridge2AI website and following @NIH_CommonFund on Twitter.

Top Row (left to right):
Patricia Flatley Brennan, RN, PhD, Director, National Library of Medicine
Michael F. Chiang, MD, Director, National Eye Institute
Eric D. Green, MD, PhD, Director, National Human Genome Research Institute

Bottom Row (left to right):
Helene M. Langevin, MD, Director, National Center for Complementary and Integrative Health
Bruce J. Tromberg, PhD, Director, National Institute of Biomedical Imaging and Bioengineering

RADx-UP Program Addresses Data Gaps in Underrepresented Communities

Guest post by Richard J. Hodes, MD, Director, National Institute on Aging, and Eliseo Pérez-Stable, MD, Director, National Institute on Minority Health and Health Disparities, NIH.

A few months into the COVID-19 pandemic, we shared how NIH was working to speed innovation in the development, commercialization, and implementation of technologies for COVID-19 through NIH’s Rapid Acceleration of Diagnostics (RADx) initiative.

Two years later, one of the RADx programs—RADx Underserved Populations (RADx-UP)—reflects on lessons learned that have broken the mold of standard research paradigms to address health disparities.

Use of Common Data Elements

RADx-UP has presented unique challenges in terms of data collection, privacy concerns, measurement standardization, principles of data-sharing, and the opportunity to reexamine community-engaged research. Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of allowable responses, used systematically across different sites, studies, or clinical trials to ensure that the whole is greater than the sum of its parts. They are not commonly used in community-engaged research. Use of CDEs enables data harmonization, aggregation, and analysis of related data across study sites as well as the ability to investigate relationships among data in unrelated data sets. CDEs can also lend statistical power to analyses of data for small subpopulations typically underrepresented in research.
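To make the idea concrete, a CDE can be pictured as a question with a fixed set of allowable responses that every site validates against before data are pooled. The sketch below is illustrative only: the `CommonDataElement` class, the `pool` helper, the example question, and its response codes are all hypothetical and are not taken from the actual RADx-UP CDE set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A hypothetical CDE: one question plus its allowable response codes."""
    name: str
    question: str
    allowed: frozenset

    def validate(self, response: str) -> bool:
        # A response is valid only if it is one of the allowable codes.
        return response in self.allowed


# Hypothetical element; real CDE wording and codes will differ.
tested = CommonDataElement(
    name="covid_test_ever",
    question="Have you ever been tested for COVID-19?",
    allowed=frozenset({"yes", "no", "unknown"}),
)


def pool(sites: dict, cde: CommonDataElement) -> dict:
    """Aggregate valid responses to one CDE across sites into counts."""
    counts = {code: 0 for code in sorted(cde.allowed)}
    for responses in sites.values():
        for response in responses:
            if cde.validate(response):
                counts[response] += 1
    return counts


site_data = {
    "site_a": ["yes", "no", "yes"],
    "site_b": ["unknown", "yes", "maybe"],  # "maybe" is not an allowable code
}
pooled = pool(site_data, tested)
print(pooled)
```

Because both sites answer the same question with the same codes, their responses can be summed directly. The non-conforming "maybe" is left out of the pooled counts rather than silently mixed in.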

RADx-UP is a community-engaged research program that builds on years of developing partnerships between communities and scientists. RADx-UP has funded 127 research projects with sites in every state and six U.S. territories as well as a RADx-UP Coordination and Data Collection Center (CDCC). RADx-UP assesses the needs and barriers related to COVID-19 testing and increases access to COVID-19 testing in underserved and vulnerable populations experiencing the highest rates of disparities in morbidity and mortality.

The COVID-19 pandemic necessitated establishing RADx-UP and its associated CDEs with unprecedented speed, relying heavily on data elements derived from those already defined in the NIH-based PhenX Toolkit and Disaster Research Response (DR2) resources. The short time frame for this process did not allow for as much collaboration with and input from RADx-UP investigators and community partners as would have been ideal. Additionally, many researchers, especially community partners engaged in RADx-UP projects, were not familiar with CDE data collection practices. As a result, CDE questionnaires had to be modified as studies progressed to better suit the needs of the consortium, and investigators new to CDE collection had to be familiarized with these processes quickly. NIH program officers, NIH RADx-UP and CDCC leadership, and engagement impact teams (EITs), the staff liaisons provided by the CDCC that link RADx-UP research teams to testing, data, and community-engagement resources, helped research teams implement and adjust CDE collection, ensured alignment across consortium research teams, and assisted with other data-related issues that arose.

All RADx programs are required to collect a standardized set of CDEs, including sociodemographic, medical history, and health status elements with the intent to provide researchers rapid access to data for secondary research analyses in the RADx Data Hub, the central repository for RADx data. However, implementation of CDEs in the context of underserved communities in the rapidly evolving COVID-19 pandemic presented complex issues for consideration.

Some of these issues included data privacy, the risk of re-identification of underserved and undocumented populations, and data collection burden on participants as well as researchers. The privacy of health data is protected under federal law. The RADx-UP program instituted measures to ensure program participants’ data remain protected and de-identified, using a token-based hashing method that allows researchers to share individual-level participant data without exposing personally identifiable information. To address data collection and respondent burden concerns, projects modified questions to allow some flexibility in expanding response options more appropriate to some underserved communities. The CDCC also developed COLECTIV, a digital interface that lets projects enter data directly into the data repository, and included gateway questions to relieve respondent burden.
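As a rough illustration of how a token-based approach can work in general, the sketch below derives a keyed hash from normalized identifiers and shares only that token alongside the study data. This is an assumption-laden example, not the RADx-UP algorithm: the `make_token` function, the choice of HMAC-SHA-256, the identifier fields, and the key are all hypothetical.

```python
import hashlib
import hmac


def make_token(first: str, last: str, dob: str, secret: bytes) -> str:
    """Derive a keyed hash token from normalized identifiers (hypothetical scheme)."""
    # Normalize so trivial formatting differences yield the same token.
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key would be kept private.
SECRET = b"example-key-not-for-real-use"

# Made-up participant record.
record = {
    "first": "Ana",
    "last": "Lopez",
    "dob": "1980-02-14",
    "test_result": "negative",
}
token = make_token(record["first"], record["last"], record["dob"], SECRET)

# The shared record carries the token and study data, not the identifiers.
shared = {"token": token, "test_result": record["test_result"]}
print(shared)
```

The same person yields the same token wherever the same key and normalization are used, so records can be linked without exchanging names or birth dates. Without the key, the token cannot be recomputed from guessed identifiers.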

Respect for Tribal Data Sovereignty

RADx-UP leadership and investigators recognized that additional considerations for tribal sovereignty, practices, and policies needed to be addressed for projects that include American Indian and Alaska Native (AI/AN) participants. Through consultations with the NIH Tribal Advisory Committee and the broader AI/AN community and meetings with an informal RADx-UP AI/AN project working group established by the CDCC, NIH realized that deposition of tribal data into the RADx Data Hub would not meet the cultural, governance, or sovereignty needs of AI/AN RADx research data. In response, NIH hopes to establish a RADx Tribal Data Repository (TDR) responsible for the collection, protection, and sharing of data collected in AI/AN communities with respect for the practices and policies of Tribal data sovereignty. Applications for the repository have been solicited and NIH hopes to make an award for the TDR sometime in FY23.

Rapid Data Sharing

One of the largest hurdles the RADx-UP program has faced is implementing rapid sharing of research data for secondary analyses and to inform decision-making and public health practices related to the COVID-19 pandemic. RADx-UP research teams are expected to share their data on a timely cadence before data collection ends. This is far more stringent than the current standard NIH data-sharing policy, which requires data to be shared at the time of acceptance for publication of the main findings from the final data set. NIH and CDCC staff have worked together with the RADx research community to highlight the importance of rapid data-sharing and to support compliance with it. Within the first six months, a total of 69 Phase 1 projects began transmitting CDE data to the RADx-UP CDCC.

The COVID-19 pandemic posed a tremendous challenge, and NIH responded by collaborating with vulnerable and underserved communities. This collaboration has opened an unprecedented opportunity to build on a now-established foundation for future research to address gaps in understanding the broader social, cultural, and structural factors that influence disparities in morbidity and mortality from COVID-19 and other diseases. Data collection and sharing efforts of the RADx-UP initiative comprise a significant contribution. Collaboration among the NIH, research investigators, and communities impacted by COVID-19 has been the catalyst. To learn more about RADx-UP, please see a recent journal article available on PubMed.


Dr. Hodes has served as NIA director since 1993, overseeing studies of the biological, clinical, behavioral, and social aspects of aging. He has devoted his tenure to the development of a strong, diverse, and balanced research program focused on the genetics and biology of aging, basic and clinical studies aimed at reducing disease and disability, and investigation of the behavioral and social aspects of aging. Ultimately, these efforts have one goal — improving the health and quality of life for older people and their families. As a leading researcher in the field of immunology, Dr. Hodes has published more than 250 peer-reviewed papers.

Dr. Pérez-Stable practiced primary care internal medicine for 37 years at the University of California, San Francisco before becoming the Director of NIMHD in 2015. His research interests have centered on improving the health of individuals from racial and ethnic minority communities through effective prevention interventions, understanding underlying causes of health disparities, and advancing patient-centered care for underserved populations. Recognized as a leader in Latino health care and disparities research, he spent 32 years leading research on smoking cessation and tobacco control in Latino populations in the United States and Latin America. Dr. Pérez-Stable has published more than 300 peer-reviewed papers.

Using Large Datasets to Improve Health Outcomes

Guest post by Lyn Hardy, PhD, RN, Program Officer, Division of Extramural Programs, National Library of Medicine, National Institutes of Health.

Before the advent of algorithms to determine the best way to treat and prevent heart disease, a health care provider looking for best practices for their patients may not have had the resources to find that best method. Today, health care decision-making for individuals and their health care providers is made easier by predictive and preventive models, which were developed with the goal of guiding the decision-making process. One example is the Patient Level Prediction of Clinical Outcomes and Cost-Effectiveness project led by Columbia University Health Sciences.

These models are created using computer algorithms (a set of rules for problem-solving) based on data science methods that analyze large amounts of data. While computers can analyze facts within the data, they rely on human programming to define what pieces of data or what data types are important to include in the analysis to create a valid algorithm and model. The results are translated into information that health care providers can use to understand patterns and provide methods for predicting and preventing illness. If a health care provider is looking for ways to prevent heart disease, an accurate model might describe methods—like exercise, diet, and mindfulness practices—that can achieve that goal.

Algorithms and models have benefited the world by using data science methods and techniques to understand patterns that guide clinical decisions, but practitioners still need to be conscious of the data used in their development when interpreting the results. Research has shown that algorithms and models can be misleading or biased if they do not account for population differences like gender, race, and age. These biases, which are the concern of work on algorithmic fairness, can adversely affect the health of underserved populations by not giving individuals and health care providers information that is specific to them and directly addresses their diversity. An example of potential algorithmic bias is creating an algorithm to treat hypertension without including varied treatments for women or considering life-related stress or the environment.
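One simple way to look for this kind of bias is to compare a model's performance across population groups. The sketch below computes the true positive rate (the share of actual cases the model catches) per group and the gap between groups. It is a minimal illustration with made-up records, and the function name and group labels are hypothetical. It shows one of many possible fairness checks, not a method prescribed by NLM.

```python
from collections import defaultdict


def true_positive_rate_by_group(records):
    """Return {group: TPR}, where TPR = true positives / actual positives."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:
            positives[group] += 1
            if predicted == 1:
                hits[group] += 1
    # Groups with no actual positives are omitted to avoid dividing by zero.
    return {group: hits[group] / positives[group] for group in positives}


# Made-up (group, actual, predicted) triples for illustration only.
records = [
    ("women", 1, 1), ("women", 1, 0), ("women", 1, 0), ("women", 0, 0),
    ("men", 1, 1), ("men", 1, 1), ("men", 1, 0), ("men", 0, 0),
]
rates = true_positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap suggests the model misses true cases more often in one group than another, which is a signal to revisit the training data or the model before relying on it for clinical decisions.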

Researchers are focusing on methods to create fair and equitable algorithms and models to provide all populations with the best and most appropriate health care decisions. Researchers in our NLM Extramural Programs analyze this data through NLM funding opportunities that foster scientific inquiry so we better understand algorithmic effects on minority and marginalized populations. Some of those funding opportunities include NLM Research Grants in Biomedical Informatics and Data Science (R01 Clinical Trial Optional) and the NIH Research Project Grant (Parent R01 Clinical Trial Not Allowed).

NLM is interested in state-of-the-art methods and approaches to address problems using large health data sets and tools to analyze them. Specific areas of interest include:

  • Developing and testing computational or statistical approaches to apply to large or merged health data sets containing human and non-human data, with a focus on understanding and characterizing the gaps, errors, biases, and other limitations in the data or inferences based on the data.
  • Exploring approaches to correct these biases or compensate for missing data, including introducing debiasing techniques and policies or using synthetic data.
  • Testing new statistical algorithms or other computational approaches to strengthen research designs using specific types of biomedical and social/behavioral data.
  • Generating metadata that adequately characterizes the data, including its provenance, intended use, and processes by which it was collected and verified.
  • Improving approaches for integrating, mining, and analyzing health data in a way that preserves that data’s confidentiality, accuracy, completeness, and overall security.

These funding opportunities encourage inquiry into algorithmic fairness to improve health care for all individuals, especially those who are underserved. By using new research models that account for diverse populations, we will be able to provide data that will support the best treatment outcomes for everyone.

Dr. Hardy’s work and expertise focus on using health informatics to improve public health and health care decision-making. Dr. Hardy has held positions as a researcher and academician and is active in national informatics organizations. She has written and edited books on informatics and health care.

Informing Success from the Outside In: Introducing the NLM Board of Regents CGR Working Group

Guest post by Valerie Schneider, PhD, staff scientist at the National Library of Medicine (NLM) National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), and Kristi Holmes, PhD, Director of Galter Health Sciences Library & Learning Center and Professor of Preventive Medicine at Northwestern University Feinberg School of Medicine.

Last year, we described in two blog posts how NLM is developing the NIH Comparative Genomics Resource (CGR), a project that offers content, tools, and interfaces for genomic data resources associated with eukaryotic research organisms.

A eukaryote is any single-celled or multicellular organism whose cells contain a distinct, membrane-bound nucleus. Because all eukaryotes likely evolved from a common ancestor, studying them can give us insight into how other eukaryotes, including humans, work, which makes CGR and its resources that much more important to eukaryotic research.

CGR aims to:

  • Promote high-quality eukaryotic genomic data submission.
  • Enrich NLM’s genomic-related content with community-sourced content.
  • Facilitate comparative biological analyses.
  • Support the development of the next generation of scientists.

Since our last two posts, the team at NCBI has been hard at work making important technical and content updates to and socializing CGR’s suite of tools. For instance, they published new webpages that organize genome-related data by taxonomy, making it available for browsing and immediate download. They also created the ClusteredNR Database, a new database for the Basic Local Alignment Search Tool (BLAST), to provide results with greater taxonomic context for sequence searches, and incorporated new gene information from the Alliance of Genome Resources, an organization that unites data and information for model organisms’ unique aspects, into Gene. NCBI is also engaging with genomics communities to understand their needs and requirements for comparative genomics through the NLM Board of Regents Comparative Genomics Working Group.

The working group is lending their perspective and extensive expertise to the project, activities that are essential to CGR’s success and development. We have charged working group members with guiding the development of a new approach to scientific discovery that relies on genomic-related data from research organisms, helping project teams keep pace with changes in the field, and understanding the scientific community’s needs and expectations for key functionalities. To do this, working group members help NLM set development priorities such as exploring CGR’s integration with existing infrastructures and related workforce development opportunities.

Projects like CGR highlight how critical interdisciplinary collaboration is to modern research and how success requires community perspectives and involvement. Working group members will be sharing more information about this project at upcoming conferences and in biomedical literature, and our team at NCBI will also share events and resources through our NIH Comparative Genomics Resource website.

If you are a member of a model organism community, are working on emerging eukaryotic research models, or support eukaryotic genomic data—whether you are a researcher, educator, student, scholarly society member, librarian, data scientist, database resource manager, developer, epidemiologist, or other stakeholder in our progress—we encourage you to reach out and get involved. Here are a few suggestions:

  • Invite us to join you at a conference, teach a workshop, partner on a webinar, or discuss other ideas you may have to foster information sharing and feedback.
  • Use and share CGR’s suite of tools and share your feedback.
  • Be on the lookout for project updates and events on the CGR website or follow @NCBI on Twitter.

We’re always excited to get feedback through CGR listening sessions and user testing for tool and resource updates. Email cgr@nlm.nih.gov to learn all the ways you can participate.

Thank you to the members of the NLM Board of Regents CGR Working Group!

Alejandro Sanchez Alvarado, PhD
Executive Director and Chief Scientific Officer
Priscilla Wood Neaves Chair in the Biomedical Sciences
Stowers Institute for Medical Research

Hannah Carey, PhD
Professor, Department of Comparative Biosciences, School of Veterinary Medicine
University of Wisconsin-Madison

Wayne Frankel, PhD
Professor, Department of Genetics & Development
Director of Preclinical Models, Institute of Genomic Medicine
Columbia University Medical Center

Kristi L. Holmes, PhD (Chair)
Director, Galter Health Sciences Library & Learning Center
Professor of Preventive Medicine (Health & Biomedical Informatics)
Northwestern University Feinberg School of Medicine

Ani W. Manichaikul, PhD
Associate Professor, Center for Public Health Genomics
University of Virginia School of Medicine

Len Pennacchio, PhD
Senior Scientist
Lawrence Berkeley National Laboratory

Valerie Schneider, PhD (Executive Secretary)
Program Head, Sequence Enhancements, Tools and Delivery (SeqPlus)
HHS/NIH/NLM/NCBI

Kenneth Stuart, PhD
Professor, Center of Global Infectious Disease Research
Seattle Children’s Research Institute

Tandy Warnow, PhD
Grainger Distinguished Chair in Engineering
Associate Head of Computer Science
University of Illinois Urbana-Champaign

Rick Woychik, PhD (NIH CGR Steering Committee Liaison)
Director, National Institute of Environmental Health Sciences (NIEHS) and the National Toxicology Program (NTP)

Cathy Wu, PhD
Unidel Edward G. Jefferson Chair in Engineering and Computer Science
Director, Center for Bioinformatics & Computational Biology
Director, Data Science Institute
University of Delaware

Dr. Schneider is the deputy director of Sequence Offerings and the head of the Sequence Plus program. In these roles, she coordinates efforts associated with the curation, enhancement, and organization of sequence data, as well as oversees tools and resources that enable the public to access, analyze, and visualize biomedical data. She also manages NCBI’s involvement in the Genome Reference Consortium, which is the international collaboration tasked with maintaining the value of the human reference genome assembly.

Dr. Holmes is dedicated to empowering discovery and equitable access to knowledge through the development of computational and social architectures to support these goals. She also serves on the leadership team of the Northwestern University Clinical and Translational Sciences Institute.

Bridging the Resource Divide for Artificial Intelligence Research

This blog post is by Lynne Parker, Director, National AI Initiative Office and was originally posted on the White House Office of Science and Technology Policy blog. The Office of Science and Technology Policy and the National Science Foundation are seeking comments on the initial findings and recommendations contained in the interim report of the National Artificial Intelligence Research Resource (NAIRR) Task Force (“Task Force”) and particularly on potential approaches to implement those recommendations. We encourage you to read the RFI and submit comments on Implementing Initial Findings and Recommendations of the National Artificial Intelligence Research Resource Task Force by June 30, 2022.

Artificial Intelligence (AI) is transforming our world. The field is an engine of innovation that is already driving scientific discovery, economic growth, and new jobs. AI is an integral component of solutions ranging from those that tackle routine daily tasks to societal-level challenges, while also giving rise to new challenges necessitating further study and action. Most Americans already interact with AI-based systems on a daily basis, such as those that help us find the best routes to work and school, select the items we buy, and let us ask our phones to remind us of upcoming appointments.

Once studied by few, AI courses are now among the most popular across America’s universities. AI-based companies are being founded and scaled at a rapid rate. Worldwide AI-related research publications and patent applications continue to climb. 

However, this growth in the importance of AI to our future and the size of the AI community obscures the reality that the pathways to participate in AI research and development (R&D) often remain limited to those with access to certain essential resources. Progress at the current frontiers of AI is often tied to the use of large volumes of advanced computational power and data, and access to those resources today is too often limited to large technology companies and well-resourced universities. Consequently, the breadth of ideas and perspectives incorporated into AI innovations can be limited and lead to the creation of systems that perpetuate biases and other systemic inequalities.

This growing resource divide has the potential to adversely skew our AI research ecosystem, and in the process, threaten our Nation’s ability to cultivate an AI research community and workforce that reflects America’s rich diversity – and harness AI in a manner that serves all Americans. To prevent unintended consequences or disparate impacts from the use of AI, it matters who is doing the AI research and development.

Established in June 2021 pursuant to the National AI Initiative Act of 2020, the National AI Research Resource (NAIRR) Task Force has been seeking to address this resource divide. As a Congressionally-chartered Federal advisory committee, the NAIRR Task Force has been developing a plan for the establishment of a National AI Research Resource that would democratize access to AI R&D for America’s researchers and students. The NAIRR is envisioned as a broadly available and federated collection of resources, including computational infrastructure, public- and private-sector data, and testbeds. These resources would be made easily accessible in a manner that protects privacy, with accompanying educational tools and user support to facilitate their use. An important element of the NAIRR will be the expertise to design, deploy, federate, and operate these resources.

Since its establishment, the Task Force has held 7 public meetings, engaged with 39 experts on a wide range of aspects related to the design of the NAIRR, and considered 84 responses from the public to a request for information (RFI). Materials from all public meetings and responses to the RFI can be found at www.AI.gov/nairrtf.

Today, as co-chair of the Task Force and as part of OSTP’s broader work to advance the responsible research, development, and use of AI, I am proud to announce the submission of the interim report of the NAIRR Task Force to the President and Congress. This report lays out a vision for how this national cyberinfrastructure could be structured, designed, operated, and governed to meet the needs of America’s research community. In the report, the Task Force presents an approach to establishing the NAIRR that builds on existing and future Federal investments; designs in protections for privacy, civil rights, and civil liberties; and promotes diversity and equitable access. It details how the NAIRR should support the full spectrum of AI research – from foundational to use-inspired to translational – by providing opportunities for students and researchers to access resources that would otherwise be out of their reach. The vision laid out in this interim report is the first step towards a more equitable future for AI R&D in America – a future where innovation can flourish and the promise of AI can be realized in a way that works for all Americans.

Going forward, the Task Force will develop a roadmap for achieving the vision defined in the interim report. This implementation roadmap is planned for release as the final report of the Task Force at the end of this year. To inform this work, we are asking for feedback from the public on the findings and recommendations presented in the interim report as well as how those recommendations could be effectively implemented. Public responses to this request for information will be accepted through June 30, 2022. In addition, OSTP and the National Science Foundation will host a public listening session on June 23 to provide additional means for public input. Please see here for more information on how to participate.

If successful, the NAIRR would transform the U.S. national AI research ecosystem by strengthening and democratizing foundational, use-inspired, and translational AI R&D in the United States. The interim report of the NAIRR Task Force being released today represents a first step towards this future, putting forward a vision for the NAIRR for public comment and feedback.
