Traveling a Bridge2AI in a Quest for High-Quality, FAIR Data Sets

This blog was authored by NIH staff who serve on the Bridge to Artificial Intelligence (Bridge2AI) Working Group.

In April 2021, we introduced NIH Common Fund’s Bridge to Artificial Intelligence (Bridge2AI) program to tap the potential of artificial intelligence (AI) for revolutionizing biomedical discovery, increasing our understanding of human health, and improving the practice of medicine. In the past year, Bridge2AI researchers have been creating guidance and standards for the development of ethically sourced, state-of-the-art, AI-ready data sets to help solve some of the most pressing challenges in human health such as uncovering how genetic, behavioral, and environmental factors influence health and wellness. The program will also support the training required to enable the broader biomedical and behavioral research community to leverage AI technologies.

The NIH initiative will support diverse teams and tools to ensure that data sets adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. Beyond ensuring compliance with FAIR principles, Bridge2AI will develop and disseminate best practices that promote a culture of diversity and continuous ethical inquiry into how data are collected.

The Bridge2AI program will support innovative data-generation projects nationwide to collect complex AI-ready data in four biomedical areas:

Clinical Care Informatics—Intensive care units treat patients with urgent medical conditions such as sepsis and cardiac arrest. This data generation project will collect, integrate, annotate, and share high-resolution physiological data from adult and pediatric critical care patients from 14 health systems that can then be used by AI technologies to identify approaches to improve recovery from acute illness.

Functional Genomics—Within each cell in the human body lies a wealth of information about health, disease, and the impact of environmental factors. This project will generate richly detailed proteomic, genomic, and cellular imaging data to help predict disease mechanisms and associated gene pathways and networks for a variety of health outcomes.

Precision Public Health—The human voice is as unique as a fingerprint and has been found to contain acoustic signatures of human health and disease. This project will collect large-scale multimodal data sets containing voice, genomic, and clinical data, which AI technologies can use to help improve screening for and the diagnosis and treatment of a variety of developmental, neurological, and mental health conditions.

Return to Health—Much can be learned by uncovering how individuals move from a less healthy to a healthier state, a process called salutogenesis. This project will collect data from a diverse population with varying stages of type 2 diabetes to help improve our understanding of chronic disease progression and recovery. To learn more about Bridge2AI and salutogenesis, please view Bridging Our Way to Health Restoration by Helene M. Langevin, MD, director of the National Center for Complementary and Integrative Health.

To support these data generation projects, the Bridge2AI program includes a BRIDGE Center with a range of expertise to support interdisciplinary team science. The center will facilitate development of cross-cutting products such as standards harmonization, ethical AI best practices, and workforce development opportunities for the research community.

One of the goals of Bridge2AI is to foster a culture that will identify, assess, and address ethical issues as an integral part of creating AI-ready data sets. Ethical considerations include informed consent; data privacy; bias in data and its impact on the fairness and trustworthiness of AI applications; equity and justice; and inclusion and transparency in design.

Every component of the Bridge2AI program includes a plan for incorporating diverse perspectives at every step. The BRIDGE Center will serve as a hub for supporting ethical and trustworthy AI development across Bridge2AI with the goal of providing tools, best practices, and resources to address cross-cutting biomedical challenges.

Learn more about Bridge2AI in the press release and video. Find the latest news by visiting the Bridge2AI website and following @NIH_CommonFund on Twitter.

Top Row (left to right):
Patricia Flatley Brennan, RN, PhD, Director, National Library of Medicine
Michael F. Chiang, MD, Director, National Eye Institute
Eric D. Green, MD, PhD, Director, National Human Genome Research Institute

Bottom Row (left to right):
Helene M. Langevin, MD, Director, National Center for Complementary and Integrative Health
Bruce J. Tromberg, PhD, Director, National Institute of Biomedical Imaging and Bioengineering

RADx-UP Program Addresses Data Gaps in Underrepresented Communities

Guest post by Richard J. Hodes, MD, Director, National Institute on Aging, and Eliseo Pérez-Stable, MD, Director, National Institute on Minority Health and Health Disparities, NIH.

A few months into the COVID-19 pandemic, we shared how NIH was working to speed innovation in the development, commercialization, and implementation of technologies for COVID-19 through NIH’s Rapid Acceleration of Diagnostics (RADx) initiative.

Two years later, one of the RADx programs—RADx Underserved Populations (RADx-UP)—reflects on lessons learned that have broken the mold of standard research paradigms to address health disparities.

Use of Common Data Elements

RADx-UP has presented unique challenges in data collection, privacy, measurement standardization, and principles of data-sharing, along with an opportunity to reexamine community-engaged research. Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of allowable responses, used systematically across different sites, studies, or clinical trials to ensure that the whole is greater than the sum of its parts. They are not commonly used in community-engaged research. Use of CDEs enables data harmonization, aggregation, and analysis of related data across study sites as well as the ability to investigate relationships among data in unrelated data sets. CDEs can also lend statistical power to analyses of data for small subpopulations typically underrepresented in research.
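To make the idea of a CDE concrete, here is a minimal Python sketch of one as a data structure: a fixed question paired with a closed set of allowable responses, plus a helper that pools validated responses across sites. The element name, question wording, response codes, and site names are hypothetical illustrations, not actual RADx-UP CDEs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A standardized question with a closed set of allowable responses."""
    name: str
    question: str
    allowed: frozenset

    def validate(self, response: str) -> str:
        # Reject anything outside the agreed response set so that
        # every site records the answer the same way.
        if response not in self.allowed:
            raise ValueError(
                f"{response!r} is not an allowable response for {self.name}"
            )
        return response


# Hypothetical element, for illustration only.
TESTED_EVER = CommonDataElement(
    name="covid_test_ever",
    question="Have you ever been tested for COVID-19?",
    allowed=frozenset({"yes", "no", "unknown", "prefer_not_to_answer"}),
)


def aggregate(cde, responses_by_site):
    """Pool validated responses from several sites into one tally."""
    tally = Counter()
    for responses in responses_by_site.values():
        for response in responses:
            tally[cde.validate(response)] += 1
    return tally


pooled = aggregate(
    TESTED_EVER,
    {"site_a": ["yes", "no", "yes"], "site_b": ["unknown", "yes"]},
)
print(dict(pooled))
```

Because every site uses the same question and the same response codes, the tallies can be combined directly, which is the harmonization and aggregation benefit described above.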

RADx-UP is a community-engaged research program that builds on years of developing partnerships between communities and scientists. RADx-UP has funded 127 research projects with sites in every state and six U.S. territories as well as a RADx-UP Coordination and Data Collection Center (CDCC). RADx-UP assesses the needs and barriers related to COVID-19 testing and increases access to COVID-19 testing in underserved and vulnerable populations experiencing the highest rates of disparities in morbidity and mortality.

The COVID-19 pandemic necessitated establishing RADx-UP and its associated CDEs with unprecedented speed, relying heavily on data elements derived from those already defined in the NIH-based PhenX Toolkit and Disaster Research Response (DR2) resources. The short time frame for this process did not allow for collaboration and input from RADx-UP investigators and community partners as extensive as would have been ideal. Additionally, many researchers, especially community partners engaged in RADx-UP projects, were not familiar with CDE data collection practices. As a result, CDE questionnaires had to be modified as studies progressed to better suit the needs of the consortium, and investigators new to CDE collection had to be familiarized with these processes quickly. NIH program officers, NIH RADx-UP and CDCC leadership, and engagement impact teams (EITs; staff liaisons provided by the CDCC that link RADx-UP research teams to testing, data, and community-engagement resources) helped research teams implement and adjust CDE collection, ensured alignment across consortium research teams, and assisted with other data-related issues that arose.

All RADx programs are required to collect a standardized set of CDEs, including sociodemographic, medical history, and health status elements with the intent to provide researchers rapid access to data for secondary research analyses in the RADx Data Hub, the central repository for RADx data. However, implementation of CDEs in the context of underserved communities in the rapidly evolving COVID-19 pandemic presented complex issues for consideration.

Some of these issues included data privacy, the risk of re-identification of underserved and undocumented populations, and data collection burden on participants as well as researchers. The privacy of health data is protected under federal law. The RADx-UP program instituted measures to ensure program participants’ data remain protected and de-identified using a token-based hashing algorithm that allows researchers to share individual-level participant data without exposing personally identifiable information. To address data collection and respondent burden concerns, projects modified questions to allow some flexibility in expanding response options more appropriate to some underserved communities. The CDCC also developed COLECTIV, a digital interface for projects to directly enter data into the data repository, and included gateway questions to relieve respondent burden.
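The post does not describe the tokenization algorithm itself. As a rough illustration of the general idea only, the following Python sketch derives a token by applying a keyed hash (HMAC-SHA-256) to normalized identifying fields. The choice of fields, the normalization rules, and the secret key are assumptions made for illustration and should not be read as the RADx-UP method.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    # Collapse case and whitespace so trivially different spellings
    # of the same value produce the same token.
    return " ".join(value.strip().lower().split())


def make_token(secret_key: bytes, first: str, last: str, birth_date: str) -> str:
    """Return a hex token derived from identifying fields (hypothetical scheme)."""
    message = "|".join(normalize(v) for v in (first, last, birth_date))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration; a real key would be generated
# and stored securely, never hard-coded.
KEY = b"example-secret-key"

token_a = make_token(KEY, "Ana", "Rivera", "1980-02-14")
token_b = make_token(KEY, "  ana ", "RIVERA", "1980-02-14")
token_c = make_token(KEY, "Ana", "Rivera", "1980-02-15")

print(token_a == token_b)  # same person, different formatting
print(token_a == token_c)  # different birth date
```

In a scheme like this, records that carry the same token can be linked across data sets while the name and birth date themselves are never shared. Keyed hashing alone does not guarantee de-identification; it is one building block among the protections a program would need.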

Respect for Tribal Data Sovereignty

RADx-UP leadership and investigators recognized that additional considerations for tribal sovereignty, practices, and policies needed to be addressed for projects that include American Indian and Alaska Native (AI/AN) participants. Through consultations with the NIH Tribal Advisory Committee and the broader AI/AN community and meetings with an informal RADx-UP AI/AN project working group established by the CDCC, NIH realized that deposition of tribal data into the RADx Data Hub would not meet the cultural, governance, or sovereignty needs of AI/AN RADx research data. In response, NIH hopes to establish a RADx Tribal Data Repository (TDR) responsible for the collection, protection, and sharing of data collected in AI/AN communities with respect for the practices and policies of Tribal data sovereignty. Applications for the repository have been solicited and NIH hopes to make an award for the TDR sometime in FY23.

Rapid Data Sharing

One of the largest hurdles the RADx-UP program has faced is implementing rapid sharing of research data for secondary analyses and to inform decision-making and public health practices related to the COVID-19 pandemic. RADx-UP research teams are expected to share their data on a timely cadence before data collection ends. This is a far more stringent practice than the current standard NIH data-sharing policy, which requires data to be shared at the time of acceptance for publication of the main findings from the final data set. NIH and CDCC staff have worked together with the RADx research community to highlight the importance of rapid data-sharing and to support compliance with it. Within the first six months, a total of 69 Phase 1 projects began transmitting CDE data to the RADx-UP CDCC.

The COVID-19 pandemic posed a tremendous challenge, and NIH responded by collaborating with vulnerable and underserved communities. This collaboration has opened an unprecedented opportunity to build on a now-established foundation for future research to address gaps in understanding the broader social, cultural, and structural factors that influence disparities in morbidity and mortality from COVID-19 and other diseases. Data collection and sharing efforts of the RADx-UP initiative comprise a significant contribution. Collaboration among the NIH, research investigators, and communities impacted by COVID-19 has been the catalyst. To learn more about RADx-UP, please read a recent journal article available on PubMed.


Dr. Hodes has served as NIA director since 1993, overseeing studies of the biological, clinical, behavioral, and social aspects of aging. He has devoted his tenure to the development of a strong, diverse, and balanced research program focused on the genetics and biology of aging, basic and clinical studies aimed at reducing disease and disability, and investigation of the behavioral and social aspects of aging. Ultimately, these efforts have one goal — improving the health and quality of life for older people and their families. As a leading researcher in the field of immunology, Dr. Hodes has published more than 250 peer-reviewed papers.

Dr. Pérez-Stable practiced primary care internal medicine for 37 years at the University of California, San Francisco before becoming the Director of NIMHD in 2015. His research interests have centered on improving the health of individuals from racial and ethnic minority communities through effective prevention interventions, understanding underlying causes of health disparities, and advancing patient-centered care for underserved populations. Recognized as a leader in Latino health care and disparities research, he spent 32 years leading research on smoking cessation and tobacco control in Latino populations in the United States and Latin America. Dr. Pérez-Stable has published more than 300 peer-reviewed papers.

The Next Normal: Supporting Biomedical Discovery, Clinical Practice, and Self-Care

As we start year three of the COVID-19 pandemic, it’s time for NLM to take stock of the parts of our past that will support the next normal and what we might need to change as we continue to fulfill our mission to acquire, collect, preserve, and disseminate biomedical literature to the world.

Today, I invite you to join me in considering the assumptions and presumptions we made about how scientists, clinicians, librarians and patients are using critical NLM resources and how we might need to update those assumptions to meet future needs. I will give you a hint… it’s not all bad—in fact, I find it quite exciting!

Let’s highlight some of our assumptions about how people are using our services, at least from my perspective. We anticipated the need for access to medical literature across the Network of the National Library of Medicine and created DOCLINE, an interlibrary loan request routing system that quickly and efficiently links participating libraries’ journal holdings. We also anticipated that we were preparing the literature and our genomic databases for humans to read and peruse. Now we’re finding that more than half of the accesses to NLM resources are generated and driven by computers through application programming interfaces. Even our MedlinePlus resource for patients now connects tailored electronic responses through MedlinePlus Connect to computer-generated queries originating in electronic health records.

Perhaps most importantly, we realize that while sometimes the information we present is actually read by a living person, other times the information we provide, such as clinical trial information (ClinicalTrials.gov) or genotype and phenotype data (dbGaP), is actually processed by computers! Increasingly, we provide direct access to the raw, machine-readable versions of our resources so those versions can be entered into specialized analysis programs, which allow natural-language processing programs to find studies with similar findings or machine-learning models to determine the similarities between two gene sequences. For example, NLM makes it possible for advocacy groups to download study information from all ClinicalTrials.gov records so anyone can use their own programs to point out trials that may be of interest to their constituents or to compare summaries of research results for related studies.
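As one illustration of the programmatic access described above, the Python sketch below builds a request URL for the NCBI E-utilities ESearch service, which searches PubMed and returns matching record identifiers. The search term and result limit are arbitrary examples, and the code only constructs the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Base URL for the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that asks PubMed for record IDs matching `term`."""
    params = {
        "db": "pubmed",          # database to search
        "term": term,            # query text, as typed into the PubMed search box
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for JSON instead of the default XML
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url("sickle cell disease echocardiography", max_results=5)
print(url)
```

A program could fetch this URL on a schedule and pass the returned identifiers to other tools, which is the kind of computer-driven access that now accounts for much of the traffic to NLM resources.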

Machine learning and artificial intelligence have progressed to the point that they perform reasonably well in connecting similar articles—to this end, our LitCovid open-resource literature hub has served as an electronic companion to the human curation of coronavirus literature. NLM’s LitCovid is more efficient and has a sophisticated search function to create pathways that are more relevant and are more likely to curate articles that fulfill the needs of our users. Most importantly, innovations such as LitCovid help our users manage the vast and ever-growing collection of biomedical literature, now numbering more than 34 million citations in NLM’s PubMed, the most heavily used biomedical literature citation database.

Partnerships are a critical asset to bring biomedical knowledge into the hands (and eyes) of those who need it. Over the last decade, NLM moved toward a new model for managing citation data in PubMed. We released the PubMed Data Management system that allows publishers to quickly update or correct nearly all elements of their citations and that accelerates the delivery of correct and complete citation data to PubMed users.

As part of the MEDLINE 2022 Initiative, NLM transitioned to automated Medical Subject Headings (MeSH) indexing of MEDLINE citations in PubMed. Automated MeSH indexing significantly decreases the time for indexed citations to appear in PubMed without sacrificing the quality MEDLINE is known to provide. Our human indexers can focus their expertise on curation efforts to validate assigned MeSH terms, thereby continuously improving the automated indexing algorithm and enhancing discoverability of gene and chemical information in the future.

We’re already preparing for the next normal—what do you think it will be like?

I envision making our vast resources increasingly available to those who need them and forging stronger partnerships that improve users’ ability to acquire and understand knowledge. Imagine a service, designed and run by patients, that could pull and synthesize the latest information about a disease, offer recommendations for managing a clinical issue, or help a young investigator better pinpoint areas ripe for new interrogation! The next normal will make the best use of human judgment and creativity by selecting and organizing relevant data to create a story that forms the foundation of new inquiry or the basis of new clinical care. Come along and help us co-create the next normal!

Meet the NLM Investigators: For Sameer Antani, PhD, Seeing is More Than Meets the Eye

It’s time for another round of introductions! Many of you may already know Sameer Antani, PhD—one of NLM’s most decorated and prestigious investigators—from his many awards and accolades. In March 2022, he was inducted into the American Institute for Medical and Biological Engineering’s College of Fellows, an impressive group that represents the top two percent of medical and biological engineers. This distinction is one of the highest honors that can be bestowed upon a medical and biological engineer. Can you tell we are proud of him?!  

We selected Dr. Antani to join our NLM family after a nationwide, competitive search, and his genius was readily apparent from the start. Dr. Antani’s career spans over two decades, during which he developed an innovative research portfolio focused on machine learning and artificial intelligence (AI). His lab at NLM focuses on using these tools to analyze enormous sets of biomedical data. Through this analysis, AI technology can “learn” to detect disease and help health care professionals provide more efficient diagnoses. Examples of Dr. Antani’s work can be found in mobile radiology vehicles, which allow professionals to take chest X-rays and screen for HIV and tuberculosis using software containing algorithms developed in his lab. Check out the infographic below to learn more about the exciting research happening in Dr. Antani’s lab.

Infographic titled: Seeing is more than meets the eye. Under the title the investigator's name, title and division are listed as: Sameer Antani, PhD, Investigator, Computational Health. 

The first column of the infographic is titled: Projects. Two bullets are listed in the first column. The first bullet reads: Discovering the impact of data on automated AI and machine learning (AI/ML) processes on diagnostics. The second bullet says: Improving AI/ML algorithm decisions to be consistent, reproducible, portable, explainable, unbiased, and representative of severity.

The second column is titled: Process. The first bullet in this column reads: Using images and videos alongside AI/ML technology to identify and diagnose:
Cancers: Cervical, Oral, Skin (Kaposi Sarcoma)
Cardiomyopathy 
Cardiopulmonary diseases. 
The second bullet reads: Analyzing a variety of image types, including:
Computerized Tomography (CT), Magnetic Resonance Imaging (MRI), X-ray, ultrasound, photos, videos, microscopy. 

The third and final column in the infographic is titled: What It Looks Like. In this column there are four images of chest x-rays illustrating the detection of HIV and TB.

Now, in his own words, learn more about what makes Dr. Antani’s work so important!

What makes your team unique? Tell us more about the people working in your lab.   

The postdoctoral research fellows, long-term staff scientists, and research scientists on my team explore challenging computational health topics while simultaneously advancing topics in machine learning for medical imaging. Dr. Ghada Zamzmi, Dr. Peng Guo, and Dr. Feng Yang bring expertise and drive to our lab. The scientists on my team, Dr. Zhiyun (Jaylene) Xue and Dr. Sivarama Krishnan Rajaraman, add over two decades of combined research and mentoring experience.  

What do you enjoy about working at NLM?  

There are many positives about working at NLM. At the top of the list is the encouragement and support to explore cutting-edge problems in medical informatics, data science, and machine intelligence, among other initiatives. 

What is your advice for young scientists or people interested in pursuing a career in research?  

I urge young scientists to recognize the power of multidisciplinary teams. I would also urge them to develop skills to clearly communicate their goals and research interests with colleagues who might be from a different domain so they can effectively collaborate and arrive at mutually beneficial results. 

Where is your favorite place to travel?

I like to travel to places that exhibit the natural wonders of our planet. I hope to visit all our national parks someday. 

When you’re not in the lab, what do you enjoy doing?

I am studying and exploring different aspects of music structure.

You’ve read his words, and now you can hear him for yourself! Follow our NLM YouTube page for more exciting content from the NLM staff that make it all possible. If you’d like to learn more about our Intramural Research Program (IRP), view job opportunities, and explore research highlights, I invite you to visit our recently redesigned NLM IRP webpage.

YouTube: Sameer Antani and Artificial Intelligence

Transcript: [Antani]: I went to school for computer engineering in India. I’ve worked with image processing, computer vision, pattern recognition, machine learning. So my world was filled with developing algorithms that could extract interesting objects from images and videos. Pattern recognition is a family of techniques that looks for particular pixel characteristics or voxel characteristics inside an image and learns to recognize those objects. Deep learning is a way of capturing the knowledge inside an image and encapsulating it, and then researchers like me spend time advancing newer deep-learning networks that look more broadly into an image, recognizing these objects—recognizing organs, in my case, and diseases—and converting those visuals into numerical risk predictors that could be used by clinicians.

So my research is currently in three very different areas. One area looks at cervical cancer. A machine could look at the images and be a very solid predictor of the risk to the woman of developing cervical precancer, encouraging early treatment. Another area I work with [is] sickle cell disease. One of the risk factors in sickle cell disease is cardiac myopathy or cardiac muscle disease, which leads to stroke and perhaps even death. Looking at cardiac echo videos and using AI to be a solid predictor, along with other blood lab tests, improves the chances of survival.

A third area that I’m interested in is understanding the expression of tuberculosis [TB] in chest X-rays, particularly for children and those that are HIV-positive. The expression of disease in that subpopulation is very different from adults with TB who are not HIV positive. Every clinician has seen a certain number of patients in their clinical training. They perhaps have spent more time at hospitals or clinical centers, been exposed to a certain population, and they become very adept at that population. Machines, on the other hand, could be trained on data that is free of bias, from different parts of the world, different ethnicities, different age groups, so that there’s an improved caregiving and therefore, a better expectation on treatment and care.

Note: Transcript was modified for clarity.

Bridging the Resource Divide for Artificial Intelligence Research

This blog post is by Lynne Parker, Director, National AI Initiative Office and was originally posted on the White House Office of Science and Technology Policy blog. The Office of Science and Technology Policy and the National Science Foundation are seeking comments on the initial findings and recommendations contained in the interim report of the National Artificial Intelligence Research Resource (NAIRR) Task Force (“Task Force”) and particularly on potential approaches to implement those recommendations. We encourage you to read the RFI and submit comments on Implementing Initial Findings and Recommendations of the National Artificial Intelligence Research Resource Task Force by June 30, 2022.

Artificial Intelligence (AI) is transforming our world. The field is an engine of innovation that is already driving scientific discovery, economic growth, and new jobs. AI is an integral component of solutions ranging from those that tackle routine daily tasks to societal-level challenges, while also giving rise to new challenges necessitating further study and action. Most Americans already interact with AI-based systems on a daily basis, such as those that help us find the best routes to work and school, select the items we buy, and ask our phones to remind us of upcoming appointments.

Once studied by few, AI courses are now among the most popular across America’s universities. AI-based companies are being founded and scaled at a rapid rate. Worldwide AI-related research publications and patent applications continue to climb. 

However, this growth in the importance of AI to our future and the size of the AI community obscures the reality that the pathways to participate in AI research and development (R&D) often remain limited to those with access to certain essential resources. Progress at the current frontiers of AI is often tied to the use of large volumes of advanced computational power and data, and access to those resources today is too often limited to large technology companies and well-resourced universities. Consequently, the breadth of ideas and perspectives incorporated into AI innovations can be limited and lead to the creation of systems that perpetuate biases and other systemic inequalities.

This growing resource divide has the potential to adversely skew our AI research ecosystem, and in the process, threaten our Nation’s ability to cultivate an AI research community and workforce that reflects America’s rich diversity – and harness AI in a manner that serves all Americans. To prevent unintended consequences or disparate impacts from the use of AI, it matters who is doing the AI research and development.

Established in June 2021 pursuant to the National AI Initiative Act of 2020, the National AI Research Resource (NAIRR) Task Force has been seeking to address this resource divide. As a Congressionally-chartered Federal advisory committee, the NAIRR Task Force has been developing a plan for the establishment of a National AI Research Resource that would democratize access to AI R&D for America’s researchers and students. The NAIRR is envisioned as a broadly available and federated collection of resources, including computational infrastructure, public- and private-sector data, and testbeds. These resources would be made easily accessible in a manner that protects privacy, with accompanying educational tools and user support to facilitate their use. An important element of the NAIRR will be the expertise to design, deploy, federate, and operate these resources.

Since its establishment, the Task Force has held 7 public meetings, engaged with 39 experts on a wide range of aspects related to the design of the NAIRR, and considered 84 responses from the public to a request for information (RFI). Materials from all public meetings and responses to the RFI can be found at www.AI.gov/nairrtf.

Today, as co-chair of the Task Force and as part of OSTP’s broader work to advance the responsible research, development, and use of AI, I am proud to announce the submission of the interim report of the NAIRR Task Force to the President and Congress. This report lays out a vision for how this national cyberinfrastructure could be structured, designed, operated, and governed to meet the needs of America’s research community. In the report, the Task Force presents an approach to establishing the NAIRR that builds on existing and future Federal investments; designs in protections for privacy, civil rights, and civil liberties; and promotes diversity and equitable access. It details how the NAIRR should support the full spectrum of AI research – from foundational to use-inspired to translational – by providing opportunities for students and researchers to access resources that would otherwise be out of their reach. The vision laid out in this interim report is the first step towards a more equitable future for AI R&D in America – a future where innovation can flourish and the promise of AI can be realized in a way that works for all Americans.

Going forward, the Task Force will develop a roadmap for achieving the vision defined in the interim report. This implementation roadmap is planned for release as the final report of the Task Force at the end of this year. To inform this work, we are asking for feedback from the public on the findings and recommendations presented in the interim report as well as how those recommendations could be effectively implemented. Public responses to this request for information will be accepted through June 30, 2022. In addition, OSTP and the National Science Foundation will host a public listening session on June 23 to provide additional means for public input. Please see here for more information on how to participate.

If successful, the NAIRR would transform the U.S. national AI research ecosystem by strengthening and democratizing foundational, use-inspired, and translational AI R&D in the United States. The interim report of the NAIRR Task Force being released today represents a first step towards this future, putting forward a vision for the NAIRR for public comment and feedback.
