Common Data Elements: Increasing FAIR Data Sharing

Guest post by Carolina Mendoza-Puccini, MD, CDE Program Officer, Division of Clinical Research, National Institute of Neurological Disorders and Stroke (NINDS) and Kenneth J. Wilkins, PhD, Mathematical Statistician, Biostatistics Program and Office of Clinical Research Support, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)

Previous posts published in Musings from the Mezzanine have explained the importance of health data standards and their role as the backbone of interoperability. Common Data Elements (CDEs) are a type of health data standard commonly used and reused in both clinical and research settings. CDEs capture complex phenomena, like depression or recovery, through standardized, well-defined questions (variables) paired with sets of allowable responses (values) that are used consistently across studies or trials.

CDEs provide a way to standardize data collection—ensuring that data are collected consistently and that otherwise-avoidable variability is minimized.

Where possible, CDEs are linked to controlled vocabularies and terminologies commonly used in health care, such as SNOMED CT and LOINC, and they can provide a route to harmonization with non-prospective clinical research designs. Such links leverage common data entities, like the clinical concepts underlying common data models, to align evidence from clinical studies with evidence from ‘real-world data’ such as electronic health records (EHRs), mobile and wearable devices, and patient-reported outcomes—a combination that has become known in recent years as ‘real-world evidence’.
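
To make the variable/value pairing concrete, here is a minimal sketch of how a single CDE might be represented in machine-readable form. The field names are illustrative assumptions, not the NIH CDE Repository's actual schema; the LOINC code shown is the standard code for tobacco smoking status.

```python
# A minimal, hypothetical machine-readable representation of one CDE.
# Field names are illustrative, not an official NIH CDE Repository schema.
cde_smoking_status = {
    "name": "Tobacco smoking status",
    "prompt": "What is your current tobacco smoking status?",  # the question (variable)
    "permissible_values": [                                    # allowable responses (values)
        "Current every day smoker",
        "Former smoker",
        "Never smoker",
    ],
    # Link to a controlled terminology, as described above.
    "terminology_binding": {"system": "LOINC", "code": "72166-2"},
}
```

Because every study using this CDE would ask the same question with the same permissible values, the resulting data sets can be pooled or compared without post hoc recoding.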

Importance of CDEs for Interoperability and Consistency of Evidence Across Settings

FAIR Data Principles (Source: National Institute of Environmental Health Sciences)

NIH’s response to the COVID-19 pandemic highlighted the importance of developing CDEs that can be used and endorsed across NIH-funded COVID-19 research so that the resulting, urgently needed data would be FAIR: Findable, Accessible, Interoperable, and Reusable.

Many groups across NIH identified, or are in the process of identifying, CDEs that are both related to COVID-19 and suited to the needs of specific research projects, such as NIH’s Disaster Research Response (DR2) program and the Rapid Acceleration of Diagnostics—Underserved Populations (RADx-UP) and Researching COVID to Enhance Recovery (RECOVER) initiatives. There was also a need to develop a process for indicating NIH endorsement of CDEs that meet meaningful criteria, are made available through a common discovery platform (the NIH CDE Repository), and avoid duplicating the functions of existing resources.

NIH’s Scientific Data Council charged a group of NIH CDE Task Force members, the CDE Governance Committee (Governance Committee), with developing this endorsement process based on the following criteria:

  • Clear definition of variable/measure with prompt and response 
  • Documented evidence of reliability and validity, where applicable
  • Human- and machine-readable formats
  • Recommended/designated by a recognized NIH body (Institute, Center, Office, Program/Project Committee, etc.)
  • Clear licensing and intellectual property status (Creative Commons or open source preferred)

The role of the Governance Committee is to ensure that evidence of acceptability, reusability, and validity is properly presented and documented.

Submission of CDEs for Endorsement

The Governance Committee determined that CDEs will be submitted either as “Individual CDEs” or “Bundles.” Individual CDEs can be collected on their own. Bundles are groups of questions or variables, each with a specified set of allowable responses, that are grouped together and used as a set. Bundles may include standardized instruments, such as the Patient Health Questionnaire-9 (PHQ-9) depression scale, or questions that must be collected as a group for the individual elements to retain their meaning (e.g., demographic features).
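
As a sketch of why a Bundle must travel as a set, consider the PHQ-9: each of its nine items is scored from 0 to 3, and clinical interpretation rests on the total score across all nine items (0–27). The representation below is hypothetical, and the item labels are shorthand paraphrases, not the instrument's official wording.

```python
# Hypothetical representation of a Bundle: the PHQ-9's nine items are
# only interpretable when collected and scored together as a set.
# Item labels are shorthand paraphrases, not the official instrument text.
phq9_responses = {
    "little_interest_or_pleasure": 2,  # each item scored 0 (not at all) to 3 (nearly every day)
    "feeling_down_or_hopeless": 1,
    "sleep_problems": 3,
    "low_energy": 2,
    "appetite_changes": 1,
    "feeling_bad_about_self": 0,
    "trouble_concentrating": 1,
    "psychomotor_changes": 0,
    "thoughts_of_self_harm": 0,
}

assert len(phq9_responses) == 9
assert all(0 <= score <= 3 for score in phq9_responses.values())

# The clinically meaningful quantity is the total over all nine items (0-27).
total = sum(phq9_responses.values())
print(f"PHQ-9 total score: {total}")  # -> 10
```

Dropping or reworking any single item would change what the total score means, which is exactly why such elements are endorsed and reused as a Bundle rather than individually.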

The Governance Committee will review submissions against the approved endorsement criteria. Once endorsed, Individual CDEs, and potentially Bundles, will be published in the NIH CDE Repository with an endorsement badge.

Reuse of NIH-endorsed CDEs Going Forward

With these governance-endorsed additions to the NIH CDE Repository, its role as a unified resource for common data entities and semantic concepts (the conceptual underpinnings of common data elements themselves) will lay the groundwork for researchers, NIH-funded or otherwise, to plan for interoperable data features. With the endorsement criteria and NLM-led efforts to enhance the NIH CDE Repository as an NIH-wide research resource, its role can grow along with those of related public and private sector alignment efforts. These range from the United States Core Data for Interoperability (USCDI), which supports routine health care, to the Clinical Data Interchange Standards Consortium (CDISC) standards used in FDA submissions for treatments and preventive therapeutics, like vaccines, that we all rely upon for quality care.

Features of the NIH CDE Repository will continue to be enhanced—whether to support searching for semantically related concepts or to highlight subtle distinctions among closely related CDEs. The NIH CDE Repository can also serve as a clearinghouse for interoperability in data from across a broad range of research, from prospectively designed studies to those making use of data captured in the course of clinical care (such as EHRs) and repurposed for real-world evidence.

In the wake of lessons learned from the most challenging aspects of early COVID-19 research, CDE use can increase FAIR data sharing across the research ecosystem in the near-seamless fashion envisioned by legislators when they enacted the 21st Century Cures Act. CDE governance processes are poised to adapt accordingly and to keep working toward greater data interoperability in the post-COVID-19 pandemic era.

CDE Governance Committee Members: Matt McAuliffe (Center for Information Technology), Kerry Goetz (National Eye Institute), Denise Warzel (National Cancer Institute), Erin Ramos (National Human Genome Research Institute), Jyoti Dayal (National Human Genome Research Institute), Deborah Duran (National Institute on Minority Health and Health Disparities), Janice Knable (National Cancer Institute). Chairs: Carolina Mendoza-Puccini (National Institute of Neurological Disorders and Stroke) and Kenneth Wilkins (National Institute of Diabetes and Digestive and Kidney Diseases). Ex Officio members: Robin Taylor, Mike Huerta, Lisa Federer (National Library of Medicine). Collaborator: Greg Farber (National Institute of Mental Health).

To learn more about the NIH Common Data Elements (CDE) Repository, watch this short video.

Dr. Mendoza-Puccini leads the NINDS Common Data Elements Project and is a Program Officer at the NINDS Division of Clinical Research.

Dr. Wilkins is a member of both the NIH-wide and NIDDK-specific Data Science and Data Management Working Groups and engages with researchers from across intramural and extramural programs on quantitative aspects of design and analysis.

NIH Strategically, and Ethically, Building a Bridge to AI (Bridge2AI)

This piece was authored by staff across NIH who serve on the working group for the NIH Common Fund’s Bridge2AI program—a new trans-NIH effort to harness the power of AI to propel biomedical and behavioral research forward.

The evolving field of Artificial Intelligence (AI) has the potential to revolutionize scientific discovery from bench to bedside. The understanding of human health and disease has vastly expanded as a result of research supported by the National Institutes of Health (NIH) and others. Every discovery and advance in contemporary medicine comes with a deluge of data. These large quantities of data, however, still yield restricted, incomplete views into the natural processes underlying human health and disease. These complex processes occur across the “health-disease” spectrum over temporal scales ranging from sub-seconds to years, and over biological scales ranging from the atomic, molecular, cellular, and organ-system levels to the individual and the population. AI provides computational and analytical tools with the potential to connect the dots across these scales, driving discovery and clinical utility from all of the available evidence.

A new NIH Common Fund program, Bridge to Artificial Intelligence (Bridge2AI), will tap into the power of AI to lead the way toward insights that can ultimately inform clinical decisions and individualize care. AI, which encompasses many methods, including modern machine learning (ML), offers potential solutions to many challenges in biomedical and behavioral research.

AI emerged as a field in the mid-20th century and has evolved substantially in the past two decades in terms of its utility for biomedical research. The impact of AI on biomedical and behavioral research and clinical care derives from its ability to use computer algorithms to quickly find connections within large data sets and predict future outcomes. AI is already used to improve diagnostic accuracy, increase efficiency in workflow and clinical operations, and facilitate disease and therapeutic monitoring, to name a few applications. To date, the FDA has approved more than 100 AI-based medical products.
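
In the simplest terms, the "find connections, then predict" pattern looks something like the sketch below, which trains a classifier on purely synthetic data (scikit-learn assumed; real biomedical applications involve far larger and more carefully curated data sets):

```python
# A minimal sketch of the supervised ML pattern described above:
# learn patterns from labeled examples, then predict outcomes for new cases.
# All data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # 500 "patients", 5 measurements each
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # learn patterns from labeled data
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")  # predict new cases
```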

AI-assisted learning and discovery is only as good as the data used to train it. 

The use of AI/ML modeling in biomedical and behavioral research is limited by the availability of well-defined data with which to “train” AI algorithms to recognize patterns. Existing biomedical and behavioral data sets rarely include all necessary information, as they are collected on relatively small samples and lack the diversity of the U.S. population. Characterizing human health requires data from a variety of sources, such as -omics, imaging, behavioral measures, clinical indicators, electronic health records, wearable sensors, and population health summaries. The data generation process itself involves human assumptions, inferences, and biases that must be considered in developing ethical principles surrounding data collection and use. Standardizing collection processes is challenging and requires new approaches and methods. Comprehensive, systematically generated, and carefully collected data are critical to building AI models that provide actionable information and predictive power. Data generation remains among the greatest challenges that must be resolved for AI to have a real-world impact on medicine.

Bridge2AI is a bold new initiative at the National Institutes of Health designed to propel research forward by accelerating AI/ML solutions to complex biomedical and behavioral health challenges whose resolution lies far beyond human intuition. Bridge2AI will support the generation of new biomedically relevant data sets amenable to AI/ML analysis at scale; development of standards across multiple data sources and types; production of tools to accelerate the creation of FAIR (Findable, Accessible, Interoperable, Reusable) AI/ML-ready data; design of skills and workforce development materials and activities; and promotion of a culture of diversity and ethical inquiry throughout the data generation process.

Bridge2AI plans to support several Data Generation Projects and an Integration, Dissemination and Evaluation (BRIDGE) Center to develop best practices for the use of AI/ML in biomedical and behavioral research. For additional information, see NOT-OD-21-021 and NOT-OD-21-022. Keep up with the latest news by visiting the Bridge2AI website regularly and subscribing to the Bridge2AI listserv.

Pictured: Patricia Flatley Brennan, RN, PhD, Director, National Library of Medicine; Michael F. Chiang, MD, Director, National Eye Institute; Eric Green, MD, PhD, Director, National Human Genome Research Institute; Helene Langevin, MD, Director, National Center for Complementary and Integrative Health; and Bruce J. Tromberg, PhD, Director, National Institute of Biomedical Imaging and Bioengineering.

Walking in Each Other’s Shoes: Fostering Interdisciplinary Research Collaboration

Guest post by Teresa Przytycka, PhD, Senior Investigator, Computational and Systems Biology section of the Computational Biology Branch at the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health

If you pose the question – “What is the difference between computational biology and bioinformatics?” – you will get many contradictory answers. Terminology aside, most researchers would agree that the space between traditional biology and traditional computer science is wide enough to accommodate many different models of collaboration between these two groups of researchers.

Bioinformatics analysis, the computational analysis of biological data such as DNA, RNA, and protein sequences, has become a standard step following many types of now-routine experiments. For example, after an experiment measuring gene expression under different conditions, bioinformatics analysis is likely to be used to compare gene expression between those conditions. In this setting, while both experimental and computational components are necessary for the success of the project, only limited interaction is required between the experimentalist (the user of the tool) and the computational expert (the producer of the tool).
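
As an illustration of this user-producer pattern, the sketch below compares one gene's expression between two conditions with a simple two-sample test. The expression values are hypothetical, and a real analysis would rely on a dedicated differential expression tool rather than a bare t-test.

```python
# A toy comparison of one gene's expression between two conditions.
# Expression values are hypothetical, normalized measurements.
from scipy import stats

control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.6, 5.9, 6.4, 6.1]

# Welch's t-test: does mean expression differ between the conditions?
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```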

However, given the richness of biomedical data and the complexity of the relations between various biomolecular entities, such as genes and proteins, researchers increasingly confront questions that cannot be answered through traditional means. In such cases, the user-producer model of collaboration is increasingly replaced by a different model in which biologists and computer scientists work side-by-side to both formulate and answer questions. Such collaborations across disciplines can introduce new perspectives and approaches that spur innovation and open the door to addressing challenging new questions.

Recognizing the need to foster interdisciplinary science, NIH formed an interdisciplinary committee to explore the development of a systems biology center at NIH in 2008. The driving idea behind such a center was to create a space where people of different backgrounds can mix, exchange ideas, and through these exchanges come to solutions to open biomedical questions enabled by interdisciplinary approaches. While this effort did not result in the establishment of a physical center, per se, it did give rise to a network of interdisciplinary collaborations. Interestingly, two of the collaborations NLM started at the time still continue today:  the collaboration with the Center for Cancer Research at the National Cancer Institute focusing on conformational dynamics of DNA structure, and the collaboration with the Laboratory of Cellular and Developmental Biology’s Developmental Genomics Section at the National Institute of Diabetes and Digestive and Kidney Diseases studying various aspects of gene regulation.  

What made these collaborations successful and long-lasting?  

Perhaps most importantly, the process of bringing experimental and computational groups together calls for a willingness to learn each other’s languages, thought processes, and cultures — or the proverbial walking in someone else’s shoes. 

Over the years, we learned to think together, work together, and publish many papers together. For example, in one of our joint projects, our experimental partners collected data that helped the computational group construct a gene regulatory network for the fly, a resource that other researchers studying this model organism can later use. In addition to the joy of advancing discovery together, these collaborations opened doors to foster synergies among young computational and experimental researchers from the collaborating groups. These interactions have enriched their NIH experience in important ways, giving them skills they are likely to find very useful in the future — whether they go on to build their own research groups or pursue other career paths.

I am confident that NLM’s support for interdisciplinary research through scientific collaborations will continue to spur innovation and discovery. It will also help train a generation of researchers who can work seamlessly with people of different scientific backgrounds.

What do you value in your collaborations?

Teresa M. Przytycka, PhD, leads the Algorithmic Methods in Computational and Systems Biology section at the National Center for Biotechnology Information. Dr. Przytycka is particularly interested in the dynamical properties of biological systems, including spatial, temporal and contextual variations, and exploring how such variations impact gene expression, the functioning of biological pathways, and the phenotype of the organism.

NLM Announces New Annual Lecture on Science, Technology, and Society

Guest post by Maryam Zaringhalam, PhD, National Library of Medicine Data Science and Open Science Officer and Mike Huerta, PhD, director of the Office of Strategic Initiatives and associate director of the National Library of Medicine.

In October 2019, NLM invited award-winning science journalist Angela Saini to discuss her research on how bias and prejudice have crept into science. Her lecture examined how racist and sexist ideas have permeated science over its history — and how science, in turn, has been contorted to justify and perpetuate pseudoscientific myths of innate inferiority. Saini’s work and insights sparked a crucial conversation within NLM about our role and responsibility, as the world’s largest biomedical library and a leader in data science research situated within the nation’s premier medical research agency, to question how systemic biases affect our work and to determine how we can correct them.

As advancing equity and rooting out structural discrimination in science and technology have become increasingly urgent federal priorities, NLM will build on this discussion, in part, by launching an annual NLM Science, Technology, and Society Lecture on March 1, 2021.

Situated at the nexus of the NIH-supported research community and the public, NLM plays a vital role not only in advancing cutting-edge research, but also in acting as a steward of biomedical information in service of society. As leaders in facilitating and shaping the future of biomedical data science, we must understand the implications of our work for society as a whole. We must, for instance, question how biases may creep into algorithms that connect research results with the public and think through the ethical ramifications of emerging technologies that might reinforce and amplify those biases. As a national library, we serve as curators of the history of biomedical science, which must reflect both the great achievements made possible by research and the injustices committed within the scientific community. And as an institution with more than 8,000 points of presence through our Network of the National Library of Medicine, we have the means to fulfill our responsibility to meet the needs and understand the concerns of the communities we serve.

With these responsibilities and NLM’s unique role and capabilities in mind, the NLM Science, Technology, and Society Lecture aims to raise awareness of the societal and ethical implications of the conduct of biomedical research and the use of advanced technologies, while seeding conversations across the Library, NIH, and the broader biomedical research community. NLM sees such considerations as fundamental to advancing biomedical discovery and human health for the benefit of all.

Dr. Kate Crawford is the inaugural Visiting Chair of AI and Justice at the École Normale Supérieure, as well as a Senior Principal Researcher at Microsoft Research, and the cofounder of the AI Now Institute at New York University.

Each spring, we plan to invite a leading voice working at the intersection of biomedicine, data science, ethics, and justice to present their research and discuss how it relates to the mission and vision of NLM, as well as NIH more broadly. This year, we are pleased to host Dr. Kate Crawford, a leading scholar of science, technology, and society with over 20 years of experience studying large-scale data systems and artificial intelligence (AI) in the wider contexts of history, politics, labor, and the environment. Her lecture, “Atlas of AI: Mapping the social and economic forces behind AI,” will explore how machine learning systems can reproduce and intensify forms of structural bias and discrimination, and will offer new paths for thinking through the research ethics and policy implications of the turn to machine learning.

As the interests, priorities, and concerns of our society continue to evolve, particularly in response to emerging technologies and shifting national conversations, we hope this annual lecture, alongside established lecture series such as NLM History Talks, will provide an invaluable perspective on the societal implications of our work and further establish NLM’s leadership as a trusted partner in health.

Dr. Zaringhalam is a member of the Office of Strategic Initiatives and is responsible for monitoring and coordinating data science and open science activities and development across NLM, NIH, and beyond. She completed her PhD in molecular biology at Rockefeller University in 2017 before joining NLM as an AAAS Science and Technology Policy Fellow.

Dr. Huerta leads NLM in identifying, implementing, and assessing strategic directions of NLM, including at the intersection of data science and open science. In his 30 years at NIH, he has led many trans-NIH research initiatives and helped establish neuroinformatics as a field. Dr. Huerta joined NIH’s National Institute of Mental Health in 1991, before moving to NLM in 2011.

A Journey to Spur Innovation and Discovery

Guest post by Valerie Schneider, PhD, staff scientist at the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

It’s been said that nature is the best teacher. When it comes to understanding human biology and improving health, examples abound of the advances that have been made from the study of a diverse set of non-human organisms. Over the last two centuries, the study of nematode worms has taught us about longevity and mRNAs (the biological molecule that is the basis for several COVID-19 vaccines), common fungi about cell division and cancer, and fruit flies about many things, from the role of chromosomes in heredity to our circadian rhythms. The ability to create targeted alterations in the genomes of model organisms has been transformative for studies to establish the function of specific genes in the etiology of human disease.

The modern era of genomic biology, in which genome sequencing and assembly are accessible to more researchers than ever before, provides data from an even greater range of organisms from which we might learn. Today, we rely not only on primate models, but on a whole host of species: for example, swine to understand organ transplantation, songbirds to understand vocalization and learning, and bats and pangolins to teach us about the evolution of the SARS-CoV-2 virus and how to fight its spread.

These rapidly growing collections of sequence and other data on species across the tree of life offer enormous promise for discoveries that have the potential to improve human health. To better enable such discoveries, with the support of NIH, NLM is planning a major modernization of its resources and their underlying infrastructure.

This modernization will support the needs of users engaged in data search and retrieval, gene annotation, evaluation of sequence quality, and comparative analyses. The new infrastructure, user interfaces, and tools should result in an improved experience for researchers doing a wide range of work, and also facilitate better data submissions.

This revamping aligns with NIH’s Strategic Plan for Data Science, which provides a roadmap for modernizing the NIH-funded biomedical data science ecosystem, as well as NLM’s Strategic Plan, which furthers NLM’s commitment to provide data and information to accelerate biomedical discovery and improve health. NLM and NIH are committed to providing researchers with modern, stable, and cloud-oriented technologies that support research needs.

Over the last few years, NLM has demonstrated this commitment by re-designing several flagship products, including the PubMed database for searching published biomedical literature, the ClinicalTrials.gov database of information on privately and publicly funded clinical trials, and the Basic Local Alignment Search Tool (BLAST) for finding regions of similarity between biological sequences. As part of NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM also made the data from its massive (36-petabyte) Sequence Read Archive (SRA) available on two commercial cloud platforms, facilitating large-scale computational research that would otherwise be difficult for many researchers. Revamping these resources has positioned them to support both the current and future needs of NLM’s diverse audience of researchers, clinicians, data scientists, educators, and others.
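
For a sense of what programmatic access to one of these resources looks like, here is a minimal sketch that submits a nucleotide query to NCBI's web BLAST service via Biopython (Biopython and network access assumed; the query sequence is hypothetical):

```python
# Submit a hypothetical nucleotide query to NCBI's web BLAST service.
# Requires Biopython and network access; web BLAST is rate-limited,
# so large-scale searches should use standalone BLAST+ instead.
from Bio.Blast import NCBIWWW, NCBIXML

query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # hypothetical sequence

result_handle = NCBIWWW.qblast("blastn", "nt", query)  # BLASTN vs. the public 'nt' database
record = NCBIXML.read(result_handle)

# Report the top three alignments and their E-values.
for alignment in record.alignments[:3]:
    print(alignment.title, "E-value:", alignment.hsps[0].expect)
```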

Importantly, this initiative to modernize NLM products, tools, and services, while concurrently developing content, will include extensive engagement with the research community, just as with previous re-design efforts. NLM is committed to offering interfaces accessible to both novices and experts. Additionally, NLM believes that a key part of the next generation of its data resources is an infrastructure that supports an ongoing, dynamic exchange of content, including contributions of metadata and gene functional information from knowledge builders in the community to complement and enhance NIH-provided content.

Community engagement will also ensure that externally sourced content is provided in ways that maintain the high value and trustworthiness of the datasets. Additionally, data connections that make the content of this new resource accessible to external knowledgebases containing other datatypes, such as images, will further promote integrative data analyses that support scientific discovery.

Many opportunities exist to streamline processes, look across resources, and gain insights that will provide new ways of learning. Through NLM’s continued commitment to modernization initiatives, we are ready to again improve the user experience for accessing, analyzing, and visualizing sequence data and related information. Nature continues to be our best teacher — and we are now poised to learn from her in an exciting new classroom.

We invite you to come on this journey with us.

Valerie Schneider, PhD, is the deputy director of Sequence Offerings and the head of the Sequence Plus program. In these roles, she coordinates efforts associated with the curation, enhancement, and organization of sequence data, as well as oversees tools and resources that enable the public to access, analyze, and visualize biomedical data. She also manages NCBI’s involvement in the Genome Reference Consortium, the international collaboration tasked with maintaining the value of the human reference genome assembly.