Continuous Innovation Framework for NLM’s Biomedical Data Repositories

Guest post by Kim D. Pruitt, PhD, Acting Director of NLM’s National Center for Biotechnology Information and Associate Director for Scientific Data, and Valerie Schneider, PhD, Acting Chief of the Information Engineering Branch, NLM National Center for Biotechnology Information.

NLM manages many biomedical information services that support the scientific enterprise and recognizes the importance of ongoing investment and innovation. Underlying data repositories, system architectures, and web interfaces have limited lifespans and consequently do not last “to infinity and beyond.” After their initial design and launch, factors such as changing user needs, growing data volumes, evolving design and interface practices, and a shifting cybersecurity environment all mean these resources must be updated to remain relevant and meet new challenges. To do so, we invest both in incremental feature expansions over time and in periodic modernization initiatives that reengineer the entire platform technology stack.

We have pursued several such modernization initiatives in recent years, leveraging cloud resources to give NLM more options to flexibly meet data growth, fulfill high usage demands, and provide more responsive user services. Adopting cloud technology and similar strategies reflects NLM’s commitment to keeping its products and services innovative.

Wrapping up the redesign of ClinicalTrials.gov

NLM is nearing the end of an initiative to redesign the ClinicalTrials.gov website and the supporting Protocol Registration and Results Submission (PRS) system. This effort integrates innovative technical solutions such as cloud technology and artificial intelligence. New features include a secure and reliable public website that is responsive on mobile devices and allows for future growth, as well as a modern search engine that analyzes large amounts of data and delivers fast search response times.
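For a flavor of the redesigned programmatic access, here is a minimal sketch of a search against the modernized site’s public REST API (v2). The endpoint, parameter, and field names below reflect the public v2 API as we understand it and should be verified against the current ClinicalTrials.gov API documentation.

```python
import requests

# Query the modernized ClinicalTrials.gov search API (v2).
# Endpoint and parameter names follow the public v2 REST API;
# verify against the current documentation before relying on them.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

def search_studies(condition: str, page_size: int = 10) -> list[dict]:
    """Return one page of study records matching a condition term."""
    params = {
        "query.cond": condition,  # free-text condition/disease query
        "pageSize": page_size,    # number of studies per page
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("studies", [])

if __name__ == "__main__":
    for study in search_studies("diabetes"):
        ident = study["protocolSection"]["identificationModule"]
        print(ident["nctId"], ident["briefTitle"])
```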

Another improvement currently undergoing testing is a machine learning-based approach to support the review of information submitted to the PRS system before its release to the public-facing website. Natural language processing algorithms automatically scan submitted study outcome measures for several types of common major reporting issues. This new approach is expected to reduce the time needed to identify common issues across submissions and ultimately improve the overall quality of the information posted on ClinicalTrials.gov.
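This post does not detail the underlying models, but the general shape of this kind of automated screening can be sketched as follows. The checks and field names here are purely illustrative stand-ins for the trained NLP components and the actual PRS submission schema.

```python
import re
from dataclasses import dataclass

# Purely illustrative: the checks below are simple stand-ins for the
# NLP models described in the post, and the field names are
# hypothetical rather than the actual PRS schema.

@dataclass
class Flag:
    field: str
    issue: str

def screen_outcome_measure(measure: dict) -> list[Flag]:
    """Flag common reporting issues in a submitted outcome measure."""
    flags = []
    title = measure.get("title", "")
    unit = measure.get("unit_of_measure", "")
    time_frame = measure.get("time_frame", "")

    # A time frame with no explicit time point is a common issue.
    if not re.search(r"\d", time_frame):
        flags.append(Flag("time_frame", "no explicit time point"))
    # The unit of measure should not simply restate the measure title.
    if unit and unit.lower() in title.lower():
        flags.append(Flag("unit_of_measure", "unit restates the title"))
    if not unit:
        flags.append(Flag("unit_of_measure", "missing unit of measure"))
    return flags

if __name__ == "__main__":
    submission = {
        "title": "Change in systolic blood pressure",
        "unit_of_measure": "",
        "time_frame": "study duration",
    }
    for flag in screen_outcome_measure(submission):
        print(flag.field, "->", flag.issue)
```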

Four years into the CGR initiative

The NIH Comparative Genomics Resource (CGR) is designed to maximize the impact of eukaryotic genomic data on biomedical research and create new possibilities for scientific advancement. The CGR initiative, now in its fourth year, uses community collaboration to create reliable comparative genomics analyses for all eukaryotic organisms and provides a toolkit of genomic resources to improve data management, visualization, analysis, and delivery.

NCBI Datasets, a component of the CGR toolkit, delivers new web-based and programmatic interfaces that allow users to retrieve genome packages containing a diverse, user-configurable selection of data files and a metadata report; previously, obtaining this information required multiple queries and a deep understanding of NLM databases.
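As a rough illustration, the sketch below retrieves assembly metadata in a single call through the NCBI Datasets REST API. The v2alpha endpoint path and response fields are taken from the public API documentation and may change as the service evolves, so check the NCBI Datasets documentation before relying on them.

```python
import requests

# Fetch assembly metadata through the NCBI Datasets REST API.
# The v2alpha endpoint path and response fields below reflect the
# public API as documented at the time of writing.
API = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"

def genome_report(accession: str) -> dict:
    """Return the dataset report for one assembly accession."""
    url = f"{API}/genome/accession/{accession}/dataset_report"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    reports = response.json().get("reports", [])
    return reports[0] if reports else {}

if __name__ == "__main__":
    # GRCh38 human reference assembly accession.
    report = genome_report("GCF_000001405.40")
    print(report.get("organism", {}).get("organism_name"))
    print(report.get("assembly_info", {}).get("assembly_name"))
```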

NLM is also increasing the quality of genomic data submitted to GenBank to make the data more useful for all. The Foreign Contamination Screen genome cross-species aligner (FCS-GX) is a new cloud-compatible tool that lets data submitters detect sequences from unintended sources in the genomes they produce. This ensures, for example, that the genome from an earthworm does not accidentally include sequences from microorganisms found in the same soil. The CGR toolkit also includes the Comparative Genomics Viewer (CGV), an entirely new tool that enables researchers to visually explore alignments between genome sequences from different organisms and identify potentially meaningful biological differences.
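FCS-GX itself is a compiled tool distributed by NCBI, but the core idea of contamination calling from cross-species alignment results can be sketched conceptually. The inputs, names, and threshold below are hypothetical and do not mirror FCS-GX’s actual formats or algorithm.

```python
# Conceptual sketch only: illustrates contamination calling from
# cross-species alignment hits with hypothetical inputs, not the
# actual FCS-GX formats (see https://github.com/ncbi/fcs).

EXPECTED_DIVISION = "worms"  # taxonomic division of the declared source organism

def flag_contaminants(contig_hits: dict[str, list[tuple[str, float]]],
                      min_coverage: float = 0.5) -> list[str]:
    """Flag contigs whose alignment coverage is dominated by foreign taxa.

    contig_hits maps contig id -> [(taxonomic_division, coverage_fraction)].
    """
    contaminated = []
    for contig, hits in contig_hits.items():
        foreign = sum(cov for div, cov in hits if div != EXPECTED_DIVISION)
        if foreign >= min_coverage:
            contaminated.append(contig)
    return contaminated

# An earthworm assembly contig dominated by bacterial alignments
# would be flagged for removal before GenBank submission.
hits = {
    "contig_1": [("worms", 0.9)],
    "contig_2": [("bacteria", 0.8), ("worms", 0.05)],
}
print(flag_contaminants(hits))  # ['contig_2']
```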

dbGaP and responsible research data sharing

This framework of continuous innovation is also being applied to the database of Genotypes and Phenotypes (dbGaP). dbGaP provides controlled access to data from large-scale studies that examine the relationship between human genes (genotype information) and observable characteristics (phenotypes). Many people, motivated by the opportunity to advance scientific knowledge and help alleviate human suffering, generously consent to participate in research studies on cancer, diabetes, genetic disorders, and other diseases. NIH responsibly manages these data with both technical and policy systems to balance research use with privacy protections for study participants.

dbGaP is a key node in this landscape of responsible research participant data management and sharing. dbGaP now holds data from more than 2,500 studies, comprising 3.8 million research participants who have consented to data sharing. When scientific data are shared with other researchers as a community resource, the impact can go far beyond the goals of the original study. By taking additional steps to organize and share data for broader use, the data become a renewable resource that drives a continuous cycle of innovation, like a waterwheel that powers scientific advances with each turn.

Consistent with a broader effort to coordinate controlled data access across NIH, we are modernizing the dbGaP controlled access approval processes by streamlining the infrastructure, interfaces, and functionality that support searching for data in dbGaP, requesting access to it, and reviewing those requests. We are also developing and improving dashboards and reports so researchers can quickly and reliably find studies and data relevant to their needs. The overarching goal is to reduce burden at every step: for researchers identifying datasets relevant to their projects and requesting access to them, for the relevant NIH Data Access Committees reviewing requests and determining appropriate access, and for NIH officers monitoring the overall system. This effort will make it easier to apply for access to research study data, streamline the user experience, and enhance system performance.
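One way to picture the workflow being streamlined is as a simple request lifecycle. The states and transitions below are a hypothetical sketch for illustration only, not a description of dbGaP’s internal system.

```python
from enum import Enum, auto

# Hypothetical sketch of a controlled-access request lifecycle;
# state names are illustrative and do not mirror dbGaP internals.

class RequestState(Enum):
    DRAFT = auto()
    SUBMITTED = auto()
    UNDER_DAC_REVIEW = auto()  # with the relevant NIH Data Access Committee
    APPROVED = auto()
    REJECTED = auto()

# Allowed transitions in the review workflow.
TRANSITIONS = {
    RequestState.DRAFT: {RequestState.SUBMITTED},
    RequestState.SUBMITTED: {RequestState.UNDER_DAC_REVIEW},
    RequestState.UNDER_DAC_REVIEW: {RequestState.APPROVED, RequestState.REJECTED},
}

def advance(state: RequestState, new_state: RequestState) -> RequestState:
    """Move a data access request to a new state, enforcing the workflow."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state.name} to {new_state.name}")
    return new_state

if __name__ == "__main__":
    state = RequestState.DRAFT
    state = advance(state, RequestState.SUBMITTED)
    state = advance(state, RequestState.UNDER_DAC_REVIEW)
    print(state.name)  # UNDER_DAC_REVIEW
```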

Our continued commitment to innovation

NLM continues to enhance its information services in response to changing usage patterns, evolving scientific needs, customer feedback, and infrastructure requirements. NLM recognizes the importance of ongoing investment and innovation to modernize its data infrastructure and continuously improve the NLM resources and services that best support our customers’ needs and drive scientific discovery.

Kim D. Pruitt, PhD

Acting Director, NLM National Center for Biotechnology Information and Associate Director for Scientific Data, NLM

Dr. Pruitt serves as Acting Director of the NLM National Center for Biotechnology Information (NCBI), where she is responsible for NCBI’s strategic directions, public services, and budget. She is also NLM’s Associate Director of Scientific Data. Prior to this, Dr. Pruitt was Chief of the NCBI Information Engineering Branch, which builds and manages NLM public services such as GenBank, the Sequence Read Archive (SRA), BLAST, ClinicalTrials.gov, PubMed, and more. She previously developed the NCBI RefSeq resource. She holds a PhD in genetics and development from Cornell University.

Valerie Schneider, PhD

Acting Chief, Information Engineering Branch, National Center for Biotechnology Information, NLM

Dr. Schneider oversees the collection, creation, analysis, organization, curation, and dissemination of data and analysis tools in the areas of molecular biology and genetics, as well as the collection and management of bibliographic information. Previously, Dr. Schneider served as the Deputy Director of Sequence Offerings and the head of the Sequence Plus Program at NCBI. In those roles, she coordinated efforts associated with the curation, enhancement, and organization of sequence data and oversaw tools and resources that enable the public to access, analyze, and visualize biomedical data. She earned a PhD in Biological and Biomedical Sciences from Harvard University.