I’ve been inspired, but not surprised, to see all the incredible work that’s going on across the NLM and the National Institutes of Health (NIH) to respond to the challenges presented by COVID-19. At NLM, we’ve been working on multiple fronts to improve researchers’ understanding of the novel coronavirus (SARS-CoV-2) and the disease it causes (COVID-19). We were fortunate to receive $10 million as part of the Coronavirus Aid, Relief, and Economic Security (CARES) Act, which provides emergency funding for federal agencies to combat the coronavirus outbreak.
NLM is using this funding to improve the quality of clinical data for research and care; to accelerate research, including phenotyping, image analysis, and real-time surveillance; and to enhance access to COVID-19 literature and molecular data resources.
Improving quality of clinical data for research and care
The novel coronavirus is driving a need for standardized COVID-19 terminology and data exchange that will allow clinicians and scientists to communicate more effectively and consistently. NLM supports health data standards, and we are using supplemental funds to support the addition of codes for COVID-19-related laboratory tests within LOINC (Logical Observation Identifiers Names and Codes) and to provide implementation guidelines and training in the use of the standards.
We are also expanding our ability to process and distribute new codes for major terminology sources used by health care providers, electronic health record systems, and commercial health care systems, which are vital to monitoring and measuring COVID-19 patient outcomes. More specifically, NLM is enabling sharing of COVID-19 terminology updates through the Value Set Authority Center (VSAC), which makes value sets and clinical terminologies available. Value sets are lists of codes drawn from standard terminologies that represent specific concepts or conditions. They are used in electronic clinical quality measures and to define patient cohorts, classes of interventions, or patient outcomes. This important work will facilitate the analysis of electronic health record data and support effective and interoperable health information exchange.
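To make the idea of a value set concrete, here is a minimal sketch of how one might be used to define a patient cohort. Everything in it is an assumption for illustration: the `Code`, `Observation`, and `cohort` names are hypothetical, the codes are placeholders rather than real LOINC identifiers, and this is not how VSAC itself represents or distributes value sets.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Code:
    """A single code from a standard terminology (hypothetical structure)."""
    system: str  # terminology the code comes from, e.g. "LOINC"
    code: str    # identifier within that terminology


@dataclass(frozen=True)
class Observation:
    """A simplified lab result as it might appear in health record data."""
    patient_id: str
    code: Code
    result: str


# Hypothetical value set: placeholder codes, NOT real LOINC identifiers.
COVID_LAB_TESTS: FrozenSet[Code] = frozenset({
    Code("LOINC", "EXAMPLE-0001"),
    Code("LOINC", "EXAMPLE-0002"),
})


def cohort(observations: Iterable[Observation],
           value_set: FrozenSet[Code],
           result: str = "positive") -> Set[str]:
    """Return IDs of patients with the given result for any code in the value set."""
    return {
        obs.patient_id
        for obs in observations
        if obs.code in value_set and obs.result == result
    }


sample = [
    Observation("p1", Code("LOINC", "EXAMPLE-0001"), "positive"),
    Observation("p2", Code("LOINC", "EXAMPLE-0002"), "negative"),
    Observation("p3", Code("LOINC", "EXAMPLE-9999"), "positive"),  # not in the value set
]
print(cohort(sample, COVID_LAB_TESTS))
```

Because the cohort is defined by membership in the value set rather than by a hard-coded list of tests, adding a newly published code to the value set updates every analysis that relies on it.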
NLM is updating terminology for coronavirus-related drugs and chemicals through resources such as the Medical Subject Headings (MeSH) used for indexing and cataloging biomedical literature, and ChemIDplus, a dictionary of over 400,000 chemicals (names, synonyms, and structures). This work aligns terminology to facilitate the identification of chemicals and drugs used to treat, detect, and prevent COVID-19 and other coronavirus-related infections, including severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS).
Accelerating research including phenotyping, image analysis, and real-time surveillance
NLM’s vibrant intramural and extramural research programs are conducting and supporting research to advance the understanding of the novel coronavirus. Our intramural research program is using virus genomics, health data, and social media data to identify community spread of COVID-19. Our researchers are applying machine learning and artificial intelligence techniques to chest X-rays to differentiate viral pneumonia from bacterial pneumonia – expanding knowledge of the process of the SARS-CoV-2 viral infection and assisting in the identification of best practices for diagnosis and care of COVID-19 patients. NLM research in natural language processing contributed to development of LitCovid, a curated literature hub for tracking scientific publications about the novel coronavirus. It provides centralized access to more than 13,500 relevant articles in PubMed, categorizes them by research topic and geographic location, and is updated daily.
Our extramural research program is focusing on novel informatics and data science methods to rapidly improve the understanding of SARS-CoV-2 infection and of COVID-19. In April, NLM issued two Notices of Special Interest (NOT-LM-010 and NOT-LM-011) seeking applications (due in June) in these areas: the mining of clinical data for ‘deep phenotyping’ (gathering fine-grained details about how a disease presents itself in an individual) to identify or predict the presence of COVID-19; and public health surveillance methods that mine genomic, viromic, health, and environmental data, or data from other pertinent sources such as social media, to identify the spread and impact of SARS-CoV-2.
Enhancing access to COVID-19 literature and molecular data resources
NLM is also improving access to published coronavirus literature via PubMed Central (PMC). Science and technology advisors from a dozen countries called on publishers and scholarly societies to make their COVID-19 and coronavirus-related publications immediately accessible in PMC, along with the available data supporting them. In response, nearly 50 publishers have deposited more than 46,000 coronavirus-related articles in PMC with licenses that allow re-use and secondary analysis. Articles in the collection have been accessed more than 8 million times since March 18. NLM will use supplemental funds to improve the article-submission system to better accommodate publisher submissions and accelerate release of these critically important articles. In PubMed, NLM supplemental funds will support integrating LitCovid metadata. Novel sensors are being developed to leverage LitCovid metadata when directing users to curated COVID-19 content. The new infrastructure will permit PubMed to rapidly add more disease-specific sensors in the future.
On January 12, 2020, NLM’s GenBank, the world’s largest genetic sequence database, released the first SARS-CoV-2 sequence to the public. On January 25, in collaboration with the Centers for Disease Control and Prevention (CDC), it released the first sequence collected in the United States. As of May 7, GenBank has 3,893 publicly available SARS-CoV-2 sequences from 42 different countries. We created a special site, the “Severe acute respiratory syndrome coronavirus 2 data hub,” where people can search, retrieve, and analyze sequences of the virus that have been submitted to the GenBank database. In late March, we joined the CDC-led SPHERES consortium, a national genomics consortium that aims to coordinate U.S. SARS-CoV-2 sequencing efforts and make data publicly available in NLM’s GenBank and Sequence Read Archive (SRA), and other appropriate repositories. Supplemental funds will allow GenBank to further enhance the submission workflow, establish and promote use of metadata sample standards, and develop a fully automated SARS-CoV-2 submission workflow that incorporates quality checks, as well as ‘automated curation’, to provide standardized annotation of the SARS-CoV-2 genomes submitted to GenBank.
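As a rough illustration of what an automated quality check on a submitted genome could look like, here is a minimal sketch. It is not GenBank's actual workflow: the function names and both thresholds are assumptions chosen for the example. The only external fact it relies on is that the SARS-CoV-2 reference genome (NC_045512.2) is 29,903 nucleotides long.

```python
from typing import Dict, List

REFERENCE_LENGTH = 29_903       # length of the SARS-CoV-2 reference genome
MIN_LENGTH_FRACTION = 0.95      # assumed threshold, for illustration only
MAX_AMBIGUOUS_FRACTION = 0.01   # assumed threshold, for illustration only


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping of record name to uppercase sequence."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def quality_issues(seq: str, reference_length: int = REFERENCE_LENGTH) -> List[str]:
    """Return a list of problems found in a sequence; an empty list means it passed."""
    issues = []
    if len(seq) < MIN_LENGTH_FRACTION * reference_length:
        issues.append("too short")
    # Count every character that is not an unambiguous base (e.g. N).
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    if seq and ambiguous / len(seq) > MAX_AMBIGUOUS_FRACTION:
        issues.append("too many ambiguous bases")
    return issues


fasta = ">sample1 example record\nACGTACGT\nACGTNNNN\n"
for name, seq in parse_fasta(fasta).items():
    # A tiny reference length is passed so the toy sequence is not flagged as short.
    print(name, quality_issues(seq, reference_length=16))
```

Checks like these are cheap to run at submission time, so a submitter could be told about a truncated or low-coverage genome before it enters curation.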
SRA is positioned as a ready-made computational environment for public health surveillance pipelines and tool development. SRA metagenomic datasets from both environmental samples and patients diagnosed with COVID-19 can reveal patterns of co-occurring pathogens, newly emerging outbreaks, and viral evolution. NLM supplemental funds are being used to prototype SRA cloud-based analysis tools to search the entirety of the SRA database. These tools can provide efficient search for SARS-CoV-2, identify genetic patterns, and monitor newly submitted data for specific viral patterns.
NLM supplemental funding also supports the identification and selection of web and social media content documenting COVID-19 as part of NLM’s Global Health Events web archive collection. This content documents life in quarantine, prevention measures, the experiences of health care workers, patients, and more. We are also participating as an institutional contributor to a broader International Internet Preservation Consortium (IIPC) Novel Coronavirus outbreak web archive collection.
These are some of the many investments that NLM is making with this emergency funding. I will keep you updated as we continue to make progress on these initiatives.
Researchers: how can you envision using these tools in your own work? What else would be helpful? Let me know in the comments!