Around the world, scientists, public health officials, medical professionals, and others are working to address the coronavirus pandemic.
At NLM, we’ve been working on multiple fronts to improve researchers’ understanding of SARS-CoV-2 (the novel coronavirus) and aid in the response to COVID-19 (the disease caused by that virus). By enhancing access to relevant data and information, NLM is demonstrating how libraries can contribute in real time to research and response efforts during this crisis.
Harnessing the Power of PubMed Central® to Enable Data Science Research
With new initiatives launched in the past two weeks, NLM is using PubMed Central®, our digital archive of peer-reviewed biomedical and life sciences journal literature, to expand access to full-text articles related to coronavirus. These activities build on recent requests from the White House Office of Science and Technology Policy (OSTP) and science policy leaders of other nations calling on the global publishing community to make all COVID-19-related research publications and data immediately available to the public in forms that support automated text-mining.
NLM has stepped up its collaboration with publishers and scholarly societies to increase the number of coronavirus-related journal articles in PMC, along with the available data supporting them. NLM is adapting its standard procedures for depositing articles into PMC to make it easier and faster to submit articles in machine-readable formats. We’re also engaging with journals and publishers that do not participate in PMC but whose publications are within the scope of the Library’s collection. A growing number of publishers and societies are taking advantage of these flexibilities.
Submitted publications are being made available as quickly as possible after publication for discovery in PMC and through the PMC Text Mining Collections for machine analysis, secondary analysis, and other types of reuse.
This enhanced collection of text-minable content enables AI and machine-learning researchers to develop and apply novel text-mining approaches that can help answer some of the many questions about coronavirus.
Along these lines, NLM and leaders across the technology sector and academia joined OSTP on Monday, March 16, to announce the COVID-19 Open Research Dataset (CORD-19). Hosted by the Allen Institute for AI, CORD-19 is a free and growing resource that was launched with more than 29,000 scholarly articles about COVID-19 and the coronavirus family of viruses. CORD-19 represents the most extensive machine-readable coronavirus literature collection available for text mining to date. This dataset enables researchers to apply novel AI and machine learning strategies to identify new knowledge to help end the pandemic. Researchers can submit text and data mining tools to be applied to the dataset via the Kaggle platform.
Providing Other Resources
NLM’s efforts to support coronavirus-related research and response efforts are not limited to PMC. Key among the Library’s other important resources are:
- NLM’s GenBank Sequence Database — NLM created the Severe acute respiratory syndrome coronavirus 2 data hub, where people can search for, retrieve, and analyze sequences of the virus that have been submitted to GenBank. Our GenBank team is expediting the processing of all SARS-associated coronavirus sequences. Depending on the quality of submitted sequences, they are being annotated and released to the public as fast as 20 minutes after receipt and, in almost all cases, within 24 hours of receipt.
- NLM’s Sequence Read Archive (SRA) — NLM’s SRA is the world’s largest publicly available repository of unprocessed sequence data, which can be mined for previously unrecognized pathogen sequences. For example, a team from Stanford University recently reported that in a search of certain metagenomic datasets in the SRA, they identified a 2019-nCoV-like coronavirus in pangolins (long-snouted mammals). This type of genetic sequence research can play an important role in understanding how the virus originated and is spreading.
NLM has also made the SRA available through commercial cloud providers participating in the National Institutes of Health’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. Researchers may now compute across the full SRA dataset for metagenomic research on the new coronavirus in ways not previously possible.
- NLM Intramural Research Contributions — NLM has a multidisciplinary group of researchers composed of molecular biologists, biochemists, computer scientists, mathematicians, and others working on a variety of problems, including some that relate to SARS-CoV-2/COVID-19. One such project is LitCovid, a resource that tracks COVID-19-specific literature published since the outbreak. This resource builds on NLM research to develop new approaches to locating and indexing the literature related to COVID-19, including a text classification algorithm for screening and ranking relevant documents, topic modeling for suggesting relevant research categories, and information extraction for obtaining geographic location(s) found in the abstract.
NLM is also providing targeted searches within several of its other information resources to help users find data and information relevant to COVID-19. These searches, available through the NLM home page, include information on clinical studies related to COVID-19 listed in ClinicalTrials.gov, and articles related to SARS-CoV-2/COVID-19 in PubMed, NLM’s database of citations and abstracts for more than 30 million journal articles and online books.
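For readers who prefer to run a similar PubMed search programmatically, the following is a minimal sketch that builds a request URL for NCBI's E-utilities ESearch service. The search term shown is only an illustrative assumption, not the query behind NLM's targeted searches, and the helper name `build_esearch_url` is hypothetical. The sketch only constructs the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL for `term` in the given Entrez database.

    `build_esearch_url` is a hypothetical helper written for this example.
    """
    params = {
        "db": db,           # Entrez database to search, e.g. "pubmed" or "pmc"
        "term": term,       # the search expression
        "retmode": "json",  # ask for a JSON response
        "retmax": retmax,   # maximum number of record IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative term only; adjust it to suit your own question.
    print(build_esearch_url("COVID-19 OR SARS-CoV-2"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a list of matching PubMed record identifiers. Anyone planning frequent requests should first check NCBI's current usage guidelines.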
These resources are already proving to be useful to the scientific community and others working together to address this public health threat.
We continue to seek new ways to support these efforts and demonstrate how libraries can best contribute. Please send us your ideas for how NLM can help you respond to the global health crisis, or tell us how your library is contributing.