Fostering a Culture of Scientific Data Stewardship

Guest post by Jerry Sheehan, Deputy Director, National Library of Medicine.

Making research data broadly findable, accessible, interoperable, and reusable is essential to advancing science and accelerating its translation into knowledge and innovation. The global response to COVID-19 highlights the importance and benefits of sharing research data more openly.

The National Institutes of Health (NIH) has long championed policies that make the results of research available to the public. Last week, NIH released the NIH Policy for Data Management and Sharing (DMS Policy) to promote the management and sharing of scientific data generated from NIH-funded or conducted research. This policy replaces the 2003 NIH Data Sharing Policy.

The DMS Policy was informed by public feedback and requires NIH-funded researchers to plan for the management and sharing of scientific data. It also makes clear that data sharing is a fundamental part of the research process.

Data sharing benefits the scientific community and the public.

For the scientific community, data sharing enables researchers to validate scientific results, increasing transparency and accountability. Data sharing also strengthens collaborations that allow for richer analyses. Strong data-sharing practices facilitate the reuse of hard-to-generate data, such as those acquired during complex experiments or once-in-a-lifetime events like natural disasters or pandemics.

For the public, sound data-sharing practices demonstrate good stewardship of taxpayer funds. Clear, well-written data sharing and management plans promote transparency and accountability to society. They also expand opportunities for data to be accessed and reused by clinicians, students, educators, and innovators in health care and other sectors of the economy.

As an organization dedicated to improving access to data and information to advance biomedical sciences and public health, NLM plays a key role in implementing the new policy and supporting researchers in meeting its requirements. NLM maintains a number of data repositories, such as the Sequence Read Archive and ClinicalTrials.gov, that curate, preserve, and provide access to research data. NLM also maintains a longer list of NIH-supported data repositories that accept different types of data (e.g., genomic, imaging) from different research domains (e.g., cancer, neuroscience, behavioral sciences). Where appropriate domain-specific repositories do not exist, NLM has made clear how researchers can include small datasets (<2GB) with articles deposited in NLM’s PubMed Central (PMC) under the NIH Public Access Policy.

NLM also works with the broader library community to support improved data management and sharing. Supplemental information issued with the new policy makes it clear that research budgets can include costs of data management and sharing, such as those for data curation, formatting data to accepted standards, attaching metadata to foster discoverability, and preparing data for storage in a repository. These are the kinds of services increasingly provided by libraries and librarians in universities and academic medical centers across the country. NLM, through the Network of the National Library of Medicine, offers training in data management and data literacy to health science, public, and other librarians to expand capacity for these important services.

NIH’s DMS Policy applies to all research, funded or conducted in whole or in part by NIH, that results in the generation of scientific data. This includes research funded or conducted through extramural grants, contracts, intramural research projects, or other funding agreements. The DMS Policy does not apply to research and other activities that do not generate scientific data, including training, infrastructure development, and non-research activities.

NIH will continue to engage the research community to support the change and implementation of this new policy, which will go into effect in January 2023. NLM will continue to work within NIH and across the library and information science communities to develop innovative ways to support the policy and advance the effective stewardship of research data. Let us know how else we can support this important policy advance.

Read more about this major policy release in NIH’s Under the Poliscope blog.

As NLM Deputy Director, Jerry Sheehan shares responsibility with the Director for overall program development, program evaluation, policy formulation, direction and coordination of all Library activities. He has made major contributions to the development and implementation of NIH, HHS, and U.S. government-wide policy related to open science, public access to government-funded information, clinical trials registration, and electronic health records.

Some Insights on the Roles and Uses of Generalist Repositories

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, NIH

Data repositories are a useful way for researchers to both share data and make their data more findable, accessible, interoperable, and reusable (that is, aligned with the FAIR Data Principles).

Generalist repositories can house a vast array of data. This kind of repository does not restrict data by type, format, content, or topic. Over the last year, NIH has been exploring the roles and uses of generalist repositories in our data repository landscape through three activities, described below, that have yielded valuable insights.

A pilot project with a generalist repository

NIH Figshare archive

Last September, I introduced Musings readers to the one-year Figshare pilot project, which was recently completed. Information about the NIH Figshare instance — and the outcomes of the project — is available on the Office of Data Science Strategy’s website. This project gave us an opportunity to uncover how NIH-funded researchers might utilize a generalist repository’s existing features. It also allowed us to test some specific options, such as a direct link to grant information, expert guidance, and metadata improvements.

There are three key takeaways from the project:

  • Generalist repositories are growing. More researchers are depositing data in, and more publications are linking to, generalist repositories.
  • Researchers need more education and guidance on where to publish data and how to effectively describe datasets using detailed metadata.
  • Better metadata enables greater discoverability. Expert metadata review proved to be one of the most impactful and unique features of the pilot instance, as determined through two key metrics: compared with data uploaded to the main Figshare repository by NIH-funded investigators, files in the NIH Figshare instance had titles that were roughly twice as long and more descriptive, and metadata descriptions that were more than three times longer.

The NIH Figshare instance is now an archive, but the data are still discoverable and reusable. Although this specific pilot has concluded, we encourage NIH-funded researchers to use a generalist repository that meets the White House Office of Science and Technology Policy criteria when a domain-specific or institutional repository is not available.

A community workshop on the role of generalist repositories

In February, the Office of Data Science Strategy hosted the NIH Workshop on the Role of Generalist and Institutional Repositories to Enhance Data Discoverability and Reuse, bringing together representatives of generalist and institutional repositories for a day and a half of rich discussion. The conversations centered around the concept of “coopetition,” the importance of people in the broader data ecosystem, and the importance of code. A full workshop summary is available, and our co-chairs and the workshop’s participating generalist repositories recently published a generalist repository comparison chart as one of the outcomes of this event.

We plan to keep engaging with this community to better enable coopetition among repositories while working collaboratively with repositories to ensure that researchers can share data effectively.

An independent assessment of the generalist repository landscape

We completed an independent assessment to understand the generalist repository landscape, discover where we were in tune with the community, and identify our blind spots. Key findings include the following:

  • There is a clear need for the services that generalist repositories provide.
  • Many researchers currently view generalist repository platforms as a place to deposit their own data, rather than a place to find and reuse other people’s data.
  • Repositories and researchers alike are looking to NIH to define its data sharing requirements, so each group knows what is expected of them.
  • The current lack of recognition and rewards for data sharing helps reinforce the focus on publications as the key metric of scientific output and therefore may be a disincentive to data sharing.

The pilot, workshop, and assessment provided us with a deeper understanding of the repository landscape.

We are committed to advancing progress in this important area of the data ecosystem of which we are all a part. We are currently developing ways to continue fostering coopetition among generalist repositories; strategies for increasing engagement with researchers, institutional repositories, and data librarians; and opportunities to better educate the biomedical research community on the value of effective data management and sharing.

The Office of Data Science Strategy will announce specific next steps in the near future. In the meantime, we invite you to share your ideas with us at datascience@nih.gov.

Dr. Gregurick leads the implementation of the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that make up NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.

How NIH Is Using Artificial Intelligence To Improve Operations

Artificial intelligence (AI) is everywhere, from the online marketplace to the laboratory! When you read an article or shop online, the experience is probably supported by AI. And scientists are applying AI methods to find indications of disease, to design experiments, and to make discovery processes more efficient.

The National Institutes of Health (NIH) has been using AI to improve science and health, too, but it’s also using AI in other ways.

Earlier this fall, the White House Office of Science and Technology Policy hosted a summit to highlight ways that the Federal Government uses AI to achieve its mission and improve services to the American people. I was proud to represent NIH and provide examples of how AI is being used to make NIH more effective and efficient in its work.

For example, each year NIH faces the challenge of assigning the more than 80,000 grant applications it receives to the proper review group.

Here’s how the process works now: Applications that address specific funding opportunity announcements are assigned directly by division directors. Then the Integrated Review Groups (clusters of study sections grouped around general scientific areas) assign the applications to the correct division or scientific branch. Applications without an identified liaison are handled by a triage officer. This process takes several weeks and may involve passing an application through multiple staff reviews.

Staff at NIH’s National Institute of General Medical Sciences (NIGMS) creatively addressed this challenge by developing and deploying natural language processing and machine learning to automate the process for their Institute. This approach uses a machine learning algorithm, trained on historical data, to find a relationship between the text (title, abstract, and specific aims) and the scientific research area of an application. The trained algorithm can then determine the most likely scientific area of a new application and automatically assign it a program officer who is a subject matter expert in that area.
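To make the idea concrete, here is a minimal sketch of that kind of text-classification pipeline. NIGMS has not published its code, so the model choice, features, and example data below are illustrative stand-ins rather than the Institute’s actual implementation.

```python
# Illustrative sketch only: the training data, labels, and model below are
# hypothetical stand-ins, not the NIGMS production system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Historical applications: concatenated title + abstract + specific aims,
# labeled with the scientific area to which each was ultimately referred.
historical_text = [
    "Structural basis of ribosome assembly in bacteria ...",
    "Machine learning models of enzyme kinetics ...",
    "Cryo-EM structures of membrane transporters ...",
    "Statistical methods for single-cell RNA-seq ...",
]
historical_area = [
    "Biophysics",
    "Computational Biology",
    "Biophysics",
    "Computational Biology",
]

# TF-IDF text features feed a simple multiclass classifier trained on referral history.
referral_model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
referral_model.fit(historical_text, historical_area)

# A new application is referred to its most likely scientific area; mapping an
# area to a subject-matter-expert program officer would then be a simple lookup.
new_application = "Molecular dynamics simulations of ion channel gating ..."
best_area = referral_model.predict([new_application])[0]
confidence = referral_model.predict_proba([new_application]).max()
print(best_area, round(float(confidence), 2))
```

A production system would be trained on many years of referral history, and low-confidence predictions could be routed back to staff for manual review.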

The new process works impressively well, with 92% of applications referred to the correct scientific division and 84% assigned to the correct program officer, matching the accuracy rate routinely achieved by manual referrals. This change has resulted in substantial time savings, reducing the process from two to three weeks to less than one day. The new approach ensures the efficient and consistent referral of grant applications and liberates program officers from the labor-intensive and monotonous manual referral process, allowing them to focus on higher-value work. It even allows for related institutional knowledge to be retained after staff departures. NIGMS is currently working with the NIH electronic Research Administration (eRA) to incorporate the process into the enterprise database for NIH-wide use.

Now for a second example that’s more pertinent to NLM.

Our PubMed repository receives over 1.2 million new citations each year, and over 2.3 million people conduct about 2.5 million searches using PubMed every day. An average query returns hundreds to thousands of results, presented in reverse chronological order by the date the record was added. Yet our internal process monitoring determined that 80% of people using PubMed do not go beyond the first page of results, a behavior also seen in general web searches. This means that even if a more relevant citation is on page 4 or page 18, the user may never see it.

Zhiyong Lu, PhD, and his team from NLM’s National Center for Biotechnology Information applied machine learning strategies to improve the way PubMed presents search results. Their goals were to increase the effectiveness of PubMed searches by helping users efficiently find the most relevant and high-quality information, and to improve usability and the user experience by focusing on users’ literature search behaviors and needs. Their approach is called the Best Match algorithm, and the technical details can be found in a paper by Fiorini N, Canese K, Starchenko G, et al., PLoS Biol. 2018.

The Best Match algorithm works like this:  In preparation for querying, all articles in the PubMed repository are tagged with key information and metadata, including the publication date, and with an indicator of how often the article has been returned and accessed by previous searches, as part of a model-training process called Learning-to-Rank (L2R). Then, when a user enters a query phrase in the search box on the PubMed website, the phrase is mapped using the PubMed syntax, and the search is launched. In a traditional search, the results are selected based on keyword matching and are presented in reverse chronological order. Through Best Match, the top 500 results—returned via a classic term-weighting algorithm—are re-sorted according to dozens of features of the L2R algorithm, including the past usage of an article, publication date, relevance score, and type of article. At the top of the page, the search results are clearly marked as being sorted by “Best Match.”
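To illustrate the flow, here is a minimal sketch of the re-ranking step, using a handful of the features named above (term-weighting relevance, publication recency, past usage, and article type). The scoring function is a hand-weighted stand-in for the trained L2R model; the production ranker learns from dozens of features and real user behavior (see the Fiorini et al. paper for details).

```python
# Minimal sketch of Best Match-style re-ranking. The features, weights, and
# scoring function are illustrative stand-ins for the trained L2R model.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Candidate:
    pmid: str
    relevance: float   # classic term-weighting score from the first-pass retrieval
    year: int          # publication year
    past_usage: float  # how often the article was returned/accessed by past searches
    is_review: bool    # one example of an article-type feature

def l2r_score(c: Candidate, current_year: int = 2019) -> float:
    """Stand-in for the trained L2R model: combines the same kinds of features
    (relevance, recency, past usage, article type) into a single score."""
    recency = max(0.0, 1.0 - (current_year - c.year) / 50.0)
    return (
        0.6 * c.relevance
        + 0.2 * c.past_usage
        + 0.15 * recency
        + 0.05 * (1.0 if c.is_review else 0.0)
    )

def best_match(candidates: list[Candidate], top_k: int = 500) -> list[Candidate]:
    """Take the top-k term-weighted matches, then re-sort them by the L2R score."""
    first_pass = sorted(candidates, key=lambda c: c.relevance, reverse=True)[:top_k]
    return sorted(first_pass, key=l2r_score, reverse=True)
```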

Figure: The Best Match workflow in PubMed (image by Donald Bliss of NLM). Articles are prepared prior to user searches; then (1) queries are converted to PubMed syntax; (2) initial matching hits are retrieved in reverse chronological order; (3) results are re-sorted according to the L2R algorithm to present the Best Match; and (4) the L2R algorithm is updated based on users’ top choices.

This new approach will become the core of a new implementation of PubMed, due out by the spring of 2020.

In addition to the examples I described above, NIH is exploring other ways to use AI. For example, AI can help determine whether the themes of research projects align with the stated priorities of a specific Institute, and it can provide a powerful tool to accelerate government business practices. Because these AI approaches are new, NIH takes many steps to validate their results and ensure that unanticipated problems do not arise.

In future posts, I look forward to sharing more about how NIH is improving operations through innovative analytics.

Addressing Social Determinants of Health with FHIR Technology

Guest post by Clem McDonald, MD, Chief Health Data Standards Officer at NLM; Jessica Tenenbaum, PhD, Chief Data Officer for North Carolina’s Department of Health and Human Services; and Liz Amos, MLIS, Special Assistant to the Chief Health Data Standards Officer at NLM.

We all know that whether you get an annual flu shot or smoke affects your health. But nonmedical social and economic factors also have a large influence on health. For example, individuals will struggle to control their diabetes if they can’t afford healthy food or are sleeping on the street. Healthy People 2020 describes such circumstances as social determinants of health (SDOH). As our health system shifts to value-based payment models, health care systems are prioritizing outcomes, such as the level of glucose control, rather than how much care is delivered (e.g., the number of visits or tests). To achieve better health outcomes, leading organizations are working to identify and address SDOH needs as well as medical needs.

The North Carolina Department of Health and Human Services (NCDHHS) Healthy Opportunities program identifies four priority domains of non-medical needs that can be detected using the answers to screening questions. Screening for needs in these domains will be a standard operating procedure for all Medicaid beneficiaries as the state transitions its Medicaid program from fee-for-service to managed care. Health care providers will be able to refer individuals to community resources such as food pantries, homeless shelters, transportation services, interpersonal violence counselors, and other services that can address some of these nonmedical needs, and those organizations can then be reimbursed for approved services under Medicaid. A computer-based “closed-loop” referral system will enable the collection of information from social service organizations about the services provided, allowing NCDHHS to facilitate reimbursement, monitor the program, and assess its effectiveness. Electronic systems like the one being used in North Carolina are essential to capturing answers to the SDOH screening questions, triaging individuals to appropriate community resources for intervention, and tracking the effects of those interventions. North Carolina is building a “learning” Department of Health and Human Services, similar to a learning health system, in which data collected through the services provided are used to inform future policy decisions.

The SDOH needs being addressed in North Carolina exist across the country, so there is considerable interest in developing standards-based systems for capturing SDOH data anywhere in the United States without the need for separate development efforts at each stage. A powerful mechanism called Fast Healthcare Interoperability Resources®, or FHIR®, has emerged to enable standardization across a broad spectrum of health care processes. Developed by Health Level Seven International, FHIR is a modern, web-based technology for exchanging health care data that has strong and growing support from various stakeholders in the field of health care, including major electronic health record vendors; the tech industry, including Apple, Microsoft, Google, and Amazon; and federal agencies such as NIH, the Office of the National Coordinator for Health Information Technology, the Centers for Medicare and Medicaid Services, the Food and Drug Administration, and the Agency for Healthcare Research and Quality. NCDHHS is exploring the use of a FHIR-based data-capture tool for collecting SDOH information about nonmedical health needs and delivering the survey results to health care providers who can address the needs identified.

Created in the spirit of collaboration, NLM’s FHIR questionnaire app — an open-source tool that can be used, modified, or incorporated into existing tools by anyone — instantly converts a questionnaire that follows FHIR’s technical specifications into a live web form. It leverages the FHIR standard to collect questionnaire data, and generating a different form is just a matter of feeding the tool a different set of questions. FHIR forms can implement skip logic, the nesting of repeated groups of questions, calculations, validation checks, the repopulation of questions with answers from the individual’s FHIR medical record, and more. Of course, the same tool can also implement many other kinds of forms for capturing health care data, such as surveys that measure patient-reported outcomes. You can search more than 2,000 available questionnaires in NLM’s FHIR questionnaire demo app. Other NLM-developed, open-source FHIR-based tools for managing health care data are available here.
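To show the kind of input the app consumes, here is a minimal FHIR R4 Questionnaire with skip logic, written as a Python dictionary for readability. The questions are invented for illustration and are not NCDHHS’s actual screening items; the follow-up item is enabled only when the screening item is answered “yes,” using the standard enableWhen element.

```python
import json

# Hypothetical example questions (not NCDHHS's screening items): a minimal
# FHIR R4 Questionnaire in which a follow-up question appears only when the
# screening question is answered "yes" (skip logic via enableWhen).
questionnaire = {
    "resourceType": "Questionnaire",
    "title": "Example SDOH screening (illustrative only)",
    "status": "draft",
    "item": [
        {
            "linkId": "food-insecurity",
            "text": "In the past 12 months, did you worry that your food would run out?",
            "type": "boolean",
        },
        {
            "linkId": "food-insecurity-referral",
            "text": "Would you like information about food assistance resources?",
            "type": "boolean",
            # Skip logic: only enabled when the previous answer is true.
            "enableWhen": [
                {
                    "question": "food-insecurity",
                    "operator": "=",
                    "answerBoolean": True,
                }
            ],
        },
    ],
}

print(json.dumps(questionnaire, indent=2))
```

Feeding a resource of this shape to the NLM questionnaire app, or any FHIR-aware form renderer, produces a live web form.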

NLM and NCDHHS have worked together to develop an open-source, FHIR-based implementation of North Carolina’s Healthy Opportunities screening questions (see Figure 1). Anyone with a FHIR-ready server will be able to download the form, enter data, and then route those data to the appropriate health information technology system.
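As a rough sketch of that round trip, the standard FHIR REST interactions look like this; the server base URL, resource id, and answer values below are hypothetical.

```python
import requests

# Hypothetical FHIR server and resource id; any FHIR-ready server exposes
# the same standard REST interactions.
FHIR_BASE = "https://fhir.example.org/r4"

# 1. Download the screening form (a Questionnaire resource).
questionnaire = requests.get(
    f"{FHIR_BASE}/Questionnaire/sdoh-screening-example",
    headers={"Accept": "application/fhir+json"},
).json()

# 2. After the form is filled out, package the answers as a QuestionnaireResponse
#    and route them onward by POSTing to the receiving system's FHIR endpoint.
questionnaire_response = {
    "resourceType": "QuestionnaireResponse",
    "questionnaire": questionnaire.get("url", f"Questionnaire/{questionnaire['id']}"),
    "status": "completed",
    "item": [
        {"linkId": "food-insecurity", "answer": [{"valueBoolean": True}]},
    ],
}
result = requests.post(
    f"{FHIR_BASE}/QuestionnaireResponse",
    json=questionnaire_response,
    headers={"Content-Type": "application/fhir+json"},
)
print(result.status_code)
```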

Let’s get to work screening patients broadly while minimizing clinical documentation burdens through the use of standardized application programming interfaces!

 

Figure 1: North Carolina Department of Health and Human Services (NCDHHS)’s Social Determinants of Health (SDOH) Screening Form as a live FHIR Questionnaire demo.


Clem McDonald, MD

Clem McDonald, MD, is the Chief Health Data Standards Officer at NLM. In this role, he coordinates standards efforts across NLM and NIH, including the FHIR interoperability standard and vocabularies specific to clinical care (LOINC, SNOMED CT, and RxNorm). Dr. McDonald developed one of the nation’s first electronic medical record systems and the first community-wide clinical data repository, the Indiana Network for Patient Care. Dr. McDonald previously served 12 years as Director of the Lister Hill National Center for Biomedical Communications and as scientific director of its intramural research program.

Jessica Tenenbaum, PhD

Jessica Tenenbaum, PhD, is the Chief Data Officer for North Carolina’s Department of Health and Human Services. In this role, Dr. Tenenbaum is responsible for the development and oversight of departmental data governance and strategy to enable data-driven policy for improving the health and well-being of North Carolinians. Dr. Tenenbaum is also an Assistant Professor in Duke University’s Department of Biostatistics and Bioinformatics. Dr. Tenenbaum is a member of the Board of Directors for the American Medical Informatics Association and serves on the Board of Scientific Counselors for NLM.

Liz Amos, MLIS

Liz Amos, MLIS, is Special Assistant to the Chief Health Data Standards Officer at NLM. She is a graduate of the University of Tulsa and the University of Oklahoma.

Enhancing Data Sharing, One Dataset at a Time

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, National Institutes of Health

Vision of the NIH Strategic Plan for Data Science: data that are Findable, Accessible, Interoperable, and Reusable (FAIR).

The National Institutes of Health (NIH) has an ambitious vision for a modernized, integrated biomedical data ecosystem. How we plan to achieve this vision is outlined in the NIH Strategic Plan for Data Science, and the long-term goal is to have NIH-funded data be findable, accessible, interoperable, and reusable (FAIR). To support this goal, we have made enhancing data access and sharing a central theme throughout the strategic plan.

While the topic of data sharing itself merits greater discussion, in this post I’m going to focus on one primary method for sharing data: domain-specific and generalist repositories.

The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.

NIH’s PubMed Central is a solution for storing and sharing datasets directly associated with publications and publication-related supplemental materials (up to 2 GB in size). On the other end of the spectrum, “big” datasets, comprising petabytes of data, are now starting to leverage cloud service providers (CSPs), including through the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. These are still the early days of data sharing through CSPs, and we anticipate that this will be an active area of research.

There are, however, instances in which researchers are unable to find a domain-specific repository applicable to their research project. In these cases, a generalist repository that accepts data regardless of data type or discipline may be a good fit. Biomedical researchers already share data, software code, and other digital research products via many generalist repositories hosted by various institutions—often in collaboration with a library—and recommended by journals, publishers, or funders. While NIH does not have a recommended generalist repository, we are exploring the roles and uses of generalist repositories in our data repository landscape.

NIH Figshare homepage: https://nih.figshare.com

For example, as part of our exploratory strategy, NIH recently launched an NIH Figshare instance, a short-term pilot project with the generalist repository Figshare. This pilot provides NIH-funded researchers with a generalist repository option for up to 100 GB of data per user. The NIH Figshare instance complies with FAIR principles; supports a wide range of data and file types; captures customized metadata; and provides persistent unique identifiers with the ability to track attention, use, and reuse.

NIH Figshare is just one part of our approach to understanding the role of generalist repositories in making biomedical research data more discoverable. We recognize that making data more FAIR is no small task and certainly not one that we can accomplish on our own. Through this pilot project, and other related projects associated with implementing NIH’s strategy for data science, we look forward to working with the biomedical community—researchers, librarians, publishers, and institutions, as well as other funders and stakeholders—to understand the evolving data repository ecosystem and how to best enable useful and usable data sharing.

Together we can strengthen our data repository ecosystem and, ultimately, accelerate data-driven research and discovery. We invite you to join our efforts by sending your ideas and needs to datascience@nih.gov.

Susan Gregurick, PhD

Dr. Gregurick leads the implementation of the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that comprise NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.