Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, National Institutes of Health
The National Institutes of Health (NIH) has an ambitious vision for a modernized, integrated biomedical data ecosystem. How we plan to achieve this vision is outlined in the NIH Strategic Plan for Data Science, and the long-term goal is to have NIH-funded data be findable, accessible, interoperable, and reusable (FAIR). To support this goal, we have made enhancing data access and sharing a central theme throughout the strategic plan.
While the topic of data sharing itself merits greater discussion, in this post I’m going to focus on one primary method for sharing data: domain-specific and generalist repositories.
The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.
NIH’s PubMed Central is a solution for storing and sharing datasets directly associated with publications and publication-related supplemental materials (up to 2 GB in size). On the other end of the spectrum, researchers with “big” datasets, comprising petabytes of data, are now starting to leverage cloud service providers (CSPs), including through the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. These are still the early days of data sharing through CSPs, and we anticipate that this will be an active area of research.
There are, however, instances in which researchers are unable to find a domain-specific repository applicable to their research project. In these cases, a generalist repository that accepts data regardless of data type or discipline may be a good fit. Biomedical researchers already share data, software code, and other digital research products via many generalist repositories hosted by various institutions—often in collaboration with a library—and recommended by journals, publishers, or funders. While NIH does not have a recommended generalist repository, we are exploring the roles and uses of generalist repositories in our data repository landscape.
For example, as part of our exploratory strategy, NIH recently launched an NIH Figshare instance, a short-term pilot project with the generalist repository Figshare. This pilot provides NIH-funded researchers with a generalist repository option for up to 100 GB of data per user. The NIH Figshare instance complies with FAIR principles; supports a wide range of data and file types; captures customized metadata; and provides persistent unique identifiers with the ability to track attention, use, and reuse.
NIH Figshare is just one part of our approach to understanding the role of generalist repositories in making biomedical research data more discoverable. We recognize that making data more FAIR is no small task and certainly not one that we can accomplish on our own. Through this pilot project, and other related projects associated with implementing NIH’s strategy for data science, we look forward to working with the biomedical community—researchers, librarians, publishers, and institutions, as well as other funders and stakeholders—to understand the evolving data repository ecosystem and how to best enable useful and usable data sharing.
Together we can strengthen our data repository ecosystem and, ultimately, accelerate data-driven research and discovery. We invite you to join our efforts by sending your ideas and needs to email@example.com.
Dr. Gregurick leads the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that make up NIH. She has substantial expertise in computational biology, high-performance computing, and bioinformatics.