Guest post by Elizabeth Kittrie, NLM’s Senior Planning and Evaluation Officer.
As scientific research becomes more data-intensive, scientists and their institutions are increasingly faced with complex questions about which data to retain, for how long, and at what cost.
The decision to preserve and archive research data should not be posed as a yes-or-no question. Instead, we should ask, “For how many years should this subset of data be preserved or archived?” (By the way, “forever” is not an acceptable response.)
Answering questions about research data preservation and archiving is neither straightforward nor uniform. Certain types of research data may derive value from their unique qualities or from the costs associated with the original data collection. Other types of research data are relatively easy to collect at low cost, yet once collected, they are rarely reused. Key questions include:
- What is the future value of research data?
- For how long must a dataset be preserved before it should be reviewed for long-term archiving?
- What are the resources necessary to support persistent data storage?
We believe that economic approaches—including forecasting long-term costs, balancing economic considerations with non-monetary factors, and determining the return on public investment from data availability—can help us make preservation and archiving decisions.
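To make the idea of forecasting long-term costs concrete, here is a minimal, purely illustrative sketch of one such economic calculation: the net present cost of storing a dataset over a given horizon. The function name, parameter values, declining-storage-cost assumption, and discount rate are all hypothetical choices for illustration, not figures or methods from the NASEM study.

```python
# Toy model: forecast the cumulative, discounted cost of preserving a dataset.
# All parameters are hypothetical assumptions chosen for illustration.

def preservation_cost(size_tb, years, cost_per_tb_year=100.0,
                      storage_decline=0.15, discount_rate=0.03):
    """Net present cost (in dollars) of storing `size_tb` terabytes for
    `years` years, assuming per-TB storage costs fall by `storage_decline`
    each year and future spending is discounted at `discount_rate`."""
    total = 0.0
    for year in range(years):
        # Annual storage bill, shrinking as hardware gets cheaper.
        annual = size_tb * cost_per_tb_year * (1 - storage_decline) ** year
        # Discount future costs back to present value.
        total += annual / (1 + discount_rate) ** year
    return total

# Under these assumptions, extending the horizon from 10 to 50 years adds
# far less than 5x the cost, because declining storage prices and
# discounting flatten the marginal cost of longer retention.
cost_10 = preservation_cost(100, 10)
cost_50 = preservation_cost(100, 50)
```

Even this toy model shows why “forever” is not a useful answer: the interesting decision is where on the cost curve a given dataset stops earning its keep.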
To that end, NLM has contracted with the National Academies of Sciences, Engineering, and Medicine (NASEM) for a study on forecasting the long-term costs for preserving, archiving, and promoting access to biomedical data. For this study, NASEM will appoint an ad hoc committee that will develop and demonstrate a framework for forecasting these costs and estimating potential benefits to research. In so doing, the committee will examine and evaluate the following:
- Economic factors to be considered when examining the life-cycle cost for data sets (e.g., data acquisition, preservation, and dissemination);
- Cost consequences for various practices in accessioning and de-accessioning data sets;
- Economic factors to be considered in designating data sets as high value;
- Assumptions built into the data collection and/or modeling processes;
- Anticipated technological disruptors and future developments in data science over a 5- to 10-year horizon; and
- Critical factors for successful adoption of data forecasting approaches by research and program management staff.
The committee will provide a consensus report and two case studies illustrating the framework’s application to different biomedical contexts relevant to NLM’s data resources. Relevant life-cycle costs will be delineated, as will any assumptions underlying the models. To the extent practicable, NASEM will identify strategies for communicating results and building acceptance of these models’ applicability.
As part of its information gathering, NASEM will host a two-day public workshop in late June 2019 to generate ideas and approaches for the committee to consider. In the coming months, we will provide further details on the workshop and how you can participate.
As a next step in advancing this study, we are supporting NASEM’s efforts to solicit names of committee members, as well as topics for the committee to consider. If you have suggestions, please contact Michelle Schwalbe, Director of the Board on Mathematical Sciences and Analytics at NASEM.
Elizabeth Kittrie is NLM’s Senior Planning and Evaluation Officer. She previously served as a Senior Advisor to the Associate Director for Data Science at the National Institutes of Health and as Senior Advisor to the Chief Technology Officer of the US Department of Health and Human Services. Prior to joining HHS, she served as the first Associate Director for the Department of Biomedical Informatics at Arizona State University.