The data commons is emerging as a key component to support data science and data-driven research.
The term “data commons” can refer to both the technological platform for storing and manipulating shareable data sets and the set of principles, governance strategies, and utilities that make use of those data sets possible.
Over the last five years, the scientific community has grown to embrace the data commons idea much faster than we’ve been able to agree on how to set up, govern, and fund these data commons. As a result, we’re seeing confusion and duplication in certain quarters. For example, we already have several data commons across NIH, including the Cancer Genome Data Commons and the National Heart, Lung, and Blood Institute’s data commons, along with several others from the NIH Data Commons Pilot Program.
As you can imagine, many questions remain unresolved. Should there be one gigantic data commons, encompassing all the data in the world? (No.) Should there be country-specific data commons to help ensure they are established with access, use, and storage principles consistent with each country’s laws and regulations? Should funding authorities have the authority to dictate a common location for storing the data? If so, will these authorities then pay for storage in perpetuity? How long should we retain data? And who decides?
Obviously, we’re not going to solve these issues—and the myriad others associated with data storage and access—in a single blog post. But I do want to affirm the National Library of Medicine’s support for the emerging NIH Strategic Plan for Data Science (PDF). NLM is ready to contribute its experience collecting and managing scientific data and literature to the ongoing discussions of how a data commons could be shaped to best support biomedical discovery.
I envision a world where data-driven research is supported by a variety of data commons, enabled by knowable rules of engagement and governed by a set of key principles. Together, these newly emerging data commons will identify and shape best practices, drawn from current data stewards, active data scientists, and the larger public community.
What might these practices be? And what overall makes a data commons work?
Here are a few of my thoughts:
- A data commons should provide a safe and secure physical space for housing data.
- A data commons should include tools, models, and visualization routines that allow interrogation of the data.
- A data commons should make it easy to locate the data stored within it.
- No data commons is an island. Each data commons should be designed to support discovering and linking to other relevant data sets, whether those data sets are held internally or located in another data commons.
- Contributors to the data commons should be able to apply standardized metadata to their data sets.
- A data commons should manage permissions and control access so that it maintains the access rules and conditions under which the data were originally collected.
- A data commons should handle identity and access management in a way that avoids the challenge of continuous and arduous authentication.
- Management issues, such as forecasting the cost of data storage and establishing the time horizon for sunsetting data sets, should be informed by assessments of the data’s present and future value to society.
- All those involved—from data depositors to those who oversee and manage the data commons—should employ the principles of risk trade-offs, balancing the anticipated scientific worth of data sets against the possible loss of valuable data sets.
- Effectively managing legacy data sets provides a needed but incomplete perspective on designing the data commons of the future.
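To make the access-control principle above a little more concrete, here is a minimal sketch of how a commons might check a request against the conditions under which a data set was collected. Everything in it is an assumption for illustration: the `DataSet` and `AccessRequest` types, the use-category labels, and the rule that a general-research label covers any intended use are hypothetical, not a description of how any existing data commons works.

```python
from dataclasses import dataclass, field

# Hypothetical label meaning the original consent allows any research use.
GENERAL_USE = "general-research"


@dataclass(frozen=True)
class DataSet:
    name: str
    # Uses permitted by the conditions under which the data were collected.
    # The category labels are illustrative assumptions.
    permitted_uses: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    intended_use: str


def is_permitted(dataset: DataSet, request: AccessRequest) -> bool:
    """Allow a request only if the original collection conditions cover it."""
    if GENERAL_USE in dataset.permitted_uses:
        return True
    return request.intended_use in dataset.permitted_uses


if __name__ == "__main__":
    restricted = DataSet("study-a", frozenset({"alzheimers-research"}))
    print(is_permitted(restricted, AccessRequest("r1", "alzheimers-research")))
    print(is_permitted(restricted, AccessRequest("r1", "cardiology-research")))
```

The point of the sketch is only that the permitted uses travel with the data set, so the commons can enforce them wherever the data are requested.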
Of course, we’ll need many other principles and ideas to design and establish a robust data commons that can accelerate data-driven discovery in the ways we need and imagine. Please share your thoughts. After all, a commons serves as a meeting place for many perspectives and benefits when those in the community actively engage with those ideas.
One thought on “What makes a data commons work?”
Here are some of my thoughts (they apply mostly to human study data sharing):
1. For data that fit into laptop memory (under 8 GB), the tools are not that important. The ability to download the data is most important.
2. FAIR principles are important (especially re-usable data). PDF documents can be hard to re-use. CSV and .html files are much better.
3. General data re-use frameworks are most important (akin to dbGaP general data re-use). For example, try not to impose restrictions such as allowing Alzheimer study data to be re-used only in Alzheimer research.
4. Track the number of downloads for each study in the commons (those with 0 downloads can be sunsetted).
5. Link data in the commons to a clinical study registry (clinicaltrials.gov).
6. Declare the data sharing plan at the time of registering a study in clinicaltrials.gov (clearly state the embargo length; choose short embargo periods): https://www.nejm.org/doi/full/10.1056/NEJMe1705439
7. If possible, don’t require IRB approval for each data re-use (when a re-using researcher wants to download the data); consider making it part of the general agreement for using the platform. Example: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/collection.cgi?study_id=phs000688.v1.p1
8. APIs can be complex to navigate. Flat-file download may seem old-fashioned but can be superior to 10k calls to an API.
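As a rough illustration of point 4, the download tracking could be as simple as counting entries in a download log and flagging studies that never appear in it. This is only a sketch under stated assumptions: the study identifiers and the shape of the log (one study ID per download event) are hypothetical, and a real commons would presumably weigh other factors before sunsetting anything.

```python
from collections import Counter


def sunset_candidates(study_ids, download_log):
    """Return the studies that have zero recorded downloads.

    study_ids: all studies held in the commons (hypothetical identifiers).
    download_log: one study ID per download event (assumed log format).
    """
    counts = Counter(download_log)
    # Counter returns 0 for studies that never appear in the log.
    return [study for study in study_ids if counts[study] == 0]


if __name__ == "__main__":
    studies = ["study-a", "study-b", "study-c"]
    log = ["study-a", "study-a", "study-c"]
    print(sunset_candidates(studies, log))
```

A count of zero is treated here as a flag for review, not as an automatic deletion rule.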