Guest post by Melissa Haendel, PhD, a leader of and advocate for open science initiatives.
The increasing volume and variety of biomedical data have created new opportunities to integrate data for novel analytics and discovery. Despite a number of clinical success stories that rely on data integration (e.g., rare disease diagnostics, cancer therapeutic discovery, drug repurposing), within the academic research community, data reuse is not typically promoted. In fact, data reuse is often considered “not innovative” in funding proposals and has even come under attack. (See the now infamous “research parasites” editorial in The New England Journal of Medicine.)
The FAIR data principles—Findable, Accessible, Interoperable, and Reusable—are a terrific set of goals for all of us to strive for in our data sharing, but they detail little about how to realize effective data reuse. If we are to grow innovation from our collective data resources, we must look to pioneers in data harmonization for insight into the specific advantages and challenges of data reuse at scale. Current data-licensing practices for most public data resources severely hamper data reuse, especially at scale. Integrative platforms such as the Monarch Initiative, the NCATS Biomedical Data Translator, the Gabriella Miller Kids First Data Resource Portal, and myriad other cloud data platforms will be able to accelerate scientific progress more effectively if licensing issues can be resolved. As a member of these various consortia, I want to facilitate the legal use and reuse of increasingly interconnected, derived, and reprocessed data. The community has previously raised this concern in a letter to NIH.
How reusable are most data resources? In our recently published manuscript, we created a rubric for evaluating the reusability of a data resource from the licensing standpoint. We applied this rubric to more than 50 biomedical data and knowledge resources. These assessments and the evaluation platform are openly available at the (Re)usable Data Project (RDP). Each resource was scored on a scale of zero to five stars on the following measures:
- findability and type of licensing terms
- scope and completeness of the licensing
- ability to access the data in a reasonable way
- restrictions on how the data may be reused, and
- restrictions on who may reuse the data.
We found that 57% of the resources scored three stars or fewer, indicating that license terms may significantly impede the use, reuse, and redistribution of the data.
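The RDP's actual scoring criteria are more detailed than this summary, but the five-measure, five-star rubric could be sketched roughly as follows (all field names, weights, and inputs here are hypothetical, for illustration only):

```python
from dataclasses import dataclass

# Hypothetical sketch of a five-measure licensing rubric: each measure
# contributes up to one star toward a 0-5 star total. The real RDP
# criteria are more nuanced; these names and values are illustrative.
@dataclass
class LicenseAssessment:
    findable_terms: float      # 0.0-1.0: findability and type of license terms
    scope_complete: float      # 0.0-1.0: scope and completeness of the licensing
    data_accessible: float     # 0.0-1.0: ability to access the data reasonably
    reuse_unrestricted: float  # 0.0-1.0: few restrictions on how data is reused
    users_unrestricted: float  # 0.0-1.0: few restrictions on who may reuse it

    def stars(self) -> float:
        total = (self.findable_terms + self.scope_complete
                 + self.data_accessible + self.reuse_unrestricted
                 + self.users_unrestricted)
        return round(total * 2) / 2  # report in half-star increments

# A resource with strong findability and access but a restrictive,
# partially scoped license lands at three stars under this toy scheme.
resource = LicenseAssessment(1.0, 0.5, 1.0, 0.5, 0.0)
print(resource.stars())  # 3.0
```

Under a scheme like this, the 57% of resources at three stars or fewer are those losing two or more full stars across the five measures.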
Custom licenses constituted the largest single class of licenses found in these data resources. This suggests the resource providers either did not know about standard licenses or believed the standard licenses did not meet their needs. Moreover, while the majority of custom licenses were restrictive, just over two-thirds of the standard licenses were permissive, leading us to wonder whether some needs and intentions are not being met by the existing set of standard permissive licenses. In addition, about 15% of resources had either missing or inconsistent licensing. Such ambiguity and lack of clear intent require clarification, and possibly legal counsel, before the data can be reused with confidence.
Putting this all together, a majority of resources would not meet basic criteria for frictionless legal use in downstream data integration and redistribution, despite the fact that most of these resources are publicly funded, which should mean the content is freely available for reuse by the public.
If we in the United States have a hard time understanding how we may reuse data given these legal restrictions, we must consider the rest of the world—which presumably we aim to serve—and how hard it would be for anyone in another country to navigate this legalese. I hope the RDP’s findings will encourage the worldwide community to work together to improve licensing practices to facilitate reusable data resources for all.
Given what I have learned from the RDP and a wealth of experience in dealing with these issues, I recommend the following actions:
- Funding agencies and publishers should ensure that all publicly funded databases and knowledge bases are evaluated against licensing criteria (whether the RDP’s or something similar).
- Database providers should use these criteria to evaluate their resources from the perspective of a downstream data user and update their licensing terms, if appropriate.
- Downstream data re-users should provide clear source attribution and should always confirm that redistribution is legal. Very often it is legal to use the data but not to redistribute it, and some common reuses are outright prohibited by the license terms.
- Database providers should guide users on how to cite the resource as a whole, as individual records, or as portions of the content when mashed up in other contexts (which can include schemas, ontologies, and other non-data products). Where relevant, providers should follow best practices declared by a community, for example the Open Biological Ontologies citation policy, which supports using native object identifiers rather than creating new digital objects.
- Data re-users should follow best practices in identifier provisioning and reference within the reused data so it is clear to downstream users what the license actually applies to.
I believe that, to be useful and sustainable, data repositories and curated knowledge bases need to clearly credit their sources and specify the terms of reuse and redistribution. Unfortunately, these resources are currently and independently making incompatible choices about how to license their data. The reasons are manifold, but they often include a need for sustainable revenue that runs counter to integrative and innovative data science.
Based on the productive discussions my collaborators and I have had with data resource providers, I propose the community work together to develop a “data trust.” In this model, database resource providers could join a collective bargaining organization (perhaps organized as a nonprofit), through which they could make their data available under compatible licensing terms. The aggregate data sources would be free and redistributable for research purposes, but they could also have commercial use terms to support research sustainability. Such a model could leverage value- or use-based revenue to incentivize resource evolution and innovation in support of emerging needs and new technologies, and would be governed by the constituent member organizations.
Melissa Haendel, PhD, leads numerous local, national, and global open science initiatives focused on semantic data integration and disease mechanism discovery and diagnosis, namely, the Monarch Initiative, the Global Alliance for Genomics and Health (GA4GH), the National Center for Data to Health (CD2H), and the NCATS Biomedical Data Translator.