The Research Data Alliance (RDA) is a community-driven, interdisciplinary, international organization dedicated to collaboratively building the social and technical infrastructure necessary for wide-scale data sharing and advancing open science initiatives. Just short of five years old, this group gathers twice a year at plenary meetings, the most recent just last week.
These are not big-lecture, hallway-conversation meetings. As I discovered in Berlin last week, they are working meetings in the best sense of the phrase—where the work involves creating and validating the mechanisms and standards for data sharing. That work is done by volunteers from across disciplines—over 7,000 people engaged in small working groups, local activities, and conference-based sessions. These volunteers deliberate and construct standards for data sharing, then establish strategies for testing and endorsing those standards and for gaining community consensus and adoption—including partnering with notable standard-setting bodies such as ISO and IEEE.
Much of the work focuses on making data and data repositories FAIR—Findable, Accessible, Interoperable, and Reusable—which is something I’ve talked a lot about in this blog.
But RDA espouses a broader vision than the approach NLM has taken so far with data. Where we provide public access to full-text articles, some of which link to associated data, RDA advocates for putting all research-generated data in domain-specific, well-curated repositories.
To achieve that vision, RDA members are working to develop the following three key elements:
- a schema to link data to articles,
- a mechanism for citing data extracts, and
- a way to recognize high-quality data repositories.
Right now, a single publisher may have 50 or 60 different ways of linking articles to data. That means the estimated 25,000 publishers and 5,000 repositories that manage data have potentially millions of ways of accomplishing this task. Instituting a standardized schema to link data to articles would bring significant order and discoverability to this overwhelming diversity. That consistency would yield immediate benefits, chief among them making data findable and the links interoperable.
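To make the idea concrete, here is a minimal sketch of what one standardized article-to-data link record might look like. The field names and DOIs are purely illustrative (loosely inspired by the kind of link-exchange schemas RDA working groups have discussed), not an actual RDA specification:

```python
import json

def make_link_record(article_doi, dataset_doi, relationship="References"):
    """Build one article-to-data link record.

    A hypothetical sketch: a real record would follow whatever schema
    the community ultimately standardizes on. The point is that every
    publisher and repository emits the same shape, so links become
    findable and interoperable.
    """
    return {
        "source": {"identifier": article_doi, "type": "literature"},
        "target": {"identifier": dataset_doi, "type": "dataset"},
        "relationship": relationship,
    }

# Example DOIs below are placeholders, not real records.
record = make_link_record("10.1000/xyz123", "10.5061/example.data")
print(json.dumps(record, indent=2))
```

With one agreed shape, a single harvester could collect links from all 25,000 publishers instead of writing custom code for each.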
Efficient data citations will also be a boon to findability. RDA is working on developing dynamic data citations, which would provide persistent identifiers tying data extracts to their repositories and tracking different versions of the data. Machine-created and machine-readable, data citations would enhance rigor and reproducibility in research by ensuring the data generated in support of key findings remains accessible.
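One way to picture a machine-created, machine-readable citation for a data extract: record the repository's persistent identifier, the query that produced the extract, a timestamp, and a checksum of the results, so the exact slice of data can be re-identified even as the repository grows. This is a hypothetical sketch of the idea, not RDA's actual specification:

```python
import hashlib
from datetime import datetime, timezone

def cite_extract(repository_pid, query, result_rows):
    """Build a machine-readable citation for a data extract.

    Illustrative only: the extract is pinned down by the query that
    produced it, the retrieval time, and a checksum of the returned
    rows, all tied to the repository's persistent identifier.
    """
    checksum = hashlib.sha256(
        "\n".join(map(str, result_rows)).encode("utf-8")
    ).hexdigest()
    return {
        "repository": repository_pid,
        "query": query,
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "result_sha256": checksum,
    }

# Placeholder identifier and query, for illustration only.
citation = cite_extract(
    "doi:10.1000/example-repo",
    "SELECT * FROM trials WHERE year = 2017",
    [("trial-001", 2017), ("trial-002", 2017)],
)
```

Because the checksum is computed from the results, anyone re-running the cited query later can verify they retrieved the same data that supported the original findings.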
But linking to and tracking data won’t get us far if the data itself is untrustworthy.
To address that, RDA encourages well-curated repositories, but what exactly does that mean?
Certification provides one way of acknowledging the quality of a repository. RDA doesn’t sponsor a certification mechanism, but it recognizes several, including the CoreTrustSeal program. (For more on data certification, see “A Primer on the Certifications of a Trusted Digital Repository,” by Dawei Lin from the NIH National Institute of Allergy and Infectious Diseases.)
But why does all this matter to NIH and to NLM specifically?
I came to the RDA meeting to explore complementary approaches to what NLM is already doing to curate and assign metadata to data. I was especially looking for guidance on how to handle new data types such as images and environmental exposures.
I got some of that, but I also learned that NLM has much to contribute to RDA’s work. Particularly given our expertise in clinical terminologies and literature languages, we add rich depth to the ways data and other resources can be characterized.
In addition, I learned that we at NLM and NIH face many of the same challenges as our global partners: efficiently managing legacy data while not constraining the future to the problems of the past; fostering the adoption of common approaches and standards when the benefit to the larger scientific community may be greater than the value to the individual investigator; coordinating a voluntary, community-led process that has mission-critical consequences; and creating a permanent home and support organization for the wide range of standards actually needed for data-driven discovery.
Finally, I learned that people participate in the work of RDA because it both draws on their expertise and advances their own scholarly efforts. In other words, it’s mutually beneficial. But after my time with the group last week, I suspect we all get more than we give. For NLM anyway—as we begin to implement our new strategic plan—RDA’s goal of creating a global data ecosystem of best practices, standards, and interoperable data infrastructures is encouraging and something to look forward to.