Reflections on the Work of the Research Data Alliance

The Research Data Alliance (RDA) is a community-driven, interdisciplinary, international organization dedicated to collaboratively building the social and technical infrastructure necessary for wide-scale data sharing and advancing open science initiatives. Just short of five years old, the group gathers twice a year at plenary meetings, the most recent held last week.

These are no big-lecture, hallway-conversation meetings. As I discovered in Berlin last week, they are working meetings, in the best sense of the phrase—where the work involves creating and validating the mechanisms and standards for data sharing. That work is done by volunteers from across disciplines—over 7,000 people engaged in small work groups, local activities, and conference-based sessions. These volunteers deliberate and construct standards for data sharing, and then establish strategies for testing and endorsing these standards and gaining community consensus and adoption—including partnering with notable standard-setting bodies such as ISO or IEEE.

Much of the work focuses on making data and data repositories FAIR—Findable, Accessible, Interoperable, and Reusable—which is something I’ve talked a lot about in this blog.

But RDA espouses a broader vision than the approach NLM has taken so far with data. Where we provide public access to full-text articles, some of which link to associated data, RDA advocates for putting all research-generated data in domain-specific, well-curated repositories.

To achieve that vision, RDA members are working to develop the following three key elements:

  • a schema to link data to articles,
  • a mechanism for citing data extracts, and
  • a way to recognize high-quality data repositories.

Right now, a single publisher may have 50 or 60 different ways of linking articles to data. That means that the estimated 25,000 publishers and 5,000 repositories that manage data have potentially millions of ways of accomplishing this task. Instituting a standardized schema to link data to articles would bring significant order and discoverability to this overwhelming diversity. That consistency would yield immediate benefits, chief among them making data findable and the links interoperable.
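To make that concrete, here is a minimal sketch of what a standardized article-to-data link record might look like. The field names, values, and JSON serialization are illustrative assumptions, not an RDA or publisher specification.

    import json

    # Hypothetical, minimal article-to-data link record; the field names are
    # illustrative, not an actual standard.
    link_record = {
        "article": {"identifier": "10.1000/example-article", "identifierType": "DOI"},
        "dataset": {"identifier": "10.5061/example-dataset", "identifierType": "DOI"},
        "relationship": "IsSupplementedBy",  # the article is supplemented by the dataset
        "linkProvider": "ExamplePublisher",  # who asserted the link
        "assertedDate": "2018-03-26",
    }

    # One consistent, machine-readable serialization beats millions of
    # publisher- and repository-specific conventions.
    print(json.dumps(link_record, indent=2))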

Efficient data citations will also be a boon to findability. RDA is working on developing dynamic data citations, which would provide persistent identifiers tying data extracts to their repositories and tracking different versions of the data. Machine-created and machine-readable, data citations would enhance rigor and reproducibility in research by ensuring the data generated in support of key findings remains accessible.
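As a rough illustration of the idea (an assumption about form, not the RDA recommendation itself), a machine-readable dynamic data citation might pair the dataset’s persistent identifier with the query and timestamp that define the extract, so the exact versioned subset can be resolved again later:

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class DataCitation:
        """Hypothetical machine-readable citation for a data extract."""
        repository: str       # repository holding the full dataset
        dataset_pid: str      # persistent identifier for the dataset
        query: str            # query that produced the extract
        query_timestamp: str  # when the query ran, which pins the data version
        extract_pid: str      # persistent identifier minted for this extract

    citation = DataCitation(
        repository="Example Data Repository",
        dataset_pid="doi:10.1234/example-dataset",
        query="SELECT * FROM measurements WHERE site = 'A12'",
        query_timestamp="2018-03-20T14:05:00Z",
        extract_pid="doi:10.1234/example-dataset.extract-42",
    )

    # Structured citations let software re-resolve the same versioned extract
    # that supported a paper's key findings.
    print(json.dumps(asdict(citation), indent=2))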

But linking to and tracking data won’t get us far if the data itself is untrustworthy.

To address that, RDA encourages well-curated repositories, but what exactly does that mean?

Certification provides one way of acknowledging the quality of a repository. RDA doesn’t sponsor a certification mechanism, but it recognizes several, including the CoreTrustSeal program. (For more on data certification, see “A Primer on the Certifications of a Trusted Digital Repository,” by Dawei Lin from the NIH National Institute of Allergy and Infectious Diseases.)

But why does all this matter to NIH and to NLM specifically?

I came to the RDA meeting to explore complementary approaches to what NLM is already doing to curate and assign metadata to data. I was especially looking for guidance on how to handle new data types such as images and environmental exposures.

I got some of that, but I also learned that NLM has much to contribute to RDA’s work. Particularly given our expertise in clinical terminologies and the language of the biomedical literature, we add rich depth to the ways data and other resources can be characterized.

In addition, I learned that we at NLM and NIH face many of the same challenges as our global partners:

  • efficiently managing legacy data while not constraining the future to the problems of the past,
  • fostering the adoption of common approaches and standards when the benefit to the larger scientific community may be greater than the value to the individual investigator,
  • coordinating a voluntary, community-led process that has mission-critical consequences, and
  • creating a permanent home and support organization for the wide range of standards actually needed for data-driven discovery.

Finally, I learned that people participate in the work of RDA because it both draws on their expertise and advances their own scholarly efforts. In other words, it’s mutually beneficial. But after my time with the group last week, I suspect we all get more than we give. For NLM anyway—as we begin to implement our new strategic plan—RDA’s goal of creating a global data ecosystem of best practices, standards, and interoperable data infrastructures is encouraging and something to look forward to.

Sharing Small Data Files for More Analysis Power

Guest post by Ashley Hintz, curator for NLM NCBI’s Sequence Read Archive.

Imagine the rows upon rows of books in the stacks of a research library. Then imagine how much information is stored in those books.

An organism’s entire genome is the equivalent of that large library with its multiple floors of loaded shelves. It is densely packed with information.

Just as reference librarians specialize in locating specific information within the library, bioinformaticians specialize in working through, analyzing, and understanding genomes.

And just as librarians determine how best to search the online catalog to locate a specific item within the library, bioinformaticians write computer scripts to filter the massive amount of genomic data and identify the parts most relevant to particular research questions. Such filtering creates smaller data files that are easier to analyze, visualize, and share with colleagues.

Unfortunately, many researchers don’t start with the smaller, filtered genomic data sets.

These researchers are stuck believing they must always begin their analysis from the full genome, even though most of the genome’s elements have nothing to do with their research questions.

The belief that correct analysis requires the original raw sequence data means that large files (sometimes terabytes) get moved around on a regular basis, or, given their size, never move at all, which limits access to them.

It’s the equivalent of checking out the whole library when a couple books will answer your questions.

Genomes can be filtered by SNPs (single nucleotide polymorphisms) from a VCF (variant call format) file, by collections of genes, or by read counts for gene expression data. All of these smaller data files have practical use cases and deliver what researchers are commonly interested in.
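As a sketch of that kind of filtering script (the file names and the GENE=&lt;name&gt; annotation tag are assumptions, and a production pipeline would more likely use a dedicated tool such as pysam or bcftools), a few lines of Python can pull just the variants in a handful of genes out of a VCF file:

    import gzip

    # Hypothetical inputs: an annotated VCF (possibly gzipped) and a small gene panel.
    VCF_IN = "sample.annotated.vcf.gz"
    VCF_OUT = "sample.panel_variants.vcf"
    GENE_PANEL = {"BRCA1", "BRCA2", "TP53"}  # illustrative list of genes of interest

    def open_vcf(path):
        """Open a VCF file whether or not it is gzip-compressed."""
        return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

    with open_vcf(VCF_IN) as vcf, open(VCF_OUT, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)  # keep all header lines so the output is a valid VCF
                continue
            info = line.rstrip("\n").split("\t")[7]  # the INFO column of a VCF record
            # Keep the record if its annotation names a gene on the panel.
            # (Annotation formats vary; this assumes a simple GENE=<name> tag.)
            if any("GENE=" + gene in info for gene in GENE_PANEL):
                out.write(line)

The result is a file small enough to email or drop into shared storage, while the terabytes of raw reads stay put in the archive.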

Current technology also makes sharing these filtered data files incredibly easy. Given their small size (mere gigabytes), it’s more on the scale of sharing pictures of your children or pets.

Sounds good, but what can one accomplish with these filtered genome datasets?

One example: perinatal screening for genes and/or SNPs that can potentially cause disease. Once the entire genome or exome is sequenced, the doctor can examine a VCF file of SNPs, filter that file further using data about the family’s medical history or the infant’s clinical symptoms, and then easily transfer the data files to a specialist for a second opinion. In these cases, the smaller data file becomes a powerful diagnostic tool.

A second example: gene expression data used in research. Instead of each scientist recalculating expression values (such as FPKMs) from the raw reads, one researcher can share the smaller analysis files with others. The smaller files can then be used to identify and visualize meaningful differences in expression between genes or samples. This sharing is not only more efficient; it also facilitates the collaboration necessary to tackle such significant and complex medical issues as obesity, Alzheimer’s disease, and autoimmune diseases, all of which show differences in the expression levels of certain genes.
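For instance (the file name, column names, and pandas-based approach below are assumptions for illustration), a collaborator could load a shared FPKM table and rank genes by expression difference without ever touching the raw sequence data:

    import numpy as np
    import pandas as pd

    # Hypothetical shared analysis file: one row per gene, one FPKM column per
    # sample group; a few megabytes instead of terabytes of raw reads.
    fpkm = pd.read_csv("shared_fpkm_table.csv", index_col="gene")

    # Log2 fold change between two illustrative conditions.
    pseudocount = 1.0  # avoids division by zero for unexpressed genes
    log2_fc = np.log2((fpkm["disease"] + pseudocount) /
                      (fpkm["control"] + pseudocount))

    # Genes with the largest expression differences, ready to plot or to pass
    # along to a collaborator for follow-up.
    print(log2_fc.sort_values(ascending=False).head(10))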

These examples just scratch the surface of the power behind filtering genome data for data analysis, visualization, and sharing.

With genome sequencing growing more affordable, the amount of sequence data is expected to grow more rapidly, so we need to change the mindset that we must start with the whole library to see results. Think “focused” instead of “complete,” or, as the librarians like to say, “precision over recall.”

It’s one more example of where less, most definitely, can be more.

The Future of Health and Health Care

I want to say one word to you. Just one word.

In the movie The Graduate, Benjamin receives one word, whispered in hushed tones, as guidance to a successful future: “plastics.” Today, the National Library of Medicine, and the NIH as a whole, would whisper “data.”

The future of health and health care rests on data—genomic data, environmental sensor-generated data, electronic health record data, patient-generated data, research-collected data.

Why is data worth our attention now? Because data generated in one research project could be analyzed by others and help grow knowledge more quickly.

The data originating from research projects is becoming as important as the answers those research projects are providing. Various kinds of data originate from research, including genomic assays, responses to surveys, and environmental assessments of air quality and temperature. Making sure these data are effectively used in the original study is the responsibility of the investigators. But who will make sure that relevant parts of these very complex and expensive-to-generate data will remain available for use by other investigators? And maybe even more important, who will pay for making those data discoverable, secure, available, and actionable?

We believe the NLM must play a key role in preserving data generated in the course of research, whether conducted by professional scientists or citizen scientists. We know how to purposefully create collections of information and organize them for viewing and use by the public. We can extend this skill set to the curation of research data. We also have the utilities in place to protect the data by making sure only those individuals with permission to access data can actually do so.

We have much to learn along the way. Handling data is not straightforward, and the analytical methods that will help us learn the most from data await future development. But we have the foundation on which to build, the knowledge to get us going, and a tradition of service-inspired research that enables us to learn as we go.

Over the next few months I will outline NLM’s plan to become what the NIH Advisory Committee to the Director (ACD) report recommended—the “epicenter of data science for the NIH.” I look forward to your comments.