Reflections on the Work of the Research Data Alliance

The Research Data Alliance (RDA) is a community-driven, interdisciplinary, international organization dedicated to collaboratively building the social and technical infrastructure necessary for wide-scale data sharing and advancing open science initiatives. Just short of five years old, this group gathers twice a year at plenary meetings, the most recent just last week.

These are no big-lecture, hallway-conversation meetings. As I discovered in Berlin last week, they are working meetings, in the best sense of the phrase—where the work involves creating and validating the mechanisms and standards for data sharing. That work is done by volunteers from across disciplines—over 7,000 people engaged in small work groups, local activities, and conference-based sessions. These volunteers deliberate and construct standards for data sharing, and then establish strategies for testing and endorsing these standards and gaining community consensus and adoption—including partnering with notable standard-setting bodies such as ISO or IEEE.

Much of the work focuses on making data and data repositories FAIR—Findable, Accessible, Interoperable, and Reusable—which is something I’ve talked a lot about in this blog.

But RDA espouses a broader vision than the approach NLM has taken so far with data. Where we provide public access to full-text articles, some of which link to associated data, RDA advocates for putting all research-generated data in domain-specific, well-curated repositories.

To achieve that vision, RDA members are working to develop the following three key elements:

  • a schema to link data to articles,
  • a mechanism for citing data extracts, and
  • a way to recognize high-quality data repositories.

Right now, a single publisher may have 50 or 60 different ways of linking articles to data. That means that the estimated 25,000 publishers and 5,000 repositories that manage data have potentially millions of ways of accomplishing this task. Instituting a standardized schema to link data to articles would bring significant order and discoverability to this overwhelming diversity. That consistency would yield immediate benefits, chief among them making data findable and the links interoperable.
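
To make the idea concrete, here is a rough sketch, in Python, of what a single standardized article-to-dataset link record might look like once such a schema exists. The field names, relation term, and identifiers are purely illustrative assumptions on my part; they are not an RDA, NLM, or publisher specification.

```python
# Illustrative sketch of a standardized article-to-dataset link record.
# Field names and example identifiers are hypothetical, not an RDA/NLM standard.
from dataclasses import dataclass, asdict
import json


@dataclass
class DataLink:
    article_id: str      # persistent identifier of the article (e.g., a DOI)
    dataset_id: str      # persistent identifier of the linked dataset
    relation: str        # how the two objects relate, drawn from a small controlled list
    repository: str      # repository holding the dataset
    link_provider: str   # who asserted the link (publisher, repository, or author)


link = DataLink(
    article_id="doi:10.1000/example-article",
    dataset_id="doi:10.5061/example-dataset",
    relation="IsSupplementedBy",
    repository="Example Data Repository",
    link_provider="Example Publisher",
)

# A single, predictable serialization is what would make links findable and
# interoperable across thousands of publishers and repositories.
print(json.dumps(asdict(link), indent=2))
```

The point of the sketch is not the particular fields but the uniformity: one record shape that every publisher and repository could emit and consume.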

Efficient data citations will also be a boon to findability. RDA is working on developing dynamic data citations, which would provide persistent identifiers tying data extracts to their repositories and tracking different versions of the data. Machine-created and machine-readable, data citations would enhance rigor and reproducibility in research by ensuring the data generated in support of key findings remains accessible.
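
As a hedged illustration rather than a specification, here is a small Python sketch of the kind of machine-readable record a dynamic data citation might carry: a persistent identifier for the dataset, the query that defines the extract, the time it was run, and a fingerprint of the result so the exact subset can be verified later. All names and fields are hypothetical.

```python
# Hypothetical sketch of a dynamic data citation for a data extract.
# The idea: record the dataset's persistent identifier, the query that defines
# the extract, when it was run, and a checksum of the result, so the exact
# subset can be re-identified even as the dataset continues to evolve.
import hashlib
import json
from datetime import datetime, timezone


def cite_extract(dataset_pid: str, query: str, result_rows: list) -> dict:
    fingerprint = hashlib.sha256("\n".join(result_rows).encode("utf-8")).hexdigest()
    return {
        "dataset_pid": dataset_pid,                            # persistent identifier of the source dataset
        "query": query,                                        # query defining the extract
        "queried_at": datetime.now(timezone.utc).isoformat(),  # timestamp fixing the version cited
        "result_sha256": fingerprint,                          # checksum for verifying the extract later
    }


citation = cite_extract(
    dataset_pid="doi:10.5061/example-dataset",
    query="SELECT * FROM measurements WHERE year = 2017",
    result_rows=["sample-1,42.0", "sample-2,17.3"],
)
print(json.dumps(citation, indent=2))
```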

But linking to and tracking data won’t get us far if the data itself is untrustworthy.

To address that, RDA encourages well-curated repositories, but what exactly does that mean?

Certification provides one way of acknowledging the quality of a repository. RDA doesn’t sponsor a certification mechanism, but it recognizes several, including the CoreTrustSeal program.  (For more on data certification, see “A Primer on the Certifications of a Trusted Digital Repository,” by Dawei Lin from the NIH National Institute of Allergy and Infectious Diseases.)

But why does all this matter to NIH and to NLM specifically?

I came to the RDA meeting to explore complementary approaches to what NLM is already doing to curate and assign metadata to data. I was especially looking for guidance on how to handle new data types such as images and environmental exposures.

I got some of that, but I also learned that NLM has much to contribute to RDA’s work. Particularly given our expertise in clinical terminologies and literature languages, we add rich depth to the ways data and other resources can be characterized.

In addition, I learned that we at NLM and NIH face many of the same challenges as our global partners: efficiently managing legacy data while not constraining the future to the problems of the past; fostering the adoption of common approaches and standards when the benefit to the larger scientific community may be greater than the value to the individual investigator; coordinating a voluntary, community-led process that has mission-critical consequences; and creating a permanent home and support organization for the wide range of standards actually needed for data-driven discovery.

Finally, I learned that people participate in the work of RDA because it both draws on their expertise and advances their own scholarly efforts. In other words, it’s mutually beneficial. But after my time with the group last week, I suspect we all get more than we give. For NLM anyway—as we begin to implement our new strategic plan—RDA’s goal of creating a global data ecosystem of best practices, standards, and interoperable data infrastructures is encouraging and something to look forward to.

Next-Generation Data Science Research Challenges

NIH-funded research is rapidly becoming more and more data-driven. This is true whether that research is intramural or extramural or whether it is focused on solving concrete problems or advancing methodologies for specific domains.

Right now, NLM’s role in this data-driven research centers on developing scalable, sustainable, and generalizable methods for making biomedical data FAIR: Findable, Accessible, Interoperable, and Reusable.

Toward this end, NLM—on behalf of the NIH—released last fall a Request for Information on Next-Generation Data Science Challenges in Health and Biomedicine. We sought community input on data science research initiatives that could address the key challenges researchers, clinicians, administrators, and others currently face. We invited suggestions for new data science research in six areas:

  • Data-driven Discovery
  • Data-driven Health Improvement
  • Advanced Data Management
  • Intelligent and Learning Systems for Health
  • Workforce Development and Diversity
  • New Stakeholder Partnerships

Fifty-three responses provided more than 180 pages of ideas and suggestions.

The topic “Data-driven Discovery” prompted input focused on developing methods and tools to help researchers derive insights from data. These suggestions fell into a number of areas particularly relevant to NLM, including help with natural language processing; predictive analytics to help generate hypotheses from hidden patterns; ways to extract and formalize scientific claims and causal statements from publications; and improved ontologies.

Ideas related to improving health through data recommended developing algorithms tied to patient similarity to drive comparative effectiveness research; nuanced characterizations of phenotypes (including severity, degree, and certainty); and strategies to address bias in health records used for research purposes.

Suggestions concerning managing data revealed a need to better capture and curate that data. These included smoothly integrating personal data from mobile devices into clinical workflows; automatically assigning standardized metadata to existing data sets and digital files; sharing open source analytic methods; and developing technological platforms to help scientists store and analyze data.
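
As a toy illustration of the "automatically assigning standardized metadata" idea, the short Python sketch below derives a few basic descriptors (title, format, size, checksum) for an existing file. The field names are illustrative placeholders of my own, not any particular metadata standard, and a real pipeline would add richer, domain-specific descriptors and curator review.

```python
# Toy sketch: derive a few basic metadata fields for an existing digital file.
# Field names are illustrative; a real effort would target an agreed metadata
# standard and layer in domain-specific descriptors.
import hashlib
import mimetypes
from pathlib import Path


def basic_metadata(path: str) -> dict:
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    mime, _ = mimetypes.guess_type(p.name)
    return {
        "title": p.stem,                       # crude default; a curator would refine this
        "format": mime or "application/octet-stream",
        "size_bytes": p.stat().st_size,
        "checksum_sha256": digest,
    }


# Example (assumes the file exists):
# print(basic_metadata("study_results.csv"))
```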

Ideas for intelligent learning systems ranged widely, from brain science research focused on learning and retention, to approaches for engaging users with health data, to building flexible learning modules.

Many contributors recognized the need to develop a data-skilled workforce, and their suggestions extended beyond simply increasing the number of data scientists. They called for reaching out to high school and undergrad students to equip them earlier with the foundational skills and education that would make training in data science interesting and feasible; creating core informatics and data science skills for all researchers; and infusing the PhD in health informatics with required coursework in computer science and statistics, along with health and biomedicine.

Suggestions for stakeholder partnerships included the names of specific associations, federal agencies, and companies, as well as a shout-out for working more closely with those citizen scientists interested in taking advantage of the growing supply of publicly available health data.

Clearly the research challenges of the future will need strong investments from across NIH.

The Library’s early contributions will be three-fold:

  1. Serve as an honest broker to create a trans-NIH statement of the data science and informatics skills essential for all federally supported trainees (in NIH-funded research training programs located at universities as well as career grants).
  2. Stimulate research in advanced curation and information-integration methods.
  3. Accelerate the development of scalable, reusable, and generalizable visualization tools and analytical approaches.

What else do you think belongs in our portfolio? Chime in below.

Now is your chance to shape the next-generation research agenda for data science.

Request a summary of the responses to the RFI referenced in this post.

Dr. Valerie Florance, co-author of this post, serves as Director of the NLM Division of Extramural Programs. She also coordinates NLM’s informatics training programs.

The Research Ecosystem

On the future of scientific communication

This week I will be part of the research ecosystem panel at the National Academy of Sciences’ Journal Summit. The theme of the summit is “The Evolving Ecosystem of Scientific Publishing.”

The summit encourages audience discussion and debate, so I’m hoping my fellow panelists and I can elicit (or even incite!) lively and fruitful discussions among scientific editors, nonprofit publishers, researchers, funders, academic directors, librarians, and IT specialists. It will be great fun, I am sure.

Great fun and serious business.

As NLM’s Director, it’s my job to think about the Library’s role in fostering scientific communication. But what does “scientific communication” actually entail? Is it the same as the research literature?

Many think so, and, of course, NLM is well-known for providing access to the research literature through PubMed and PubMed Central—resources we build consciously and intentionally to ensure the included publications meet key criteria and the content is reliably available, whether online or in print.

But scientific communication is rapidly expanding beyond journal articles. In our corner of the world, for example, PubMed Central has been accepting small data files (less than 2 GB) along with submitted manuscripts since last October.

But I believe the realm of scientific communication will expand even further.

I see on the horizon an era of scientific communication influenced by the principles of open science, which support sharing not only the answer to a research hypothesis but also the products and processes used to get to that answer.

Two key practices will help usher in this new era:

  1. Communicating early and often. For example, when investigators can upload a range of research elements, including data collection instruments, analytical plans, and human subjects agreements, interim products of a research process become available to others long before the final article is in place.
  2. Sharing all components of the research process, not simply summative reports. Last fall I suggested a library of models, properly documented and vetted, that would allow researchers to apply existing, trustworthy models to their data. Data visualizations, source code, and videos might also prove useful. And you might have other ideas. (Please comment below and tell us about them.)

Such sharing of tools, products, and processes will save valuable time and money while also enhancing the rigor and reproducibility of the research itself by opening for examination all the procedural and methodological details. It also promises to speed innovation and knowledge transfer, which, you might say, are two of the key reasons for scientific communication in the first place.

But we still have so much to learn and to discuss. And, of course, NLM can’t shape the future of scientific communication alone.

That’s what makes this week’s Journal Summit so exciting. It’ll bring together many of the stakeholders in the research process to brainstorm strategies for tackling the numerous challenges that stand between us and open science.

But since most of you won’t be able to attend, I invite your comments regarding the research ecosystem and scientific communication below.

Among the questions I’d appreciate your input on are the following:

  • Should the scientific literature remain at the center of the discovery process, with the related research elements accessible from there?
  • How might preprints, which provide early looks into studies’ findings, serve as a model for the early disclosure and discussion of research methods?
  • What roles and imprimaturs could be afforded by a “publisher” of data?
  • What services should NLM institute to help make its collections and data FAIR (Findable, Accessible, Interoperable, and Reusable)?
  • Where should NLM invest its resources to accelerate the discovery, use, and impact of scientific communication?

Let’s spark a debate here that will rival the best the Journal Summit has to offer!


Help NLM and NIH Shape a Data-Driven Future

It’s a busy and exciting time for the National Library of Medicine and the National Institutes of Health.

This week we released NLM’s strategic plan, A Platform for Biomedical Discovery and Data-Powered Health. Concurrently, the National Institutes of Health announced a draft Strategic Plan for Data Science. The intersection of these two important documents demonstrates the alignment of the NLM vision with the overall thrust at NIH to transform discovery into health.

Positioning NLM for the Future

Representing the work of hundreds of NLM staff, national experts, and commenters from around the world, the NLM strategic plan lays out our current challenges and positions us to address these and emerging issues in biomedical research and public health.

From the need to be present in all environments where health and health care occur—and not just in structured, clinical settings—to the changing nature of libraries and how people pursue information, NLM is ready to embrace the spirit of open science and deliver on the promise of data-driven discovery.

As I’ve noted in previous blog posts, we’re going to get there by building on three pillars:

  • Establishing NLM as a platform for data-driven discovery and health
  • Reaching new users in new ways
  • Enhancing workforce excellence from citizens to scientists

So, what does that mean we will be doing?

We’ve already begun making data more accessible by allowing researchers to deposit data files as supplements to manuscripts they submit to PubMed Central. We’re helping to build the NIH Data Commons and working across NIH to improve identity and access management.

We’ve launched a new research program to devise ways to bring the power of data science into the hands of patients, and we’ll be investing further in data science training for librarians, biomedical researchers, and the bioinformatics community.

We’re also envisioning new research horizons.

We will be investing in novel approaches to curating data and literature, so we can make both more accessible, more quickly and efficiently. We’re working with investigators to build needed analytical and visualization tools that can be applied to many different data types. We will be stimulating research in how health information can be presented to the public in fresh and innovative ways. And we will be devising new methods for exploring the literature and linking the key research elements: proposals, data, literature, models, and pipelines.

But that’s just the beginning.

As you read the NLM Strategic Plan, let us know if you see yourself in it.

Are your needs around health information and data represented? Does our vision of a data-driven future sound like something that will energize your research or simplify your work?  Will we be delivering something you need and can use—whether that’s genomic databases and the tools to interrogate them; open resources for citizen-scientists; clear, interactive interfaces for librarians and their patrons; or insights into health care’s tech future for students? What more might we do?

Your comments are welcome and encouraged. Please submit them via the NLM Strategic Plan page.

NIH Strategic Plan for Data Science

NLM does not venture into a data-focused future alone. NIH also works in and advocates for a research world that is increasingly data-driven, and NIH leadership clearly sees and appreciates the scientific opportunities presented by advances in data science.

To capitalize on those opportunities, NIH is developing a Strategic Plan for Data Science. As Dr. Jon Lorsch explained recently to NLM’s Board of Regents, this plan addresses NIH’s overarching goals, strategic objectives, and implementation tactics to modernize what he termed the “NIH-funded biomedical data science ecosystem.”

NIH just published a draft of the strategic plan, along with a Request for Information, to seek input from stakeholders, including members of the scientific and academic communities, health professionals, patient, professional, and advocacy groups, the private sector, and interested members of the public.

I encourage your comments and suggestions on the NIH draft plan. Submit your responses online by March 30, 2018.


NLM Celebrates Fair Use

Guest post by NLM Associate Fellow Gabrielle Barr and NLM Copyright Group co-chairs Christie Moffatt and Rebecca Goodwin.

It’s Fair Use Week 2018, an annual event coordinated by the Association for Research Libraries (ARL) to celebrate the opportunities of fair use, including the many ways it supports biomedical research and the work we do here at NLM.

Fair use is a legal doctrine that asserts the right to use materials under copyright in a limited manner without the copyright holder first granting permission. In practice, fair use is a balance between the rights of copyright holders and the rights of researchers, authors, educators, students, artists, and others, as we work as a society to promote science, education, and the arts.

Section 107 of the US Copyright Act provides the details of fair use, but the University of Virginia Library nicely summed it up in only seven words: “Use fairly. Not too much. Have reasons.”

Infographic: Fair Use Promotes the Creation of New Knowledge

Libraries regularly champion fair use because of the way it supports research and education, but also because it enables libraries to fulfill their primary mission of providing and preserving information.

The same holds true here at NLM.

NLM’s fair use policies, based on ARL’s best practices, support access to library resources, encourage teaching and learning, allow preserving at-risk materials and collecting web-based content for future scholarship, and facilitate new modes of computational research and data-mining.

From digitizing content to building institutional repositories to creating physical and digital exhibitions, NLM applies fair use in a variety of ways. We maintain the NLM Digital Collections to provide access to historical books, photographs, videos, manuscripts, and maps. We collect web-based “born digital” content documenting major global health events such as the 2014 Ebola Outbreak. We digitize films for the History of Medicine Division’s (HMD) collection of Medical Movies on the Web, showcase materials in physical and online exhibitions, and promote our collections via blogs such as HMD’s Circulating Now. We incorporate copyrighted content into online courses and tutorials for NLM systems such as MEDLINE®, PubMed®, the Unified Medical Language System®, and the Value Set Authority Center. And we include stubs of proprietary clinical assessment instruments in the NIH Common Data Elements Repository to help researchers standardize clinical data.

Now NLM is considering how fair use can accommodate our evolving needs in the technology-rich and data-driven future.

The Library strongly supports the FAIR Data Principles, which call for data and other digital objects representing the products and processes of modern biomedical science to be Findable, Accessible, Interoperable, and Reusable (FAIR). And we rely increasingly on algorithms, APIs, computer software, searchable databases, and search engines that enable data mining for intellectual purposes.

While fair use can ensure access to and use of these tools and data, recent federal court decisions indicate that the intersection of copyright law with APIs and computer software remains part of the fair use frontier. Each new ruling has the potential to redefine current practice and requirements.

In this time of shifting sand, it’s no surprise that ARL’s forthcoming Code of Best Practices in Fair Use for Software Preservation (expected this fall) involves extensive research and interviews with software preservation experts and other stakeholders. Their ability to articulate the complex issues related to software and fair use could significantly impact libraries’ future work preserving today’s digital record.

In the meantime, NLM is forging ahead, applying fair use to advance medical education, biomedical research and discovery, and data-powered health.

We’d love to hear from other institutions on how you employ fair use and the steps you take to balance the rights of copyright holders with those of researchers, educators, and artists. Comment below or drop a note to the NLM Copyright Group.


Gabrielle Barr, MSI, is an NLM Associate Fellow. Before coming to NLM, she worked in the special collections of Norfolk Public Library and as a project assistant for the Health Sciences Library at the University of North Carolina at Chapel Hill. She received her master of science in information and a certificate in science, technology, and society from the University of Michigan in 2015.



Christie Moffatt, MLS, serves as co-chair of the NLM Copyright Group, manager of the Digital Manuscripts Program in the History of Medicine Division, and chair of the NLM Web Collecting and Archiving Working Group. She earned her master’s degree in library science at the University of North Carolina at Chapel Hill, with a concentration in archives and manuscripts.



Rebecca Goodwin, JD, serves as co-chair of the NLM Copyright Group and as a data science specialist in the Office of Health Information Programs Development. Previously, she served as special assistant to the director of the Lister Hill National Center for Biomedical Communications. She came to NIH in 2007 as a Presidential Management Fellow after earning her JD from the University of Florida Levin College of Law.