Reflections on the Work of the Research Data Alliance

The Research Data Alliance (RDA) is a community-driven, interdisciplinary, international organization dedicated to collaboratively building the social and technical infrastructure necessary for wide-scale data sharing and advancing open science initiatives. Just short of five years old, this group gathers twice a year at plenary meetings, the most recent just last week.

These are no big-lecture, hallway-conversation meetings. As I discovered in Berlin last week, they are working meetings, in the best sense of the phrase—where the work involves creating and validating the mechanisms and standards for data sharing. That work is done by volunteers from across disciplines—over 7,000 people engaged in small work groups, local activities, and conference-based sessions. These volunteers deliberate and construct standards for data sharing, and then establish strategies for testing and endorsing these standards and gaining community consensus and adoption—including partnering with notable standard-setting bodies such as ISO or IEEE.

Much of the work focuses on making data and data repositories FAIR (Findable, Accessible, Interoperable, and Reusable), which is something I’ve talked a lot about in this blog.

But RDA espouses a broader vision than the approach NLM has taken so far with data. Where we provide public access to full-text articles, some of which link to associated data, RDA advocates for putting all research-generated data in domain-specific, well-curated repositories.

To achieve that vision, RDA members are working to develop the following three key elements:

  • a schema to link data to articles,
  • a mechanism for citing data extracts, and
  • a way to recognize high-quality data repositories.

Right now, a single publisher may have 50 or 60 different ways of linking articles to data. That means that the estimated 25,000 publishers and 5,000 repositories that manage data have potentially millions of ways of accomplishing this task. Instituting a standardized schema to link data to articles would bring significant order and discoverability to this overwhelming diversity. That consistency would yield immediate benefits, tops among them making data findable and the links interoperable.
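To make the idea of a standardized link concrete, here is a minimal sketch of what one article-to-data link record could look like. It is an illustration only: the field names, the small relationship vocabulary, and the placeholder identifiers are all assumptions made for this example, not RDA's actual schema.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical controlled vocabulary; a real schema would define its own terms.
ALLOWED_RELATIONSHIPS = {"references", "is_supplemented_by", "is_derived_from"}


@dataclass(frozen=True)
class ArticleDataLink:
    """One article-to-dataset link. All field names are illustrative assumptions."""

    source_id: str      # persistent identifier of the article
    target_id: str      # persistent identifier of the dataset
    relationship: str   # how the article relates to the data
    link_provider: str  # who asserts the link, e.g. a publisher or repository


def make_link(source_id: str, target_id: str, relationship: str,
              link_provider: str) -> ArticleDataLink:
    """Build a link, rejecting relationship terms outside the shared vocabulary."""
    if relationship not in ALLOWED_RELATIONSHIPS:
        raise ValueError(f"unknown relationship: {relationship!r}")
    return ArticleDataLink(source_id, target_id, relationship, link_provider)


def to_json(link: ArticleDataLink) -> str:
    """Serialize the link in one predictable, machine-readable form."""
    return json.dumps(asdict(link), sort_keys=True)


# Placeholder identifiers, not real DOIs.
example_link = make_link(
    "doi:10.1234/example-article",
    "doi:10.1234/example-dataset",
    "is_supplemented_by",
    "Example Publisher",
)
print(to_json(example_link))
```

The point of the sketch is the shared shape: if every publisher and repository emitted links with the same fields and the same relationship terms, a single harvester could read them all.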

Efficient data citations will also be a boon to findability. RDA is working on developing dynamic data citations, which would provide persistent identifiers tying data extracts to their repositories and tracking different versions of the data. Machine-created and machine-readable, data citations would enhance rigor and reproducibility in research by ensuring the data generated in support of key findings remains accessible.
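As a rough illustration of what "machine-created and machine-readable" might mean in practice, the sketch below builds a citation for a data extract from the repository name, the data version, the query that produced the extract, and a checksum of the returned rows. Everything here is assumed for the example: the field names, the `cite:` identifier format, and the repository name are hypothetical, and a real persistent identifier would be minted by a registry rather than derived from a hash.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def _fingerprint(rows: list) -> str:
    """Checksum of the rows, using canonical JSON so equal rows hash equally."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DataCitation:
    """A citation for one data extract. Field names are illustrative assumptions."""

    repository: str       # where the data lives
    dataset_version: str  # which version of the data was queried
    query: str            # the selection that produced the extract
    retrieved_at: str     # UTC timestamp of the extraction
    checksum: str         # fingerprint of the extracted rows

    @property
    def identifier(self) -> str:
        """Hypothetical stand-in for a persistent identifier.

        Derived from what was cited (not when), so citing the same extract
        of the same version twice yields the same identifier.
        """
        key = "|".join(
            [self.repository, self.dataset_version, self.query, self.checksum]
        )
        return "cite:" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def cite_extract(repository: str, dataset_version: str, query: str,
                 rows: list) -> DataCitation:
    """Create a citation record for the rows returned by a query."""
    retrieved_at = datetime.now(timezone.utc).isoformat()
    return DataCitation(
        repository, dataset_version, query, retrieved_at, _fingerprint(rows)
    )


def still_matches(citation: DataCitation, rows: list) -> bool:
    """Check that rows retrieved later are the same extract that was cited."""
    return citation.checksum == _fingerprint(rows)
```

A reader who re-runs the stored query against the cited version can call `still_matches` to confirm they are looking at the same data the authors used, which is the reproducibility benefit described above.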

But linking to and tracking data won’t get us far if the data itself is untrustworthy.

To address that, RDA encourages well-curated repositories, but what exactly does that mean?

Certification provides one way of acknowledging the quality of a repository. RDA doesn’t sponsor a certification mechanism, but it recognizes several, including the CoreTrustSeal program.  (For more on data certification, see “A Primer on the Certifications of a Trusted Digital Repository,” by Dawei Lin from the NIH National Institute of Allergy and Infectious Diseases.)

But why does all this matter to NIH and to NLM specifically?

I came to the RDA meeting to explore complementary approaches to what NLM is already doing to curate and assign metadata to data. I was especially looking for guidance on how to handle new data types such as images and environmental exposures.

I got some of that, but I also learned that NLM has much to contribute to RDA’s work. Particularly given our expertise in clinical terminologies and literature languages, we add rich depth to the ways data and other resources can be characterized.

In addition, I learned that we at NLM and NIH face many of the same challenges as our global partners: efficiently managing legacy data while not constraining the future to the problems of the past; fostering the adoption of common approaches and standards when the benefit to the larger scientific community may be greater than the value to the individual investigator; coordinating a voluntary, community-led process that has mission-critical consequences; and creating a permanent home and support organization for the wide range of standards actually needed for data-driven discovery.

Finally, I learned that people participate in the work of RDA because it both draws on their expertise and advances their own scholarly efforts. In other words, it’s mutually beneficial. But after my time with the group last week, I suspect we all get more than we give. For NLM anyway—as we begin to implement our new strategic plan—RDA’s goal of creating a global data ecosystem of best practices, standards, and interoperable data infrastructures is encouraging and something to look forward to.

Next-Generation Data Science Research Challenges

NIH-funded research is rapidly becoming more and more data-driven. This is true whether that research is intramural or extramural or whether it is focused on solving concrete problems or advancing methodologies for specific domains.

Right now, NLM’s role in this data-driven research centers on developing scalable, sustainable, and generalizable methods for making biomedical data FAIR: Findable, Accessible, Interoperable, and Reusable.

Toward this end, NLM, on behalf of NIH, released last fall a Request for Information on Next-Generation Data Science Challenges in Health and Biomedicine. We sought community input on data science research initiatives that could address the key challenges researchers, clinicians, administrators, and others currently face. We invited suggestions for new data science research in six areas:

  • Data-driven Discovery
  • Data-driven Health Improvement
  • Advanced Data Management
  • Intelligent and Learning Systems for Health
  • Workforce Development and Diversity
  • New Stakeholder Partnerships

Fifty-three responses provided more than 180 pages of ideas and suggestions.

The topic “Data-driven Discovery” prompted input focused on developing methods and tools to help researchers derive insights from data. These suggestions fell into a number of areas particularly relevant to NLM, including help with natural language processing; predictive analytics to help generate hypotheses from hidden patterns; ways to extract and formalize scientific claims and causal statements from publications; and improved ontologies.

Ideas related to improving health through data recommended developing algorithms tied to patient similarity to drive comparative effectiveness research; nuanced characterizations of phenotypes (including severity, degree and certainty); and strategies to address bias in health records used for research purposes.

Suggestions concerning managing data revealed a need to better capture and curate that data. These included smoothly integrating personal data from mobile devices into clinical workflows; automatically assigning standardized metadata to existing data sets and digital files; sharing open-source analytic methods; and developing technological platforms to help scientists store and analyze data.

Ideas for intelligent and learning systems ranged widely, from brain science research focused on learning and retention, to approaches for engaging users with health data, to building flexible learning modules.

Many contributors recognized the need to develop a data-skilled workforce, and their suggestions extended beyond simply increasing the number of data scientists. They called for reaching out to high school and undergrad students to equip them earlier with the foundational skills and education that would make training in data science interesting and feasible; creating core informatics and data science skills for all researchers; and infusing the PhD in health informatics with required coursework in computer science and statistics, along with health and biomedicine.

Suggestions for stakeholder partnerships included the names of specific associations, federal agencies, and companies, as well as a shout-out for working more closely with those citizen scientists interested in taking advantage of the growing supply of publicly available health data.

Clearly the research challenges of the future will need strong investments from across NIH.

The Library’s early contributions will be three-fold:

  1. Serve as an honest broker to create a trans-NIH statement of the data science and informatics skills essential for all federally supported trainees (those in NIH-funded research training programs located at universities as well as those supported by career grants).
  2. Stimulate research in advanced curation and information-integration methods.
  3. Accelerate the development of scalable, reusable, and generalizable visualization tools and analytical approaches.

What else do you think belongs in our portfolio? Chime in below.

Now is your chance to shape the next-generation research agenda for data science.

Request a summary of the responses to the RFI referenced in this post.

Dr. Valerie Florance, co-author of this post, serves as Director of the NLM Division of Extramural Programs. She also coordinates NLM’s informatics training programs.

The Research Ecosystem

On the future of scientific communication

This week I will be part of the research ecosystem panel at the National Academy of Sciences’ Journal Summit. The theme of the summit is “The Evolving Ecosystem of Scientific Publishing.”

The summit encourages audience discussion and debate, so I’m hoping my fellow panelists and I can elicit (or even incite!) lively and fruitful discussions among scientific editors, nonprofit publishers, researchers, funders, academic directors, librarians, and IT specialists. It will be great fun, I am sure.

Great fun and serious business.

As NLM’s Director, it’s my job to think about the Library’s role in fostering scientific communication. But what does “scientific communication” actually entail? Is it the same as the research literature?

Many think so, and, of course, NLM is well-known for providing access to the research literature through PubMed and PubMed Central—resources we build consciously and intentionally to ensure the included publications meet key criteria and the content is reliably available, whether online or in print.

But scientific communication is rapidly expanding beyond journal articles. In our corner of the world, for example, PubMed Central has been accepting small data files (less than 2 GB) along with submitted manuscripts since last October.

But I believe the realm of scientific communication will expand even further.

I see on the horizon an era of scientific communication influenced by the principles of open science, which support sharing not only the answer to a research hypothesis but also the products and processes used to get to that answer.

Two key practices will help usher in this new era:

  1. Communicating early and often. For example, because some platforms allow investigators to upload a range of research elements, including data collection instruments, analytical plans, and human subjects agreements, interim products of a research process can be available to others long before the final article is in place.
  2. Sharing all components of the research process, not simply summative reports. Last fall I suggested a library of models, properly documented and vetted, to allow researchers to apply existing and trustable models to their data. Data visualizations, source code, and videos might also prove useful. And you might have other ideas. (Please comment below and tell us about them.)

Such sharing of tools, products, and processes will save valuable time and money while also enhancing the rigor and reproducibility of the research itself by opening for examination all the procedural and methodological details. It also promises to speed innovation and knowledge transfer, which, you might say, are two of the key reasons for scientific communication in the first place.

But we still have so much to learn and to discuss. And, of course, NLM can’t shape the future of scientific communication alone.

That’s what makes this week’s Journal Summit so exciting. It’ll bring together many of the stakeholders in the research process to brainstorm strategies for tackling the numerous challenges that stand between us and open science.

But since most of you won’t be able to attend, I invite your comments regarding the research ecosystem and scientific communication below.

Among the questions I’d appreciate your input on are the following:

  • Should the scientific literature remain at the center of the discovery process, with the related research elements accessible from there?
  • How might preprints, which provide early looks into studies’ findings, serve as a model for the early disclosure and discussion of research methods?
  • What roles and imprimaturs could be afforded by a “publisher” of data?
  • What services should NLM institute to help make its collections and data FAIR (Findable, Accessible, Interoperable, and Reusable)?
  • Where should NLM invest its resources to accelerate the discovery, use, and impact of scientific communication?

Let’s spark a debate here that will rival the best the Journal Summit has to offer!


Help NLM and NIH Shape a Data-Driven Future

It’s a busy and exciting time for the National Library of Medicine and the National Institutes of Health.

This week we released NLM’s strategic plan, A Platform for Biomedical Discovery and Data-Powered Health. Concurrently, the National Institutes of Health announced a draft Strategic Plan for Data Science. The intersection of these two important documents demonstrates the alignment of the NLM vision with the overall thrust at NIH to transform discovery into health.

Positioning NLM for the Future

Representing the work of hundreds of NLM staff, national experts, and commenters from around the world, the NLM strategic plan lays out our current challenges and positions us to address these and emerging issues in biomedical research and public health.

From the need to be present in all environments where health and health care occur—and not just in structured, clinical settings—to the changing nature of libraries and how people pursue information, NLM is ready to embrace the spirit of open science and deliver on the promise of data-driven discovery.

As I’ve noted in previous blog posts, we’re going to get there by building on three pillars:

  • Establishing NLM as a platform for data-driven discovery and health
  • Reaching new users in new ways
  • Enhancing workforce excellence from citizens to scientists

So, what does that mean we will be doing?

We’ve already begun making data more accessible by allowing researchers to deposit data files as supplements to manuscripts they submit to PubMed Central. We’re helping to build the NIH Data Commons and working across NIH to improve identity and access management.

We’ve launched a new research program to devise ways to bring the power of data science into the hands of patients, and we’ll be investing further in data science training for librarians, biomedical researchers, and the bioinformatics community.

We’re also envisioning new research horizons.

We will be investing in novel approaches to curating data and literature, so we can make both more accessible more quickly and efficiently. We’re working with investigators to build needed analytical and visualization tools that can be applied to many different data types. We will be stimulating research in how health information can be presented to the public in fresh and innovative ways. And we will be devising new methods for exploring the literature and linking the key research elements: proposals, data, literature, models, and pipelines.

But that’s just the beginning.

As you read the NLM Strategic Plan, let us know if you see yourself in it.

Are your needs around health information and data represented? Does our vision of a data-driven future sound like something that will energize your research or simplify your work?  Will we be delivering something you need and can use—whether that’s genomic databases and the tools to interrogate them; open resources for citizen-scientists; clear, interactive interfaces for librarians and their patrons; or insights into health care’s tech future for students? What more might we do?

Your comments are welcome and encouraged. Please submit them via the NLM Strategic Plan page.

NIH Strategic Plan for Data Science

NLM does not venture into a data-focused future alone. NIH also works in and advocates for a research world that is increasingly data-driven, and NIH leadership clearly sees and appreciates the scientific opportunities presented by advances in data science.

To capitalize on those opportunities, NIH is developing a Strategic Plan for Data Science. As Dr. Jon Lorsch explained recently to NLM’s Board of Regents, this plan addresses NIH’s overarching goals, strategic objectives, and implementation tactics to modernize what he termed the “NIH-funded biomedical data science ecosystem.”

NIH just published a draft of the strategic plan, along with a Request for Information, to seek input from stakeholders, including members of the scientific and academic communities, health professionals, patient, professional, and advocacy groups, the private sector, and interested members of the public.

I encourage your comments and suggestions on the NIH draft plan. Submit your responses online by March 30, 2018.


Power and Finesse: The NLM Board of Regents

Running an operation as large and complex as the National Library of Medicine is a big job, but I don’t do it alone. In addition to my leadership team, I am privileged to have the NLM Board of Regents to help me.

Established in 1956 by the same Act that created the Library, the Board of Regents advises me on matters ranging from the acquisition of materials for the Library to the scope, content, and organization of NLM’s services to the rules governing access to those materials and services. The Board also makes recommendations for funding research and training in bioinformatics and educational technologies, suggests demonstration projects, and proposes ways to expand or enhance the biomedical communications network of which we are a part.

In short, the Board helps guide the overall work of the Library.

Given the diverse work we do and the breadth of topics we address, the Board’s membership includes leaders from across the library and life sciences, including medicine, public health, and health communications technology. They are joined by nine ex officio members whose positions read like a Who’s Who in health and librarianship, including four Surgeons General (Public Health Service, Army, Navy, and Air Force) and two national library directors (Library of Congress and National Agricultural Library).

Being in the room with them is like driving a Ferrari—things are moving fast but with finesse. And the power under the hood? Phenomenal.

It’s a blast.

With their various areas of expertise and different perspectives, Board members raise questions, highlight issues, or suggest innovations we hadn’t previously considered. Clinicians typically advocate for improvements to information management and delivery. Researchers point us towards important unsolved challenges. Consumer representatives voice the concerns and interests of patients and caregivers. Delegates from business help us leverage cutting-edge solutions coming out of private industry. And our ex officio members, as Federal partners, connect us to other parts of the government whose problems and constraints are similar to our own.

But the value of the Board is more than the individual members’ perspectives.

It’s the synergy that builds by bringing them together three times a year. It’s the lively conversations their close collaboration sparks, as they discuss NLM’s programs, services, and research initiatives. It’s their careful, considered deliberation of our research investments. And, most recently, it’s their collective effort in crafting our strategic plan for the coming decade.

Last week, after 16 months of activities involving over 500 experts and stakeholders, the Board endorsed that plan, positioning NLM for its third century. The plan envisions NLM as a platform for data-driven discovery and data-powered health, built upon three pillars:

  1. Accelerating discovery and advancing health through data-driven research
  2. Reaching more people in more ways through enhanced dissemination and engagement
  3. Building a workforce for data-driven research and health

Now the hard work begins.

Implementing the strategic plan will require fresh perspectives, new talents, and expanded resources. We will need to build a model of trust and accountability among our 1,700 women and men, encouraging them to fully contribute their skills and ideas and to envision their work in novel ways. We will have to make tradeoffs and set priorities. And as we work to make NLM’s bright future a reality, we will need to advocate for and embrace boldness and risk-taking.

Fortunately, we have the NLM Board of Regents to guide the way.

As their work proves, multiple perspectives spur innovation and creative problem-solving; collegiality supports accountability; and respectful advocacy—whether to each other, to the NIH Director, or to the Secretary of Health and Human Services—can lead to tremendous change for the greater good.  What more could we need to accelerate the progress towards our third century?!