What Did You Do with Your Summer Vacation?

Well, if you are spending the summer at the NIH, you’ve likely been engaged in one of our many activities designed to improve access to critical data and advance our understanding of the human experience by linking data sets together. Today, we are inviting you to adopt some additional best practices for accessing controlled data in ways that support science and preserve privacy.

In 2020, the NIH Scientific Data Council charged its Working Group for Streamlining Access to Controlled Data to spend a year in dialogue within the NIH and with our extramural colleagues to better understand the experiences of scientists and the factors that facilitate or impede access to data. The group also considered where in the research process NIH should inform, engage, and obtain consent from participants in order to support science driven by access to controlled datasets.

NIH stores and facilitates access to many datasets, both open and controlled, with the goal of accelerating new discoveries and thereby maximizing taxpayer return on investment in the collection of these datasets. Data derived from humans that are shared through controlled-access mechanisms reflect NIH’s commitment to protect sensitive data and honor the informed consent provided by research participants in NIH-supported studies.

NIH has supported multiple controlled-access data repositories that uphold appropriate protections for human and other sensitive data while meeting the needs of various researcher communities. However, as data access requests increase, new repositories are established, and new mechanisms of providing access to data are developed, it is apparent that opportunities remain to improve efficiency and harmonization among repositories, to make NIH-supported controlled-access data more FAIR (Findable, Accessible, Interoperable, and Reusable), and to ensure appropriate oversight when data from different resources are combined. While these trends enable datasets and data types to be combined in new ways that advance science, combining datasets and data types that may or may not be individually controlled can create inadvertent re-identification risks.

To help the agency address these issues in a way that is responsive to community needs, we are hosting a series of webinars through the end of July. We call these “breakout sessions” because they follow an outstanding webinar presented on July 9, available here. Richard Hodes, MD, director of the National Institute on Aging, launched the 3-hour seminar with a talk titled Opportunities for Advancing Research Through Better Access to Controlled Data. Ana Navas-Acien, MD, PhD, brought the perspective of Indigenous communities and other communities traditionally underrepresented in research, emphasizing themes of community engagement and broadening the consent framework to consider community-level accountabilities as well as individual assent. Lucila Ohno-Machado, MD, MBA, PhD, addressed privacy-preserving distributed analytics as a strategy to promote science while preserving data privacy. Hoon Cho, PhD, described privacy-enhancing computational approaches.

You can find the schedule for the breakout sessions below. These sessions are specifically designed to listen to the expectations, hopes, and concerns from researchers and participants. These webinars are free and open to the public; registration is required.

Breakout Session on “Making Controlled-Access Data Readily Findable and Accessible” on July 22 from 3 pm to 5:30 pm EST

Breakout Session on “General Opportunities for Streamlining Access to Controlled Data” on July 26 from 12:30 pm to 2 pm EST

Breakout Session on “Addressing Oversight, Governance, and Privacy Issues in Linking Controlled Access Data from Different Resources” on July 28 from 3 pm to 5:30 pm EST

To generate interest and hear from the broadest possible group of stakeholders, NIH has released a Request for Information on Streamlining Access to Controlled Data from NIH Data Repositories. Please note the closing date is August 9. We look forward to hearing from you! Please visit Streamlining Access to Controlled Data at the NIH for all of the information described in this post.

Finally, we would like to personally thank the many NIH staff members who serve on the working group:

  • Shu Hui Chen
  • Alicia Chou
  • Valentina Di Francesco
  • Greg Farber
  • Jamie Guidry Auvil
  • Nicole Garbarini
  • Lyric Jorgenson
  • Punam Mathur
  • Vivian Ota Wang
  • Jonathan Pollock
  • Rebecca Rodriguez
  • Alex Rosenthal
  • Steve Sherry
  • Julia Slutsman
  • Erin Walker
  • Alison Yao

I hope your summer vacation was as productive as ours!

(left to right)
Patricia Flatley Brennan, RN, PhD, NLM Director
Susan Gregurick, PhD, Associate Director for Data Science at NIH
Hilary S. Leeds, JD, Senior Health Science Policy Analyst for the Office of Science Policy at NIH

Data Science @ NLM Journey Continues and What We Have Learned!

Guest post by the Data Science @ NLM Training Program team.

As part of our effort to advance Goal 3 of the NLM Strategic Plan (“Build a workforce for data-driven research and health”), NLM launched the Data Science @ NLM (DS@NLM) Training Program in 2019 to help ensure that all staff are prepared to engage with and participate in NLM’s developing data science efforts.

Our efforts have stayed on track despite the changes caused by the COVID-19 pandemic, and we’re proud to highlight DS@NLM events held during the past year. We’re also sharing lessons learned throughout the training program, which are applicable to any individual or organization trying to help develop data science skills in the fields of health and biomedical information.

Earlier this month, we marked two years of the DS@NLM Training Program with a Spring Fling series of virtual events celebrating the data science training achievements of NLM staff.

Our Spring Fling kicked off with “lightning talk” presentations featuring several graduates of our intensive Data Science Fundamentals course, who shared their final class projects with NLM colleagues. Participants in our year-long Data Science Mentorship program also had the opportunity to present their Capstone projects. Our program mentees, who were mentored by NLM staff members, developed their data science skills by completing projects that applied data science techniques to help improve NLM operations.

What We’ve Learned:

Be responsive to specific needs; one size does NOT fit all.

Data plays a role in virtually everything we do at NLM, and as we aim to provide data training opportunities for staff working in many different areas, we recognize that different staff members have unique training needs. For some staff, such as our researchers, new training opportunities may hinge on their knowledge of machine learning; metadata specialists may have more need for data cleaning or text processing skills, while administrators may benefit more from learning about data visualization.

People also learn in different ways, be it through shorter webinars and workshops, longer intensive courses, or self-directed learning. The DS@NLM program provides a variety of activities to meet these needs, including opportunities for various skill levels and topics, from short webinars to on-demand classes to ten-week intensive training courses.

Be responsive to staff feedback; give people what they ask for.

To help us determine what to offer, we engaged directly with our audience, asking NLM staff what they needed and listening to their responses. Because of the wide variety of work done at NLM, receiving feedback from staff helped us better understand their specific training needs. While we cannot always offer individualized programs to meet every need, staff feedback always helps us discover new ideas for future programming.

Teaching skills is just the beginning; applying new skills is essential.

A key lesson learned from staff feedback is that teaching new data skills is important, but that alone is not enough; showing staff how to put newly acquired data skills to use in the real world and apply them to their work is just as important. Helping staff learn to apply data science techniques to their work transforms this new knowledge from theoretical to practical. The Data Science Mentorship Program, with its concluding Capstone project, is a great example of an opportunity for staff to both develop skills and practice applying them.

We applaud and celebrate all the hardworking staff from across NLM who have taken advantage of these training opportunities to advance the goal of building a workforce for data-driven research and health, both at NLM and throughout the biomedical and health sciences information world.

Share with us and others how you are helping your staff apply data science skills in your organization—do you have any lessons learned?

Data Science @ NLM Training Program team
 
Top Row (left to right)
Dianne Babski, Associate Director, Library Operations
Maria Collins, Data & Systems Liaison, Office of the Associate Director for Library Operations
Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information

Bottom Row (left to right):
Mike Davidson, Librarian, Office of Engagement and Training, Division of Library Operations
Lisa Federer, NLM Data Science and Open Science Librarian, Office of Strategic Initiatives
Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications

Upcoming Training Opportunity: University-based Training for Research Careers in Biomedical Informatics and Data Science

Guest blog by Valerie Florance, PhD, Director of NLM’s Division of Extramural Programs

Explore the Training

NLM’s Extramural Programs Division is a principal source of NIH funding for research training in biomedical informatics, applying approaches from computer and information science to challenges in basic biomedical research, health care, and public health administration. NLM’s support fundamentally shapes the education, training, and advancement of biomedical informatics nationally. For decades, NLM has sponsored university-based training for predoctoral and postdoctoral fellows to prepare them for research careers. These programs support NLM’s long-term investment strategy to help influence and shape the field of biomedical informatics and data science.

Last October, NLM published NOT-LM-21-001 in the NIH Guide for Grants and Contracts to allow potential applicants sufficient time to develop meaningful collaborations and responsive projects. This program, a model among NIH training programs, advances training in biomedical informatics with big data and produces interdisciplinary researchers who fully comprehend the challenges of knowledge representation, decision support, translational research, human-computer interaction, and the social and organizational factors that influence effective adoption of health information technology in biomedical domains. The notice was the first step in a year-long process that will result in new 5-year grant awards beginning in July 2022. It outlines the expected timetable for publishing the funding opportunity announcement, accepting applications, reviewing them, and making awards.

The solicitation for new applications will be published in the NIH Guide for Grants and Contracts in March with applications due in May. For those interested in applying for an NLM training grant for the first time, we encourage a review of the previous solicitation to get a sense of the data and programmatic descriptions that are required for a training grant application.

Because issuance dates for the next competition are estimates, it is also helpful to subscribe to the weekly Table of Contents emails from the NIH Guide for Grants and Contracts. The extra benefit of this weekly mailing is that it lists all new funding issuances from NIH plus important notices about policy changes.

A Strong Foundation

NLM’s training programs offer graduate education and postdoctoral research experiences in a wide range of areas, including health care informatics, translational bioinformatics, clinical research informatics, public health informatics, and biomedical data science. Each of these programs offers a combination of core curriculum and electives. In the current 5-year cycle, seven programs also offer special tracks in environmental exposure informatics supported by NIH’s National Institute of Environmental Health Sciences.

A decades-old effort, the university-based training initiative is one of NLM’s signature grant programs, and it has produced many leaders in the field of biomedical informatics. Past trainees have taken positions in academia, industry, small businesses, health care organizations, and government. Currently, NLM supports 200 trainee positions at 16 universities around the United States and provides funding each year for up to 40 short-term trainee positions that help recruit college graduates to our field by providing introductory training and research opportunities. To develop a sense of community among the trainees, NLM brings them together each year, except during the pandemic, for an annual conference hosted at one of the university sites.

You can find a map with links to descriptions of the current programs here. The website also provides links to information about past annual conferences – check out past agendas to get a sense of the broad scope of science across the field of biomedical informatics.

Attendees comparing notes at NLM Informatics Training Conference 2017 in La Jolla, California

Did you take part in this training? What was your favorite thing about this experience? What advice would you give to current students? How can we make the program even better?

 Dr. Florance heads NLM’s Extramural Programs Division, which is responsible for the Library’s grant programs and coordinates NLM’s informatics training programs. 

Biomedical Discovery through SRA and the Cloud

Guest post by Jim Ostell, PhD, Director of the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

NLM’s Sequence Read Archive (SRA) is used by more than 100,000 researchers every month, and those researchers now have a tremendous new opportunity to query this database of high-throughput sequence data in new ways for novel discovery: via the cloud. NLM has just finished moving SRA’s public data to the cloud, completing the first phase of an ongoing effort to better position these data for large-scale computing.  

To understand the importance of this move, it’s helpful to consider the analogy of how humans slowly improved their knowledge of the surface of the Earth.

The first simple maps allowed knowledge of terrain to be passed from people who had been there to those who hadn’t. Over the centuries, we learned to sail ships over the oceans and capture new knowledge in navigation charts and techniques. And we learned to fly airplanes over an area of interest and automatically capture detailed images of not only terrain, but also buildings and reservoirs, and assess the conditions of forest, field, and agricultural resources.

Today, with Earth-orbiting satellites, we no longer need to determine in advance what we want to view. We just photograph the whole Earth, all day, every day, and store all the data in a big database. Then we mine the data afterward. The significant change here is that not only can we follow, in great detail, locations or features on the Earth that we already know we’re interested in, as in aerial photography, but we can also discover new things of interest. Examples abound: for instance, noticing a change in a military base, and going back in time to see when the change began or how it developed; or seeing a decline in a forest or watershed, going back in time to see how this decline developed, and then looking geographically to see if it’s happening in other places in the world.

Scientists also can develop new algorithms to extract information from the corpus, or collection, of information. For example, archeologists looking for faint straight-line features indicative of ancient walls or foundations can apply new algorithms to the huge body of existing data to suddenly reveal ancient buildings and cities that were previously unknown.

DNA sequencing has had a similar history, starting from the laborious sequencing of tiny bits of known genomes that could be analyzed by eye (like hand-drawn maps), to the targeting of specific organism genomes to be completely sequenced and then analyzed (similar to aerial photography), to the modern practice of high-throughput sequencing, in which researchers might sequence an entire bacterial genome to study only one gene because it’s easier and cheaper to just measure the whole thing.

However, the significant difference in this analogy is that the ability to search, analyze, or develop new algorithms to explore the huge corpus of high-throughput sequence data is not yet a routine practice accessible to most scientists — as it is for Earth-orbiting satellite data.

Today, scientists expect to be able to routinely explore the entire corpus of targeted genome sequence data through tools such as NLM’s Basic Local Alignment Search Tool (BLAST); very little of the scientific work with genome data is looking for a specific GenBank record. The major scientific work is done by exploring the data in fast, meaningful ways, asking questions such as “Has anyone else seen a protein like this before?”; “What organism is most like the organism I’m working on?”; “Where else has a piece of sequence like this been collected?”; “Is anything known about the function of a piece of sequence like this?” But it has not been possible to do that for the high-throughput, unassembled sequence data, across all such sequences, because that corpus of data has been too big for all but a few places in the world to hold, or to compute across.
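To make this kind of corpus-wide question concrete, here is a minimal sketch of a programmatic BLAST search, assuming the Biopython package and NCBI’s public web BLAST service; the query sequence and result handling are illustrative placeholders, not part of the SRA work described in this post.

```python
# Minimal sketch: ask "has anyone seen a sequence like this before?" by
# submitting a nucleotide query to NCBI's public BLAST service via Biopython.
# The query sequence below is a made-up placeholder, not real data.
from Bio.Blast import NCBIWWW, NCBIXML

query_seq = "AGCTTAGCTAGCTACGGAGCTTATCGATCGATCGATCGATAGGCTTAG"  # hypothetical fragment

# qblast submits the search to NCBI's servers and returns an XML result handle
result_handle = NCBIWWW.qblast("blastn", "nt", query_seq)

# Parse the XML and report the top few matching records with their E-values
record = NCBIXML.read(result_handle)
for alignment in record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(alignment.title, best_hsp.expect)
```

Queries like this are routine against assembled, targeted sequence data; for the unassembled, high-throughput reads in SRA, no comparably routine capability has existed.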

This is now changing.

With support from the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM’s National Center for Biotechnology Information (NCBI) has moved the publicly available high-throughput sequence data from its SRA archive onto two commercial cloud platforms, Google Cloud and Amazon Web Services. For the first time in history, it’s now possible for anyone to compute across this entire 5-petabyte corpus at will, with their own ideas and tools, opening the door to the kind of revolution that was sparked by the availability of a complete corpus of Earth-orbiting satellite images.

The public SRA data include genomes of viruses, bacteria, and nonhuman higher organisms, as well as gene expression data, metagenomes, and a small amount of human genome data that is consented to be public (from the 1000 Genomes Project). NCBI has held, and will continue to hold, codeathons to introduce small groups of scientists to exploring these data in the cloud. For example, during a recent codeathon, participants worked with a set of metagenomes to try to identify known and novel viruses. Other upcoming codeathon cloud topics include RNA-seq, pangenomics, haplotype annotation, and prokaryotic annotation.

Now that the publicly available SRA data are in the cloud, the next milestone is to make all of SRA’s controlled-access human genomic data available on both cloud platforms. Providing access to these data requires a higher level of security and oversight than is required for the nonhuman and publicly available human data, and access must be accompanied by a platform for the authentication and authorization of users, which creates a host of other issues to address. This effort is being undertaken in concert with other major NIH human genome repositories, with guidance from NIH leadership, and with international groups such as the Global Alliance for Genomics and Health (GA4GH).

But, already, the publicly available SRA data are there for biological and computational scientists to take their first dive into the new world of sequence-based “Earth-orbiting satellite photography.” More and more — in research, in clinical practice, in epidemiology, in public health surveillance, in agriculture, in ecology and species diversity — we’ve seen the movement to “just sequence the whole thing.” Now we’ve taken the first step toward the necessary corollary: to “analyze it all afterward.”

In the coming weeks and months, NLM will be making further announcements about SRA in the cloud, with tutorials and updates on the availability of controlled-access human data. For those already familiar with operating on commercial clouds who would like a look at the SRA data in the cloud, you can get started today via the updated SRA toolkit.
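If you are experimenting on your own, the snippet below is a minimal sketch of retrieving a single public run with the SRA Toolkit’s prefetch and fasterq-dump commands, assuming the toolkit is installed and configured on your workstation or cloud instance; the accession shown is just a placeholder.

```python
# Minimal sketch: fetch one public SRA run and convert it to FASTQ using the
# SRA Toolkit command-line tools, assuming prefetch and fasterq-dump are on PATH.
import subprocess

accession = "SRR000001"  # placeholder public run accession

# prefetch resolves the accession and downloads the run data,
# pulling from a cloud location when one is available
subprocess.run(["prefetch", accession], check=True)

# fasterq-dump converts the downloaded run into FASTQ files for analysis
subprocess.run(["fasterq-dump", accession, "--outdir", "fastq"], check=True)
```

Running these commands from a compute instance located near the data, for example in the same cloud region, generally keeps transfers fast and inexpensive.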


Dr. Ostell has had a leadership position at NCBI since its inception in 1988. Before assuming the role of Director in 2017, he was Chief of NCBI’s Information Engineering Branch, where he was responsible for designing, developing, building, and deploying the majority of production resources at NCBI, including flagship products such as PubMed and GenBank. Dr. Ostell was inducted into the United States National Academies, Institute of Medicine, in 2007 and made an NIH Distinguished Investigator in 2011.

To stay up to date on NCBI projects and research, follow us on Twitter.


Taking Flight: NLM’s Data Science Journey

Guest post by the Data Science @NLM Training Program team.

Data science at NLM is ready to soar!

In 2018, we embarked on a journey to build a workforce ready to take on the challenges of data-driven research and health, and earlier this year we shared our plans for accelerating data science expertise at NLM. Now, it’s time to reflect on our progress and recognize our accomplishments.

Our Data Science @NLM Training Program Open House, held last week, showcased some of the great data science work happening across the Library. We learned from each other and discovered new opportunities to strengthen the Library’s proficiencies in working with data and using analytic tools, furthering NLM’s research practices and services.

Data Science @NLM Poster Gallery

A poster gallery featuring 77 research posters and data visualizations provided a snapshot of the many ways that NLM staff apply data science to their work. It was great to see so many NLM staff sharing their work and engaging in stimulating conversations about innovation.

Three “lightning” presentations gave a glimpse of how NLM staff use data science. NLM Data Science and Open Science Librarian Lisa Federer, PhD, MLIS, talked about building a librarian workforce to engage with researchers on open science and data science. NLM’s Rezarta Islamaj, PhD, and Donald Comeau, PhD, presented their perspectives on enriching gene and chemical links in PubMed and PubMed Central and on evaluating Medical Subject Headings (MeSH) indexing for literature retrieval in PubMed.

The open house was also an opportunity for NLM staff who participated in an intensive 120-hour data science fundamentals course to share what they learned and how they’re applying their new skills.  

But this event was more than a celebration of accomplishments. It provided space to reflect on lessons learned, on how to use what we’ve learned on a daily basis, and on hopes for the future of data science at NLM. Dina Demner-Fushman, MD, PhD, of NLM dove into data science methodologies in her discussion of the Biomedical Citation Selector (BmCS), a high-recall machine learning system that identifies articles from selectively indexed journals that require indexing for MEDLINE.

Data Science @NLM Ideas Booth

NLM staff brainstormed over 60 ideas to bring data science solutions to new and ongoing projects and talked with data science experts at the open house “ideas booth.” Staff also shared how they will learn, or continue to use, data science in support of their individual career goals.

We were delighted to see over 300 NLM staff participating in the open house, which is just one of the ways that NLM is working to achieve goal 3 of the NLM strategic plan to “build a workforce for data-driven research and health.”

The Data Science @NLM Training Program has helped increase NLM staff awareness of and expertise in data science. NLM staff are now better prepared than ever to demonstrate the Library’s commitment to accelerating biomedical discovery and data-powered health.   

Our data science journey continues, as does the growth of the data science community at NLM. For a recap of the day, follow the experience at #datareadynlm.

We’re taking off!


Data Science @NLM Training Program team (left to right):
Dianne Babski, Deputy Associate Director, Library Operations
Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information
Lisa Federer, Data Science and Open Science Librarian, Office of Strategic Initiatives
Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications