Using Comparative Genomics to Advance Scientific Discoveries

Guest post by Valerie Schneider, PhD, staff scientist at the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

In a post from earlier this year, A Journey to Spur Innovation and Discovery, I shared news of an exciting NIH-supported NLM initiative, now known as the NIH Comparative Genomics Resource (CGR). CGR, which supports eukaryotic organisms, is modernizing NIH resources and infrastructure to support research involving non-human organisms. This initiative will improve the data foundational to analyses that rely on comparisons of diverse genomes in NLM databases, increase its connectivity to related content, and facilitate the discovery and retrieval of this information. Just as researchers look to the data from these organisms to teach them about a wide range of fundamental biological processes underpinning human health, NLM relies on the research community to help inform the development and delivery of organism-agnostic core tools and interfaces for CGR so that it can best support these analyses.

Stakeholder feedback and engagement are central to the vision and ethos of the NLM Strategic Plan 2017-2027. Since the plan’s inception, NLM enterprises undertaken in support of our three primary goals have placed heavy emphasis on community connections in both their planning and execution. Likewise, understanding stakeholder needs is a fundamental element of CGR. With more than 19,000 genomes from over 8,500 species (excluding bacteria and viruses) found in our Assembly database, it’s clear that CGR’s user base will hail from a large and diverse collection of research organism communities. Within each community, there is diversity in the role CGR will play due to variability in the amount of genomic sequence available, as well as the existence of organism-specific data resources, such as community knowledge bases. Data consumers themselves are a heterogeneous population and represent different levels of research interests, education, bioinformatics expertise, and analysis needs.

CGR is using a multi-tiered and multi-faceted approach to ensure stakeholder requirements are understood and appropriately prioritized throughout the project duration. CGR is working to identify community-supplied genome-related data that can be integrated to enhance content supplied by NLM. Two governance bodies are playing important roles in this effort. A trans-NIH CGR steering committee provides strategic oversight by guiding CGR with respect to the priorities of NIH institutional stakeholders, and an NLM Board of Regents CGR working group is charged with helping engage with the scientific community and enlist them as partners in the development effort. Working group members have expertise in topics relevant to the CGR initiative, such as comparative genomic analysis, emerging large-scale genomics approaches, organism-centered research into general biological or disease processes, biological education, and workforce development.

We are developing a presence for CGR at scientific conferences and workshops to encourage partnerships with members of research communities and connect with attendees. A CGR-related talk given at the BioDiversity Genomics 2021 conference in September introduced a new cloud-based tool for improving genomic quality to be released in 2022 and identified researchers to serve as beta testers. Additional targeted outreach will be held independent of conferences to gather feedback and inform development.

The CGR project utilizes an iterative development process in which user testing is an integral element. Feedback gathered through these testing exercises is incorporated into the next development cycle. This approach ensures we remain engaged with the CGR target audience throughout the project by understanding their needs and providing a resource that is valuable to their research pursuits. For example, recent user testing of a prototype Basic Local Alignment Search Tool (BLAST) database engineered to support sequence queries seeking a broad distribution of organisms in the results taught us about other content that will need to be provided for proper interpretation of results.

NLM is poised to learn great things from our users as part of the CGR project. You can learn more about engagement opportunities by contacting us at info@ncbi.nlm.nih.gov. We value your input as we continue this journey together.

Valerie Schneider, PhD, is the deputy director of Sequence Offerings and the head of the Sequence Plus program. In these roles, she coordinates efforts associated with the curation, enhancement, and organization of sequence data, and oversees tools and resources that enable the public to access, analyze, and visualize biomedical data. She also manages NCBI’s involvement in the Genome Reference Consortium, the international collaboration tasked with maintaining the value of the human reference genome assembly.

Pursuing Data-Driven Responses to Public Health Threats

In my 11th grade civics class, I learned about how a bill becomes a law, and I’ll bet some of you can even remember the steps. Today, I want to introduce you to another way that the federal government takes action: executive orders. As head of the executive branch, the president can issue an executive order to manage operations of the federal government.

In light of the COVID-19 pandemic, President Biden has issued executive orders to accelerate the country’s ability to respond to public health threats.

This is where I come in. As Director of the National Library of Medicine (NLM) and a member of the leadership team of the National Institutes of Health, I’m part of a group developing the implementation plan for the Executive Order entitled Ensuring a Data-Driven Response to COVID-19 and Future High-Consequence Public Health Threats.

This order directs the heads of all executive departments and agencies to work on COVID-19 and pandemic-related data issues. This includes making data that is relevant to high-consequence public health threats accessible to everyone, reviewing existing public health data systems to issue recommendations for addressing areas for improvement, and reviewing the workforce capacity for advanced information technology and data management. And, like all good government work, a report summarizing findings and providing recommendations will be issued.

Since March 2021, I have been meeting 2 to 3 times a month with public health and health data experts across the U.S. Department of Health & Human Services (HHS). Our committee includes staff from the Office of the National Coordinator for Health Information Technology, Food and Drug Administration, Centers for Disease Control and Prevention, Centers for Medicare & Medicaid Services, and Office of the Assistant Secretary for Planning and Evaluation.

After creating a work plan, our group arranged briefings with many other groups, including public health officials from states and territories, representatives from major health care systems, and the public, among others. We reviewed many initiatives to promote open data, data sharing, and data protection across the government sphere. We learned about the challenges of developing and adopting data standards, and the ability of different groups to come together to make data more useful in preparing the country to anticipate and respond to high-consequence public health threats. We discussed future strategies for data management and data protection, new analytical models, and workforce development initiatives. Our working group provided a report to the Office of Science and Technology Policy (OSTP), handing the work off to the next team, which will carry it toward completion. In coordination with the National Science and Technology Council, OSTP will develop a plan for advancing innovation in public health data and analytics.

This was a beneficial experience for me, and I certainly learned a great deal. Implementing a public health response system requires engagement with many HHS divisions, each of which brings a unique perspective and experience. I also developed new relationships based on trust and collaboration with these colleagues. At NLM, we have experts in data standards and data collection, and we oversee vast data repositories, so we have substantial domain-specific knowledge to contribute. I drew frequently on the knowledge and expertise of NLM staff to inform the process through analyses of information and the preparation of reports. I am grateful for all who helped and supported me.

I believe our country is prepared to have the data necessary to prevent, detect, and respond to future high-consequence public health threats. This is yet another way that NLM is helping shape data-powered health for the future. What else can we do for you?

What Did You Do with Your Summer Vacation?

Well, if you are spending the summer at the NIH, you’ve likely been engaged in one of our many activities designed to access critical data and advance our understanding of the human experience by linking data sets together. Today, we are inviting you to engage in some additional best practices in accessing controlled data in ways that support science and preserve privacy.

In 2020, the NIH Scientific Data Council charged its Working Group for Streamlining Access to Controlled Data to spend a year engaging in dialogue within the NIH and with our extramural colleagues to better understand the experiences of scientists and the strategies that both facilitate and impede access to data. The group also considered where in the research process NIH should inform, engage, and gain consent of participants sufficiently to support science driven by access to controlled datasets.

NIH stores and facilitates access to many datasets, both open and controlled, with the goal of accelerating new discoveries and thereby maximizing taxpayer return on investment in the collection of these datasets. Data derived from humans that are shared through controlled-access mechanisms reflect NIH’s commitment to protect sensitive data and honor the informed consent provided by research participants in NIH-supported studies.

NIH has supported multiple controlled-access data repositories that uphold appropriate data protections for both human data and other sensitive data, while meeting the needs of various researcher communities. However, as data access requests increase, new repositories are established, and new mechanisms of providing access to data are developed, it is apparent that opportunities remain to improve efficiency and harmonization among repositories to make NIH-supported controlled-access data more FAIR (Findable, Accessible, Interoperable, and Reusable) and to ensure appropriate oversight when data from different resources are combined. While these trends are enabling datasets and datatypes to be combined in new ways that advance the science, datasets and datatypes, whether or not they are controlled, may create inadvertent re-identification risks when combined.

To help the agency address these issues in a way that is responsive to community needs, we are hosting a series of webinars through the end of July. We call these “breakout sessions” because they follow an outstanding webinar presented on July 9 available here. Richard Hodes, MD, director of the National Institute on Aging, launched the 3-hour seminar with a talk titled Opportunities for Advancing Research Through Better Access to Controlled Data. Ana Navas-Acien, MD, PhD, brought the perspective of Indigenous communities and other communities traditionally underrepresented in research, and she emphasized themes of community engagement and broadening the consent framework to consider community-level accountabilities as well as individual assent. Lucila Ohno-Machado, MD, MBA, PhD, addressed privacy-preserving distributed analytics as a strategy to promote science while protecting data. Hoon Cho, PhD, described privacy-enhancing computational approaches.

You can find the schedule for the breakout sessions below. These sessions are specifically designed to listen to the expectations, hopes, and concerns from researchers and participants. These webinars are free and open to the public; registration is required.

Breakout Session on “Making Controlled-Access Data Readily Findable and Accessible” on July 22 from 3 pm to 5:30 pm ET

Breakout Session on “General Opportunities for Streamlining Access to Controlled Data” on July 26 from 12:30 pm to 2 pm ET

Breakout Session on “Addressing Oversight, Governance, and Privacy Issues in Linking Controlled Access Data from Different Resources” on July 28 from 3 pm to 5:30 pm ET

To generate interest and hear from the broadest possible group of stakeholders, NIH has released a Request for Information on Streamlining Access to Controlled Data from NIH Data Repositories. Please note the closing date is August 9. We look forward to hearing from you! Please visit Streamlining Access to Controlled Data at the NIH for all of the information described in this post.

Finally, we would like to personally thank the many NIH staff members who serve on the working group:

  • Shu Hui Chen
  • Alicia Chou
  • Valentina Di Francesco
  • Greg Farber
  • Jamie Guidry Auvil
  • Nicole Garbarini
  • Lyric Jorgenson
  • Punam Mathur
  • Vivian Ota Wang
  • Jonathan Pollock
  • Rebecca Rodriguez
  • Alex Rosenthal
  • Steve Sherry
  • Julia Slutsman
  • Erin Walker
  • Alison Yao

I hope your summer vacation was as productive as ours!

(left to right)
Patricia Flatley Brennan, RN, PhD, NLM Director
Susan Gregurick, PhD, Associate Director for Data Science at NIH
Hilary S. Leeds, JD, Senior Health Science Policy Analyst for the Office of Science Policy at NIH

Data Science @ NLM Journey Continues and What We Have Learned!

Guest post by the Data Science @ NLM Training Program team.

As part of our effort to advance Goal 3 of the NLM Strategic Plan (“Build a workforce for data driven research and health”), NLM launched the Data Science @ NLM (DS@NLM) Training Program in 2019 to help ensure that all staff are prepared to engage with and participate in NLM’s developing data science efforts.

Our efforts have stayed on track despite the changes caused by the COVID-19 pandemic, and we’re proud to highlight DS@NLM events held during the past year. We’re also sharing lessons learned throughout the training program, which are applicable to any individual or organization trying to help develop data science skills in the fields of health and biomedical information.

Earlier this month, we marked two years of the DS@NLM Training Program with a Spring Fling series of virtual events celebrating the data science training achievements of NLM staff.

Our Spring Fling kicked off with “lightning talk” presentations featuring several graduates of our intensive Data Science Fundamentals course, who shared their final class projects with NLM colleagues. Participants in our year-long Data Science Mentorship program also had the opportunity to present their Capstone projects. Our program mentees, who were mentored by NLM staff members, developed their data science skills by completing projects that applied data science techniques to help improve NLM operations.

What We’ve Learned:

Be responsive to specific needs; one size does NOT fit all.

Data plays a role in virtually everything we do at NLM, and as we aim to provide data training opportunities for staff working in many different areas, we recognize that different staff members have unique training needs. New training opportunities for some staff, such as our researchers, may hinge on their knowledge of machine learning. Metadata specialists may have more need for data cleaning or text processing skills, while administrators may benefit more from learning about data visualization.

People also learn in different ways, be it through shorter webinars and workshops, longer intensive courses, or self-directed learning. The DS@NLM program provides a variety of activities to meet these needs, including opportunities for various skill levels and topics, from short webinars to on-demand classes to ten-week intensive training courses.

Be responsive to staff feedback; give people what they ask for.

To help us determine what to offer, we engaged directly with our audience, asking NLM staff what they needed and listening to their responses. Because of the wide variety of work done at NLM, receiving feedback from staff helped us better understand their specific training needs. While we cannot always offer individualized programs to meet every need, staff feedback always helps us discover new ideas for future programming.

Teaching skills is just the beginning; applying new skills is essential.

A key lesson learned from staff feedback is that teaching new data skills is important but not enough on its own; teaching staff how to put newly acquired skills to use in their own work is just as important. Helping staff learn to apply data science techniques to their work transforms this new knowledge from theoretical to practical. The Data Science Mentorship Program, with its concluding Capstone project, is a great example of an opportunity for staff to both develop skills and practice applying them.

We applaud and celebrate all the hardworking staff from across NLM who have taken advantage of these training opportunities to advance the goal of building a workforce for data driven research and health, both at NLM and throughout the biomedical and health sciences information world.

Share with us and others how you are helping your staff apply data science skills in your organization—do you have any lessons learned?

Data Science @ NLM Training Program team
 
Top Row (left to right)
Dianne Babski, Associate Director, Library Operations
Maria Collins, Data & Systems Liaison, Office of the Associate Director for Library Operations
Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information

Bottom Row (left to right):
Mike Davidson, Librarian, Office of Engagement and Training, Division of Library Operations
Lisa Federer, NLM Data Science and Open Science Librarian, Office of Strategic Initiatives
Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications

Upcoming Training Opportunity: University-based Training for Research Careers in Biomedical Informatics and Data Science

Guest blog by Valerie Florance, PhD, Director of NLM’s Division of Extramural Programs

Explore the Training

NLM’s Extramural Programs Division is a principal source of NIH funding for research training in biomedical informatics, applying approaches in computer and information science to challenges in basic biomedical research, health care, and public health administration. NLM’s support fundamentally shapes the education, training, and advancement of biomedical informatics nationally. For decades, NLM has sponsored university-based training for predoctoral and postdoctoral fellows to prepare them for research careers. These programs support NLM’s long-term investment strategy to help influence and impact the field of biomedical informatics and data science.

Last October, NLM published NOT-LM-21-001 in the NIH Guide for Grants and Contracts to allow potential applicants sufficient time to develop meaningful collaborations and responsive projects. This program, a model among NIH training programs, advances training with big data in biomedical informatics and produces interdisciplinary researchers who fully comprehend the challenges of knowledge representation, decision support, translational research, human-computer interaction, and social and organizational factors that influence effective adoption of health information technology in biomedical domains. This notice was the first step in a year-long process that will result in new 5-year grant awards that begin in July 2022. You’ll find the notice outlines the expected timetable for publishing the funding opportunity announcement, accepting applications, reviewing them, and making awards.

The solicitation for new applications will be published in the NIH Guide for Grants and Contracts in March with applications due in May. For those interested in applying for an NLM training grant for the first time, we encourage a review of the previous solicitation to get a sense of the data and programmatic descriptions that are required for a training grant application.

Because issuance dates for the next competition are estimates, it is also helpful to subscribe to the weekly Table of Contents emails from the NIH Guide for Grants and Contracts. The extra benefit of this weekly mailing is that it lists all new funding issuances from NIH plus important notices about policy changes.

A Strong Foundation

NLM’s training programs offer graduate education and postdoctoral research experiences in a wide range of areas including health care informatics, translational bioinformatics, clinical research informatics, public health informatics, and biomedical data science. Each of these programs offers a combination of core curriculum and electives. In the current 5-year cycle, seven programs also offer special tracks in environmental exposure informatics supported by NIH’s National Institute of Environmental Health Sciences.

A decades-old project, the university-based training initiative is one of NLM’s signature grant programs. NLM’s training programs have produced many leaders in the field of biomedical informatics. Past trainees have taken positions in academia, industry, small businesses, health care organizations, and government. Currently, NLM supports 200 trainee positions at 16 universities around the United States and provides funding each year for up to 40 short-term trainee positions that are used to help recruit college graduates to our field by providing introductory training and research opportunities. To develop a sense of community among the trainees, NLM brings them together each year, except during pandemic years, for an annual conference hosted at one of the university sites.

You can find a map with links to descriptions of the current programs here. The website also provides links to information about past annual conferences – check out past agendas to get a sense of the broad scope of science across the field of biomedical informatics.

Attendees comparing notes at NLM Informatics Training Conference 2017 in La Jolla, California

Did you take part in this training? What was your favorite thing about this experience? What advice would you give to current students? How can we make the program even better?

Dr. Florance heads NLM’s Extramural Programs Division, which is responsible for the Library’s grant programs and coordinates NLM’s informatics training programs.