Upcoming Training Opportunity: University-based Training for Research Careers in Biomedical Informatics and Data Science

Guest blog by Valerie Florance, PhD, Director of NLM’s Division of Extramural Programs

Explore the Training

NLM’s Extramural Programs Division is a principal source of NIH funding for research training in biomedical informatics, which applies approaches from computer and information science to challenges in basic biomedical research, health care, and public health administration. NLM’s support fundamentally shapes the education, training, and advancement of biomedical informatics nationally. For decades, NLM has sponsored university-based training to prepare predoctoral and postdoctoral fellows for research careers. These programs support NLM’s long-term investment strategy to influence and advance the field of biomedical informatics and data science.

Last October, NLM published NOT-LM-21-001 in the NIH Guide for Grants and Contracts to give potential applicants sufficient time to develop meaningful collaborations and responsive projects. This program, a model among NIH training programs, advances training with big data in biomedical informatics and produces interdisciplinary researchers who fully comprehend the challenges of knowledge representation, decision support, translational research, human-computer interaction, and the social and organizational factors that influence effective adoption of health information technology in biomedical domains. The notice was the first step in a year-long process that will result in new 5-year grant awards beginning in July 2022. It outlines the expected timetable for publishing the funding opportunity announcement, accepting applications, reviewing them, and making awards.

The solicitation for new applications will be published in the NIH Guide for Grants and Contracts in March with applications due in May. For those interested in applying for an NLM training grant for the first time, we encourage a review of the previous solicitation to get a sense of the data and programmatic descriptions that are required for a training grant application.

Because issuance dates for the next competition are estimates, it is also helpful to subscribe to the weekly Table of Contents emails from the NIH Guide for Grants and Contracts. An added benefit of this weekly mailing is that it lists all new funding issuances from NIH, plus important notices about policy changes.

A Strong Foundation

NLM’s training programs offer graduate education and postdoctoral research experiences in a wide range of areas, including health care informatics, translational bioinformatics, clinical research informatics, public health informatics, and biomedical data science. Each of these programs offers a combination of core curriculum and electives. In the current 5-year cycle, seven programs also offer special tracks in environmental exposure informatics supported by NIH’s National Institute of Environmental Health Sciences.

A decades-old effort, the university-based training initiative is one of NLM’s signature grant programs. NLM’s training programs have produced many leaders in the field of biomedical informatics, and past trainees have taken positions in academia, industry, small businesses, health care organizations, and government. Currently, NLM supports 200 trainee positions at 16 universities around the United States. It also funds up to 40 short-term trainee positions each year, which help recruit college graduates to the field by providing introductory training and research opportunities. To develop a sense of community, NLM brings its trainees together each year (pandemic years excepted) for an annual conference hosted at one of the university sites.

You can find a map with links to descriptions of the current programs here. The website also provides links to information about past annual conferences – check out past agendas to get a sense of the broad scope of science across the field of biomedical informatics.

Attendees comparing notes at NLM Informatics Training Conference 2017 in La Jolla, California

Did you take part in this training? What was your favorite thing about this experience? What advice would you give to current students? How can we make the program even better?

 Dr. Florance heads NLM’s Extramural Programs Division, which is responsible for the Library’s grant programs and coordinates NLM’s informatics training programs. 

Biomedical Discovery through SRA and the Cloud

Guest post by Jim Ostell, PhD, Director of the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

NLM’s Sequence Read Archive (SRA) is used by more than 100,000 researchers every month, and those researchers now have a tremendous new opportunity to query this database of high-throughput sequence data in new ways for novel discovery: via the cloud. NLM has just finished moving SRA’s public data to the cloud, completing the first phase of an ongoing effort to better position these data for large-scale computing.  

To understand the importance of this move, it’s helpful to consider the analogy of how humans slowly improved their knowledge of the surface of the Earth.

The first simple maps allowed knowledge of terrain to be passed from people who had been there to those who hadn’t. Over the centuries, we learned to sail ships over the oceans and capture new knowledge in navigation charts and techniques. And we learned to fly airplanes over an area of interest and automatically capture detailed images of not only terrain, but also buildings and reservoirs, and assess the conditions of forest, field, and agricultural resources.

Today, with Earth-orbiting satellites, we no longer need to determine in advance what we want to view. We just photograph the whole Earth, all day, every day, and store all the data in a big database. Then we mine the data afterward. The significant change here is that not only can we follow, in great detail, locations or features on the Earth that we already know we’re interested in, as in aerial photography, but we can also discover new things of interest. Examples abound: noticing a change in a military base, and going back in time to see when the change began or how it developed; or seeing a decline in a forest or watershed, going back in time to see how this decline developed, and then looking geographically to see whether it’s happening in other places in the world.

Scientists also can develop new algorithms to extract information from the corpus, or collection, of information. For example, archeologists looking for faint straight-line features indicative of ancient walls or foundations can apply new algorithms to the huge body of existing data to suddenly reveal ancient buildings and cities that were previously unknown.

DNA sequencing has had a similar history, starting from the laborious sequencing of tiny bits of known genomes that could be analyzed by eye (like hand-drawn maps), to the targeting of specific organism genomes to be completely sequenced and then analyzed (similar to aerial photography), to the modern practice of high-throughput sequencing, in which researchers might sequence an entire bacterial genome to study only one gene because it’s easier and cheaper to just measure the whole thing.

However, the significant difference in this analogy is that the ability to search, analyze, or develop new algorithms to explore the huge corpus of high-throughput sequence data is not yet a routine practice accessible to most scientists — as it is for Earth-orbiting satellite data.

Today, scientists expect to be able to routinely explore the entire corpus of targeted genome sequence data through tools such as NLM’s Basic Local Alignment Search Tool (BLAST); very little of the scientific work with genome data is looking for a specific GenBank record. The major scientific work is done by exploring the data in fast, meaningful ways, asking questions such as “Has anyone else seen a protein like this before?”; “What organism is most like the organism I’m working on?”; “Where else has a piece of sequence like this been collected?”; “Is anything known about the function of a piece of sequence like this?” But it has not been possible to do that for the high-throughput, unassembled sequence data, across all such sequences, because that corpus of data has been too big for all but a few places in the world to hold, or to compute across.
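To make the first of those questions concrete, here is a minimal sketch of a programmatic BLAST query against NLM’s public BLAST service using Biopython. The query sequence and the choice of Biopython are illustrative assumptions, not details from this post, and web BLAST jobs can take a few minutes to return.

```python
# A minimal sketch of asking "has anyone seen a protein like this before?"
# via NLM's public BLAST servers, using Biopython (pip install biopython).
# The query sequence below is an arbitrary illustration.
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

# Submit a blastp search of the query against the nr protein database.
result_handle = NCBIWWW.qblast("blastp", "nr", query)
record = NCBIXML.read(result_handle)

# Print the closest matches and their E-values.
for alignment in record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:70]}  E={best_hsp.expect:.2e}")
```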

This is now changing.

With support from the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM’s National Center for Biotechnology Information (NCBI) has moved the publicly available high-throughput sequence data from its SRA archive onto two commercial cloud platforms, Google Cloud and Amazon Web Services. For the first time in history, it’s now possible for anyone to compute across this entire 5-petabyte corpus at will, with their own ideas and tools, opening the door to the kind of revolution that was sparked by the availability of a complete corpus of Earth-orbiting satellite images.

The public SRA data include genomes of viruses, bacteria, and nonhuman higher organisms, as well as gene expression data, metagenomes, and a small amount of human genome data that is consented to be public (from the 1000 Genomes Project). NCBI has held, and will continue to hold, codeathons to introduce small groups of scientists to exploring these data in the cloud. For example, during a recent codeathon, participants worked with a set of metagenomes to try to identify known and novel viruses. Other upcoming codeathon cloud topics include RNA-seq, pangenomics, haplotype annotation, and prokaryotic annotation.

Now that the publicly available SRA data are in the cloud, the next milestone is to make all of SRA’s controlled-access human genomic data available on both cloud platforms. Providing access to these data requires a higher level of security and oversight than is required for the nonhuman and publicly available human data, and access must be accompanied by a platform for the authentication and authorization of users, which creates a host of other issues to address. This effort is being undertaken in concert with other major NIH human genome repositories, with guidance from NIH leadership, and with international groups such as the Global Alliance for Genomics and Health (GA4GH).

But, already, the publicly available SRA data are there for biological and computational scientists to take their first dive into the new world of sequence-based “Earth-orbiting satellite photography.” More and more — in research, in clinical practice, in epidemiology, in public health surveillance, in agriculture, in ecology and species diversity — we’ve seen the movement to “just sequence the whole thing.” Now we’ve taken the first step toward the necessary corollary: to “analyze it all afterward.”

In the coming weeks and months, NLM will be making further announcements about SRA in the cloud, with tutorials and updates on the availability of controlled-access human data. For those already familiar with operating on commercial clouds who would like a look at the SRA data in the cloud, you can get started today via the updated SRA toolkit.
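For those who want a feel for the mechanics, here is a minimal sketch of pulling one public run with the SRA Toolkit, driven from Python. It assumes the toolkit’s prefetch and fasterq-dump commands are installed and on your PATH; the accession is an arbitrary public example, not one named in this post.

```python
# A minimal sketch of retrieving a public SRA run with the SRA Toolkit
# (https://github.com/ncbi/sra-tools), called from Python via subprocess.
import subprocess

accession = "SRR000001"  # arbitrary public run accession, for illustration

# prefetch resolves the accession (from the cloud or NCBI) and downloads it.
subprocess.run(["prefetch", accession], check=True)

# fasterq-dump converts the downloaded run into FASTQ files for analysis.
subprocess.run(["fasterq-dump", accession, "--outdir", "fastq"], check=True)

print(f"FASTQ files for {accession} written to ./fastq")
```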


Dr. Ostell has had a leadership position at NCBI since its inception in 1988. Before assuming the role of Director in 2017, he was Chief of NCBI’s Information Engineering Branch, where he was responsible for designing, developing, building, and deploying the majority of production resources at NCBI, including flagship products such as PubMed and GenBank. Dr. Ostell was elected to the Institute of Medicine of the U.S. National Academies in 2007 and named an NIH Distinguished Investigator in 2011.

To stay up to date on NCBI projects and research, follow us on Twitter.


Taking Flight: NLM’s Data Science Journey

Guest post by the Data Science @NLM Training Program team.

Data science at NLM is ready to soar!

In 2018, we embarked on a journey to build a workforce ready to take on the challenges of data-driven research and health, and earlier this year we shared our plans for accelerating data science expertise at NLM. Now, it’s time to reflect on our progress and recognize our accomplishments.

Our Data Science @NLM Training Program Open House, held last week, showcased some of the great data science work happening across the Library. We learned from each other and discovered new opportunities to strengthen the Library’s proficiencies in working with data and using analytic tools, furthering NLM’s research practices and services.

Data Science @NLM Poster Gallery

A poster gallery featuring 77 research posters and data visualizations provided a snapshot of the many ways that NLM staff apply data science to their work. It was great to see so many NLM staff sharing their work and engaging in stimulating conversations about innovation.

Three “lightning” presentations gave a glimpse of how NLM staff use data science. NLM Data Science and Open Science Librarian Lisa Federer, PhD, MLIS, talked about building a librarian workforce to engage with researchers on open science and data science. NLM’s Rezarta Islamaj, PhD, and Donald Comeau, PhD, presented their perspectives on enriching gene and chemical links in PubMed and PubMed Central and on evaluating Medical Subject Headings (MeSH) indexing for literature retrieval in PubMed.

The open house was also an opportunity for NLM staff who participated in an intensive 120-hour data science fundamentals course to share what they learned and how they’re applying their new skills.  

But this event was more than a celebration of accomplishments. It provided space to reflect on lessons learned, on how to apply what we’ve learned day to day, and on hopes for the future of data science at NLM. Dina Demner-Fushman, MD, PhD, of NLM dove into data science methodologies in her discussion of the Biomedical Citation Selector (BmCS), a high-recall machine-learning system that identifies articles requiring indexing for selectively indexed MEDLINE journals.
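BmCS itself is not spelled out here, but the core “high recall” idea, lowering a classifier’s decision threshold until it misses almost no in-scope articles, can be sketched generically. The scikit-learn example below uses toy data and placeholder features; it illustrates the technique only and is not NLM’s actual BmCS implementation.

```python
# An illustrative high-recall text classifier: train, then lower the
# decision threshold until recall on held-out data meets a target,
# trading precision for missing almost nothing. Toy data throughout;
# this is not NLM's BmCS.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy corpus: article text and whether it needs MEDLINE indexing.
texts = ["gene expression in tumor cells", "hospital parking policy update",
         "protein folding simulation study", "staff picnic announcement"] * 50
labels = np.array([1, 0, 1, 0] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0)

vec = TfidfVectorizer()
model = LogisticRegression()
model.fit(vec.fit_transform(X_train), y_train)

# Scan thresholds downward; keep the first that reaches the recall target.
probs = model.predict_proba(vec.transform(X_test))[:, 1]
for threshold in np.arange(0.5, 0.0, -0.05):
    preds = (probs >= threshold).astype(int)
    if recall_score(y_test, preds) >= 0.99:
        print(f"threshold={threshold:.2f}  "
              f"recall={recall_score(y_test, preds):.2f}  "
              f"precision={precision_score(y_test, preds):.2f}")
        break
```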

Data Science @NLM Ideas Booth

NLM staff brainstormed over 60 ideas to bring data science solutions to new and ongoing projects and talked with data science experts at the open house “ideas booth.” Staff also shared how they will learn, or continue to use, data science in support of their individual career goals.

We were delighted to see over 300 NLM staff participating in the open house, which is just one of the ways that NLM is working to achieve goal 3 of the NLM strategic plan to “build a workforce for data-driven research and health.”

The Data Science @NLM Training Program has helped increase NLM staff awareness of and expertise in data science. NLM staff are now better prepared than ever to demonstrate the Library’s commitment to accelerating biomedical discovery and data-powered health.   

Our data science journey continues, as does the growth of the data science community at NLM. For a recap of the day, follow the experience at #datareadynlm.

We’re taking off!


Data Science @NLM Training Program team (left to right):
Dianne Babski, Deputy Associate Director, Library Operations
Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information
Lisa Federer, Data Science and Open Science Librarian, Office of Strategic Initiatives
Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications

Data in the Scholarly Communications Solar System

Guest post by Kathryn Funk, program manager for NLM’s PubMed Central.

The Library of the Future. What will it look like?  The NLM Strategic Plan envisions it partly as “one of connections between and among literature, data, models, and analytical tools.” In this future, journal articles are no longer lone objects drifting in space, but, rather, each a solar system waiting to be explored. Indeed, we’re already seeing the published literature associated with datasets, clinical trials, protocols, software, earlier versions (including preprints), peer review documents, and so on through consistent identifiers and standardized publishing and archival practices.

To help researchers and the public navigate this new solar system, PubMed Central (PMC), NLM’s full-text archive of journal literature, has spent the past year collaborating with publishers and funders to support efficient ways of linking journal articles with associated data. We’re encouraging authors to cite their open datasets and publishers to archive those data citations and make them available in a machine-readable format. Though data citations represent only a small percentage of how PMC articles are linked to data (supplementary material remains the predominant method for associating data with articles in the archival record), their growth in the last year has been promising, with the number of articles carrying data citations nearly doubling (approximately 850 in 2017 vs. 440 in 2016). NLM is also supporting the public access policy requirements of our research funder partners by encouraging authors to deposit datasets as supporting documents via the NIH Manuscript Submission (NIHMS) system.

But solar systems, even the metaphorical kind, are meant to be explored, so we’re also working to expose each journal article’s solar system in a way that promotes discoverability. We want to make it easier to discover articles in PMC with associated data citations, data availability statements, and supplementary data through improved record displays and new search facets, leveraging the data-related search filters announced earlier this year.
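For readers who like to explore programmatically, here is a minimal sketch of querying PMC through NLM’s E-utilities. The filter term below is a placeholder standing in for the data-related filters mentioned above; consult that announcement for the exact, current filter names.

```python
# A minimal sketch of searching PMC via NLM's E-utilities esearch endpoint.
# The data-related filter term is a placeholder, not confirmed syntax.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pmc",
    "term": 'crispr AND "data citation"[filter]',  # placeholder filter name
    "retmode": "json",
    "retmax": 10,
}

resp = requests.get(ESEARCH, params=params, timeout=30)
resp.raise_for_status()
result = resp.json()["esearchresult"]
print(f"{result['count']} matching articles; first IDs: {result['idlist']}")
```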

NLM is also looking beyond datasets to archive and expose articles’ key satellites, including, for example, comments generated during the peer review process. As efforts to open up peer review gain traction, PMC staff have been collaborating with publishers and Crossref on standardized ways to make those peer review materials readily available.

As with any exploration of new solar systems, it’s our hope that taking these steps will help generate new knowledge and, in so doing, drive research that is reproducible, robust, transparent, and reusable. And as we move toward becoming the Library of the Future, how can we best support your research needs in connecting the literature with the rest of the research universe? Please let us know.

With thanks to Jeff Beck for the solar system analogy. 

Kathryn Funk is the program manager for PubMed Central. She is responsible for PMC policy as well as PMC’s role in supporting the public access policies of numerous funding agencies, including NIH. Katie received her master’s degree in library and information science from The Catholic University of America.

Public-Private Partnerships Will Accelerate Data-Driven Discovery

Big news today from NIH: the agency announced a new initiative to develop and test ways to best implement cloud services in support of biomedical research. Called STRIDES, for “Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability,” the initiative will allow NIH to explore the use of cloud environments to streamline NIH data use. By partnering with commercial cloud service providers, NIH expects to improve access to biomedical data and provide cost-effective cloud infrastructure, data storage, computation, and machine learning services for NIH and NIH-supported investigators.

I thank the many NLM staff members who contributed key knowledge to help shape this initiative, and I’m confident that NLM will have many opportunities to impact STRIDES’ success, whether by devising discovery mechanisms, developing tools for investigators, assigning metadata to incoming data sets, or developing effective ways to link those data sets to related publications.

The full NIH press release follows.

NIH makes STRIDES to accelerate discoveries in the cloud
Google Cloud first to join effort.

The National Institutes of Health has launched a new initiative to harness the power of commercial cloud computing and provide NIH biomedical researchers access to the most advanced, cost-effective computational infrastructure, tools and services available. The STRIDES (Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability) Initiative launches with Google Cloud as its first industry partner and aims to reduce economic and technological barriers to accessing and computing on large biomedical data sets to accelerate biomedical advances.

“NIH is in a unique position to bring together academic and innovation industry partners to create a biomedical data ecosystem that maximizes the use of NIH-supported biomedical research data for the greatest benefit to human health,” said NIH Principal Deputy Director Lawrence A. Tabak, DDS, PhD, who also serves as NIH’s interim Associate Director for Data Science. “The STRIDES Initiative aims to maximize the number of researchers working to provide the greatest number of solutions to advancing health and reducing the burden of disease.”

In line with NIH’s first-ever Data Science Strategic Plan released in June, STRIDES will establish additional innovative partnerships to broaden access to services and tools, including training for researchers to learn about the latest cloud tools and technologies. Services are expected to become available to the NIH-supported community after a series of pilot activities to refine policies and test and assess implementation approaches.

The initial agreement with Google Cloud creates a cost-efficient framework for NIH researchers, as well as researchers at more than 2,500 academic institutions across the nation receiving NIH support, to make use of Google Cloud’s storage, computing, and machine learning technologies. In addition, the partnership will involve collaborations with NIH’s Data Commons Pilot—a group of innovative projects testing new tools and methods for working with and sharing data in the cloud—and enable the establishment of training programs for researchers at NIH-funded institutions on how to use Google Cloud Platform.

“The volume of data generated in biomedical research labs across the world is growing exponentially,” said Gregory Moore, MD, PhD, Vice President, Healthcare, Google Cloud. “Through our partnership with NIH, we are bringing the power of data and the cloud to the biomedical research community globally. Together, we are making it easier for scientists and physicians to access and garner insights from NIH-funded data sets with appropriate privacy protections, which will ultimately accelerate biomedical research progress toward finding treatments and cures for the most devastating diseases of our time.”

A central tenet of STRIDES is that data made available through these partnerships will incorporate standards endorsed by the biomedical research community to make data Findable, Accessible, Interoperable, and Reusable (FAIR). NIH’s initial efforts will focus on making NIH high-value data sets more accessible through the cloud, leveraging partnerships to take advantage of data-related innovations such as machine learning and artificial intelligence, and experimenting with new ways to optimize technology-intensive research.

“By launching STRIDES, we clearly show our strong commitment to putting the most advanced cloud computing tools in the hands of scientists,” said Andrea T. Norris, NIH Chief Information Officer and director of NIH’s Center for Information Technology. “Beyond our partnership with Google Cloud, we will seek to add more industry partners to assure that NIH continues to be well poised to support the future of biomedical research.”