Guest post by Jim Ostell, PhD, Director of the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.
NLM’s Sequence Read Archive (SRA) is used by more than 100,000 researchers every month, and those researchers now have a tremendous new opportunity to query this database of high-throughput sequence data for novel discovery: via the cloud. NLM has just finished moving SRA’s public data to the cloud, completing the first phase of an ongoing effort to better position these data for large-scale computing.
To understand the importance of this move, it’s helpful to consider the analogy of how humans slowly improved their knowledge of the surface of the Earth.
The first simple maps allowed knowledge of terrain to be passed from people who had been there to those who hadn’t. Over the centuries, we learned to sail ships over the oceans and capture new knowledge in navigation charts and techniques. And we learned to fly airplanes over an area of interest and automatically capture detailed images of not only terrain, but also buildings and reservoirs, and assess the conditions of forest, field, and agricultural resources.
Today, with Earth-orbiting satellites, we no longer need to determine in advance what we want to view. We just photograph the whole Earth, all day, every day, and store all the data in a big database. Then we mine the data afterward. The significant change here is that not only can we follow, in great detail, locations or features on the Earth that we already know we’re interested in, as in aerial photography, but we can also discover new things of interest. Examples abound: noticing a change in a military base, and going back in time to see when the change began or how it developed; or seeing a decline in a forest or watershed, going back in time to see how this decline developed, and then looking geographically to see if it’s happening in other places in the world.
Scientists also can develop new algorithms to extract information from the corpus, or collection, of information. For example, archeologists looking for faint straight-line features indicative of ancient walls or foundations can apply new algorithms to the huge body of existing data to suddenly reveal ancient buildings and cities that were previously unknown.
DNA sequencing has had a similar history, starting from the laborious sequencing of tiny bits of known genomes that could be analyzed by eye (like hand-drawn maps), to the targeting of specific organism genomes to be completely sequenced and then analyzed (similar to aerial photography), to the modern practice of high-throughput sequencing, in which researchers might sequence an entire bacterial genome to study only one gene because it’s easier and cheaper to just measure the whole thing.
However, the significant difference in this analogy is that the ability to search, analyze, or develop new algorithms to explore the huge corpus of high-throughput sequence data is not yet a routine practice accessible to most scientists — as it is for Earth-orbiting satellite data.
Today, scientists expect to be able to routinely explore the entire corpus of targeted genome sequence data through tools such as NLM’s Basic Local Alignment Search Tool (BLAST); very little of the scientific work with genome data is looking for a specific GenBank record. The major scientific work is done by exploring the data in fast, meaningful ways, asking questions such as “Has anyone else seen a protein like this before?”; “What organism is most like the organism I’m working on?”; “Where else has a piece of sequence like this been collected?”; “Is anything known about the function of a piece of sequence like this?” But it has not been possible to do that for the high-throughput, unassembled sequence data, across all such sequences, because that corpus of data has been too big for all but a few places in the world to hold, or to compute across.
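Questions like "has anyone else seen a sequence like this before?" come down to fast approximate matching across a corpus. As a toy illustration only (not BLAST's actual algorithm, which uses seeded local alignment over highly compressed indexes, and nothing like NCBI's production code), a minimal k-mer inverted index conveys the idea:

```python
# Toy sketch: index sequences by their k-mers, then rank records
# by how many k-mers they share with a query sequence.
from collections import defaultdict

K = 5  # k-mer length; real tools use larger k and compressed structures

def kmers(seq, k=K):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(corpus):
    """Map each k-mer to the set of record IDs containing it."""
    index = defaultdict(set)
    for record_id, seq in corpus.items():
        for km in kmers(seq):
            index[km].add(record_id)
    return index

def query(index, seq, min_shared=2):
    """Return record IDs sharing at least min_shared k-mers with seq."""
    hits = defaultdict(int)
    for km in kmers(seq):
        for record_id in index.get(km, ()):
            hits[record_id] += 1
    return [r for r, n in hits.items() if n >= min_shared]

corpus = {
    "rec1": "ATGGCGTACGTTAGC",
    "rec2": "TTTTCCCCGGGGAAAA",
    "rec3": "ATGGCGTACGATAGC",
}
index = build_index(corpus)
print(sorted(query(index, "ATGGCGTACG")))  # → ['rec1', 'rec3']
```

The point of the cloud move is that this kind of index-and-query pattern, which is trivial for three records, can now be applied by anyone across the full petabyte-scale corpus without first downloading it.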
This is now changing.
With support from the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM’s National Center for Biotechnology Information (NCBI) has moved the publicly available high-throughput sequence data from its SRA archive onto two commercial cloud platforms, Google Cloud and Amazon Web Services. For the first time in history, it’s now possible for anyone to compute across this entire 5-petabyte corpus at will, with their own ideas and tools, opening the door to the kind of revolution that was sparked by the availability of a complete corpus of Earth-orbiting satellite images.
The public SRA data include genomes of viruses, bacteria, and nonhuman higher organisms, as well as gene expression data, metagenomes, and a small amount of human genome data that is consented to be public (from the 1000 Genomes Project). NCBI has held, and will continue to hold, codeathons to introduce small groups of scientists to exploring these data in the cloud. For example, during a recent codeathon, participants worked with a set of metagenomes to try to identify known and novel viruses. Other upcoming codeathon cloud topics include RNA-seq, pangenomics, haplotype annotation, and prokaryotic annotation.
Now that the publicly available SRA data are in the cloud, the next milestone is to make all of SRA’s controlled-access human genomic data available on both cloud platforms. Providing access to these data requires a higher level of security and oversight than is required for the nonhuman and publicly available human data, and access must be accompanied by a platform for the authentication and authorization of users, which creates a host of other issues to address. This effort is being undertaken in concert with other major NIH human genome repositories, with guidance from NIH leadership, and with international groups such as the Global Alliance for Genomics and Health (GA4GH).
But, already, the publicly available SRA data are there for biological and computational scientists to take their first dive into the new world of sequence-based “Earth-orbiting satellite photography.” More and more — in research, in clinical practice, in epidemiology, in public health surveillance, in agriculture, in ecology and species diversity — we’ve seen the movement to “just sequence the whole thing.” Now we’ve taken the first step toward the necessary corollary: to “analyze it all afterward.”
In the coming weeks and months, NLM will be making further announcements about SRA in the cloud, with tutorials and updates on the availability of controlled-access human data. For those already familiar with operating on commercial clouds who would like a look at the SRA data in the cloud, you can get started today via the updated SRA toolkit.
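For example, with the SRA toolkit installed and configured, fetching a public run and converting it to FASTQ looks roughly like the following (the accession here is an arbitrary placeholder; where the data come from, including the cloud mirrors, depends on your toolkit configuration):

```shell
# Fetch a public run by accession (SRR000001 is an arbitrary example),
# then convert it to FASTQ files in a local directory. A cloud-configured
# toolkit retrieves the data from the nearest cloud location.
prefetch SRR000001
fasterq-dump SRR000001 --outdir fastq/
```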
Dr. Ostell has held a leadership position at NCBI since its inception in 1988. Before assuming the role of Director in 2017, he was Chief of NCBI’s Information Engineering Branch, where he was responsible for designing, developing, building, and deploying the majority of production resources at NCBI, including flagship products such as PubMed and GenBank. Dr. Ostell was inducted into the United States National Academies, Institute of Medicine, in 2007 and made an NIH Distinguished Investigator in 2011.
To stay up to date on NCBI projects and research, follow us on Twitter.