See last week’s guest post, “Adventures of a Computational Biologist in the Genome Space,” for Part 1 of Dr. Koonin’s musings on the importance of computational analysis in biomedical discovery.
While the genomic revolution rolls on, a new one has been quietly brewing over the last decade or so, only to take over the science of microbiology in the last couple of years.
The name of this new game is metagenomics.
Metagenomics is concerned with the complex communities of microbes.
Traditionally, microbes have been studied in isolation, but to do that, a microbe or virus has to be grown in a laboratory. While that might sound easy, only an estimated 0.1% of the world’s microbes will grow in artificial media, and the success rate for viruses is even lower.
Furthermore, studying microbes in isolation can be somewhat misleading because they commonly thrive in nature as tightly knit communities.
Metagenomics addresses both problems by exhaustively sequencing all the microbial DNA or RNA from a given environment. This powerful, direct approach immensely expands the scope of biological diversity accessible to researchers.
But the impact of metagenomics is not just quantitative. Time and again, metagenomic studies result in discoveries with major biological implications and open up fresh experimental directions, because they look at microbes in their natural communities and are not restricted by the necessity to grow them in culture.
In virology, metagenomics has already become the primary route to new virus discovery. In fact, in a dramatic break from tradition, such discoveries are now formally recognized by the International Committee on Taxonomy of Viruses. This decision all but officially ushers in a new era, I think.
Here is just one striking example that highlights the growing power of metagenomics.
In 2014, Rob Edwards and colleagues at San Diego State University achieved a remarkable metagenomic feat. By sequencing multiple human gut microbiomes, they managed to assemble the genome of a novel bacteriophage, named crAssphage (for cross-Assembly). They then went on to show that crAssphage is, by a wide margin, the most abundant virus associated with humans.
This discovery was both a sensation and a shock. We had been completely blind to one of the key inhabitants of our own bodies—apparently because the bacterial host of the crAssphage would not grow in culture. Thus, some of the most common microbes in our intestines, and their equally ubiquitous viruses, represent “dark matter” that presently can be studied only by metagenomics.
But the crAssphage genome was dark in more than one way.
Once sequenced, it looked like nothing in the world. For most of its genes, researchers found no homologs in sequence databases, and even the few homologs identified shed little light on the biology of the phage. Furthermore, no one could establish any links to other phages or tell which proteins formed the crAssphage particle.
Such results understandably frustrate experimenters, but computational biologists see opportunity.
A few days after the crAssphage genome was published, Mart Krupovic of Institut Pasteur visited my lab, where we attempted to decipher the genealogies and functions of the crAssphage proteins using all computational tools available to us at the time. The result was sheer disappointment. We detected some additional homologies but could not shed much light on the phage’s evolutionary relationships or reproduction strategy.
We moved on. With so many other genomes to analyze, crAssphage dropped from our radar.
Then, in April 2017, Anca Segall, a sabbatical visitor in my lab, invited Rob Edwards to give a seminar at NCBI about crAssphage. After listening to Rob’s stimulating talk, and realizing that the genome of this remarkable virus remained terra incognita, we could not resist going back to the crAssphage genome armed with some new computational approaches and, more importantly, vastly expanded genomic and metagenomic sequence databases.
This time we got better results.
After about eight weeks of intensive computational analysis by Natalya Yutin, Kira Makarova, and me, we had fairly complete genomic maps for a vast new family of crAssphage-related bacteriophages. For all these phages, we predicted with good confidence the main structural proteins, along with those involved in genome replication and expression. Our work led to a paper we recently published in the journal Nature Microbiology. We hope and believe our findings provide a roadmap for experimental study of these undoubtedly important viruses.
Apart from the immediate importance of the crAss-like phages, this story delivers a broader lesson. Thanks to the explosive growth of metagenomic databases, the discovery of a new virus or microbe does not stop there. It brings with it an excellent chance to discover a new viral or microbial family. In addition, analyzing the gene sequences can yield interesting and tractable predictions of new biology. However, to take advantage of the metagenomic treasure trove, we must creatively apply the most powerful sequence analysis methods available, and novel ones may be required.
Put another way, if you know where and how to look, you have an excellent chance to see wonders.
As a result, I cannot help being unabashedly optimistic about the future of metagenomics. Fueled by the synergy between increasingly high-quality, low-cost sequencing, improved computational methods, and emerging high-throughput experimental approaches, the prospects appear boundless. There is a realistic chance we will know the true extent of the diversity of life on earth and get unprecedented insights into its ecology and evolution within our lifetimes. This is something to work for.
Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.