Exploring the Brave New World of Metagenomics

See last week’s post, “Adventures of a Computational Biologist in the Genome Space,” for Part 1 of Dr. Koonin’s musings on the importance of computational analysis in biomedical discovery.

While the genomic revolution rolls on, a new one has been quietly brewing over the last decade or so, only to take over the science of microbiology in the last couple of years.

The name of this new game is metagenomics.

Metagenomics is concerned with complex communities of microbes.

Traditionally, microbes have been studied in isolation, but to do that, a microbe or virus has to be grown in a laboratory. While that might sound easy, only 0.1% of the world’s microbes will grow in artificial media, with the success rate for viruses even lower.

Furthermore, studying microbes in isolation can be somewhat misleading because they commonly thrive in nature as tightly knit communities.

Metagenomics addresses both problems by exhaustively sequencing all the microbial DNA or RNA from a given environment. This powerful, direct approach immensely expands the scope of biological diversity accessible to researchers.

But the impact of metagenomics is not just quantitative. Over and again, metagenomic studies—because they look at microbes in their natural communities and are not restricted by the necessity to grow them in culture—result in discoveries with major biological implications and open up fresh experimental directions.

In virology, metagenomics has already become the primary route to new virus discovery. In fact, in a dramatic break from tradition, such discoveries are now formally recognized by the International Committee on Taxonomy of Viruses. This decision all but officially ushers in a new era, I think.

Here is just one striking example that highlights the growing power of metagenomics.

In 2014, Rob Edwards and colleagues at San Diego State University achieved a remarkable metagenomic feat. By sequencing multiple human gut microbiomes, they managed to assemble the genome of a novel bacteriophage, named crAssphage (for cross-Assembly). They then went on to show that crAssphage is, by a wide margin, the most abundant virus associated with humans.

This discovery was both a sensation and a shock. We had been completely blind to one of the key inhabitants of our own bodies—apparently because the bacterial host of the crAssphage would not grow in culture. Thus, some of the most common microbes in our intestines, and their equally ubiquitous viruses, represent “dark matter” that presently can be studied only by metagenomics.

But the crAssphage genome was dark in more than one way.

Once sequenced, it looked like nothing else in the world. For most of its genes, researchers found no homologs in sequence databases, and even the few homologs they did identify shed little light on the biology of the phage. Furthermore, no links to other phages could be established, nor was it clear which proteins formed the crAssphage particle.
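
To give a sense of what searching for homologs involves in practice, here is a minimal, illustrative Python sketch that submits a single predicted protein to NCBI’s BLASTP service via Biopython. The sequence and cutoffs are placeholders, and this is not the pipeline the crAssphage researchers used; it is simply the kind of search that keeps coming back empty for a “dark” genome.

```python
# Illustrative homology search for one predicted protein (not the authors' actual pipeline).
# Requires Biopython and network access to NCBI's BLAST servers.
from Bio.Blast import NCBIWWW, NCBIXML

# Placeholder sequence; in practice this would be a protein predicted from the phage genome.
query_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKSLPDAQ"

# Submit a remote BLASTP search against the non-redundant (nr) protein database.
result_handle = NCBIWWW.qblast(
    "blastp", "nr", query_protein,
    expect=1e-3,       # E-value cutoff; weaker matches are discarded
    hitlist_size=25,   # keep at most 25 candidate homologs
)

# Parse the XML result and report any putative homologs.
record = NCBIXML.read(result_handle)
if not record.alignments:
    print("No homologs detected at this threshold: a 'dark' protein.")
for alignment in record.alignments:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:70]}  E-value: {best_hsp.expect:.1e}")
```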

Such results understandably frustrate experimenters, but computational biologists see opportunity.

A few days after the crAssphage genome was published, Mart Krupovic of Institut Pasteur visited my lab, where we attempted to decipher the genealogies and functions of the crAssphage proteins using all computational tools available to us at the time. The result was sheer disappointment. We detected some additional homologies but could not shed much light on the phage evolutionary relationships or reproduction strategy.

We moved on. With so many other genomes to analyze, crAssphage dropped from our radar.

Then, in April 2017, Anca Segall, a sabbatical visitor in my lab, invited Rob Edwards to give a seminar at NCBI about crAssphage. After listening to Rob’s stimulating talk—and realizing that the genome of this remarkable virus remained terra incognita—we could not resist going back to the crAssphage genome, armed with some new computational approaches and, more importantly, vastly expanded genomic and metagenomic sequence databases.

This time we got better results.

After about eight weeks of intensive computational analysis by Natalya Yutin, Kira Makarova, and me, we had fairly complete genomic maps for a vast new family of crAssphage-related bacteriophages. For all these phages, we predicted with good confidence the main structural proteins, along with those involved in genome replication and expression. Our work led to a paper we recently published in the journal Nature Microbiology. We hope and believe our findings provide a roadmap for experimental study of these undoubtedly important viruses.

Apart from the immediate importance of the crAss-like phages, this story delivers a broader lesson. Thanks to the explosive growth of metagenomic databases, the discovery of a new virus or microbe does not stop there. It brings with it an excellent chance to discover a new viral or microbial family. In addition, analyzing the gene sequences can yield interesting and tractable predictions of new biology. However, to take advantage of the metagenomic treasure trove, we must creatively apply the most powerful sequence analysis methods available, and novel ones may be required.

Put another way, if you know where and how to look, you have an excellent chance to see wonders.

As a result, I cannot help being unabashedly optimistic about the future of metagenomics. Fueled by the synergy between increasingly high-quality, low-cost sequencing, improved computational methods, and emerging high-throughput experimental approaches, the prospects appear boundless. There is a realistic chance we will know the true extent of the diversity of life on earth and get unprecedented insights into its ecology and evolution within our lifetimes. This is something to work for.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Adventures of a Computational Biologist in the Genome Space

Guest post by Dr. Eugene Koonin, NLM National Center for Biotechnology Information.

More than 30 years ago, when I started my research in computational biology (yes, it has been a while), it was not at all clear that one could do biological research using computers alone. Indeed, the common perception was that real insights into how organisms function and evolve could only be gained in the lab or in the field.

As so often happens in the history of science, that all changed when a new type of data arrived on the scene. Genetic sequence information, the blueprint for building all organisms, gave computational biology a foothold it has never relinquished.

As early as the 1960s, some prescient researchers—first among them Margaret Dayhoff at Georgetown University—foresaw genetic sequences becoming a key source of biological information, but this was far from mainstream biology at the time. Through the 1980s, however, the trickle of sequences grew into a steady stream, and by the mid-1990s, the genomic revolution was upon us.

I still remember as if it were yesterday the excitement that overwhelmed me and my NCBI group in the waning days of 1995, when J. Craig Venter’s team released the first couple of complete bacterial genomes. Suddenly, the sequence analysis methods on which we and others had been working in relative obscurity had a role in trying to understand the genetic core of life. Soon after, my colleague, Arcady Mushegian, and I reconstructed a minimal cellular genome that attracted considerable attention, stimulating experiments that confirmed how accurate our purely computational effort had been.

Now, 22 years after the appearance of those first genomes, GenBank and related databases contain hundreds of thousands of genomic sequences encompassing millions of genes, and the utility and importance of computational biology are no longer a matter of debate. Indeed, biologists cannot possibly study even a sizable fraction of those genes experimentally, so, at the moment, computational analysis provides the only way to infer their biological functions.

Indeed, computational approaches have made possible many crucial biological discoveries. Two examples in which I and my NCBI colleagues have been actively involved are elucidating the architecture of the BRCA1 protein that, when impaired, can lead to breast cancer, and predicting the mode of action of CRISPR systems. Both findings sparked extensive experimentation in numerous laboratories all over the world. And, in the case of CRISPR, those experiments culminated in the development of a new generation of powerful genome-editing tools that have opened up unprecedented experimental opportunities and are likely to have major therapeutic potential.

But science does not stand still. Quite the contrary, it moves at an ever-accelerating pace and is prone to taking unexpected turns. Next week, I’ll explore one recent turn that has set us on a new path of discovery and understanding.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Have yourself a…

…wonderful holiday? …Merry Christmas? …Joyous Kwanzaa? …Happy New Year?

Regardless of which holiday you celebrate, I invite you to join me in extending heartfelt greetings to our families and friends, those with whom we work and those whom we serve.

I am mindful that my “Merry Christmas” might not evoke in others memories similar to my own, of childhood delights and family time. And I try to be especially aware of others this time of year by respecting their experiences and traditions.

In that spirit, I encouraged you last year to commit yourself, at least once in the coming year, to learning about the traditions of one of your colleagues—a practice that extends our holiday greetings year round!

So, what did you learn?

Personally, I made it a point to learn about NLM’s people and divisions located off the main NIH campus, whether in Bethesda, Rockville, or Virginia. (Meeting those of you working from afar or on alternative schedules is next year’s challenge!)

On one recent off-campus visit, I spent time with the Extramural Programs (EP) Division, which administers our grants program, including the university-based biomedical informatics and data science research training programs.

They are a dedicated and creative bunch. Not only do they manage an annual budget of almost $75 million and review, in conjunction with expert panels, over 900 grant proposals each year, but they can also seriously decorate for the holidays.

When I visited a few weeks ago, their office hallways were alight with bright Christmas greetings. The artful application of crepe paper, construction paper, and shiny ribbons gave me the sense of walking past a row of Christmas trees, though I’m sure the pine scent was all in my mind.

But these are grants folks. Competition is in their blood. So, many of EP’s 19 staff have turned holiday decor into sport, with a race to see whose decorations go up first and, of course, which are the most attractive. That’s a debate I’m staying out of, but I do appreciate how it adds to the sense of celebration and festivity in the workplace, and what it tells me about their camaraderie, spirit, and good humor.

So, what did you learn this year? Did you discover a colleague who celebrates Diwali, the Hindu festival of lights? Did you sit down to a Passover seder or break the Yom Kippur fast with someone? Or did you learn about a co-worker’s non-religious traditions, like a family reunion, an annual camping trip, or their favorite Thanksgiving dishes?

Such moments, when we step outside work talk and learn about each other, help forge positive connections and mutual respect, which, ultimately, are true hallmarks of the season.

Wishing you all good things, and may a sense of connection, richness, and celebration stay with you through the coming year!

Happy One Billion, PubMed Central!

The odometer on PubMed Central® turned over a slew of zeroes in October, when someone somewhere retrieved the ONE BILLIONTH article of 2017 from this free, full-text archive.

That’s one billion articles retrieved in less than 10 months—a breakneck pace on par with the iPhone App Store’s one billionth download, which took 9 months and 12 days back in 2009.

Astounding!

What makes PubMed Central (PMC) so popular?

Quality and quantity at a great price—all brought to you by a powerhouse partnership with publishers and research funders dedicated to making science more open and accessible.

PMC provides free permanent electronic access to the full text of over 4.6 million peer-reviewed biomedical and life sciences journal articles. It’s a digital counterpart to NLM’s extensive print journal collection, with the added advantage of being available 24/7 from around the globe.

Current articles follow one of two paths to get into PMC: they are deposited either by the journal publishers or by the authors themselves.

The first path delivers the lion’s share of articles to PMC. Over 2,400 journals have signed agreements to deposit directly to PMC the final published versions of some or all of their articles.

Authors, on the other hand, are commonly driven to deposit their peer-reviewed manuscripts by their agencies’ public access policies, which call for making federally funded research freely available to the public, generally within 12 months of publication.

At this point, aside from NIH, we’ve got 11 other organizations whose funded authors contribute a range of scientific findings to PMC, from sister agencies within HHS (Administration for Community Living, Agency for Healthcare Research and Quality, CDC, FDA, and the Office of the Assistant Secretary for Preparedness and Response) to other federal bodies (EPA, NASA, NIST, and the VA) to private research funders committed to information sharing and transparency (Bill & Melinda Gates Foundation and the Howard Hughes Medical Institute). The Department of Homeland Security will join this list early next year. In addition, our partner across the pond, Europe PMC, delivers content from 28 international funders.

All of that recent journal content is enriched by a deep well of historical articles spanning 200 years of biomedical research. Funding by the Wellcome Trust has enabled us to scan thousands of complete back issues of historically significant biomedical journals and make them freely available through PMC. That translates to more than 1.26 million articles—with more to come.

The result is an impressive collection of biomedical knowledge, most of it peer-reviewed and all of it freely available—even, in some cases, for text mining.

But as they say on TV, that’s not all.

As of October 2017, researchers funded by our partners can now deposit into PMC data and other supplementary files that support their published findings. It’s a move intended to nurture transparency, foster open science, and enhance reproducibility, while also facilitating data reuse—all key elements to the future of data-driven discovery we envision.

NLM is proud to work with the scientific community to bring this exciting scientific resource to the world.

So, congratulations, PubMed Central staff and every publisher and contributor who makes his or her work available this way! We couldn’t have reached this major milestone without you, and we look forward to reaching many more together.

Models: The Third Leg in Data-Driven Discovery

Considering a library of models

George Box, a famous statistician, once remarked, “All models are wrong, but some are useful.”

As representations or approximations of real-world phenomena, models, when done well, can be very useful. In fact, they serve as the third leg of the stool that is data-driven discovery, joining the published literature and its underlying data to give investigators the materials necessary to explore important dynamics in health and biomedicine.

By isolating and replicating key aspects within complex phenomena, models help us better understand what’s going on and how the pieces or processes fit together.

Because of the complexity of biomedicine, health care research must employ different kinds of models, depending on what’s being studied.

Regardless of the type used, however, models take time to build, because the model builder must first understand which elements of the phenomenon need to be represented. Only then can she select the appropriate modeling tools and build the model.

Tracking and storing models can help with that.

Not only would tracking models enable re-use—saving valuable time and money—but doing so would enhance the rigor and reproducibility of the research itself by giving scientists the ability to see and test the methodology behind the data.

Enter libraries.

As we’ve done for the literature, libraries can help document and preserve models and make them discoverable.

The first step in that is identifying and collecting useful models.

Second, we’d have to apply metadata to describe the models. Among the essential elements to include in such descriptions might be model type, purpose, key underlying assumptions, referent scale, and indicators of how and when the model was used.

The DOI and RRIDs highlighted in a current PubMed record.

We’d then need to apply one or more unique identifiers to help with curation. Currently, two different schemes provide principled ways to identify models: the Digital Object Identifier (DOI) and the Research Resource Identifier (RRID). The former provides a persistent, unique code to track an item or entity at an overarching level (e.g., an article or book). The latter documents the main resources used to produce the scientific findings in that article or book (e.g., antibodies, model organisms, computational models).

Just as clicking on an author’s name in PubMed can bring up all the articles he or she has written, these interoperable identifiers, once assigned to research models, make it possible to connect the studies employing those models.  Effectively, these identifiers can tie together the three components that underpin data-driven discovery—the literature, the supporting data, and the analytical tools—thus enhancing discoverability and streamlining scientific communication.
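
To make this concrete, here is a minimal Python sketch of what a catalog entry for a model might look like, combining the descriptive metadata suggested above with DOI and RRID fields. Every field name and identifier value below is a hypothetical illustration, not an NLM or PMC schema.

```python
# Hypothetical catalog entry for a research model (illustrative only; not an NLM schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelRecord:
    title: str
    model_type: str                  # e.g., "compartmental simulation", "statistical"
    purpose: str                     # the question the model was built to answer
    key_assumptions: List[str]       # assumptions a re-user must accept or revisit
    referent_scale: str              # e.g., "molecular", "cellular", "population"
    doi: str                         # persistent identifier for the model itself
    rrids: List[str] = field(default_factory=list)          # resources the model depends on
    used_in_dois: List[str] = field(default_factory=list)   # studies that employed the model

# Example entry; every value is made up for illustration.
example = ModelRecord(
    title="Toy epidemic-spread model",
    model_type="compartmental (SIR) simulation",
    purpose="Explore how contact rates shape outbreak size",
    key_assumptions=["homogeneous mixing", "fixed recovery rate"],
    referent_scale="population",
    doi="10.0000/example-model-doi",
    rrids=["RRID:SCR_000000"],
    used_in_dois=["10.0000/example-study-doi"],
)

print(f"{example.title} -> {example.doi}; resources: {', '.join(example.rrids)}")
```

A record along these lines, linked to PubMed entries through its DOI and RRIDs, is what would let the three legs of the stool point to one another.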

NLM’s long-standing role in collecting, organizing, and making available the biomedical literature positions us well to take on the task of tracking research models, but is that something we should do?

If so, what might that library of models look like? What else should it include? And how useful would this library of models be to you?

Photo credit (stool, top): Doug Belshaw [Flickr (CC BY 2.0) | erased text from original]