Exploring the Brave New World of Metagenomics

See last week’s post, “Adventures of a Computational Biologist in the Genome Space,” for Part 1 of Dr. Koonin’s musings on the importance of computational analysis in biomedical discovery.

While the genomic revolution rolls on, a new one has been quietly brewing over the last decade or so, only to take over the science of microbiology in the last couple of years.

The name of this new game is metagenomics.

Metagenomics is the study of complex communities of microbes.

Traditionally, microbes have been studied in isolation, but to do that, a microbe or virus has to be grown in a laboratory. While that might sound easy, only 0.1% of the world’s microbes will grow in artificial media, with the success rate for viruses even lower.

Furthermore, studying microbes in isolation can be somewhat misleading because they commonly thrive in nature as tightly knit communities.

Metagenomics addresses both problems by exhaustively sequencing all the microbial DNA or RNA from a given environment. This powerful, direct approach immensely expands the scope of biological diversity accessible to researchers.

But the impact of metagenomics is not just quantitative. Over and again, metagenomic studies—because they look at microbes in their natural communities and are not restricted by the necessity to grow them in culture—result in discoveries with major biological implications and open up fresh experimental directions.

In virology, metagenomics has already become the primary route to new virus discovery. In fact, in a dramatic break from tradition, such discoveries are now formally recognized by the International Committee on Taxonomy of Viruses. This decision all but officially ushers in a new era, I think.

Here is just one striking example that highlights the growing power of metagenomics.

In 2014, Rob Edwards and colleagues at San Diego State University achieved a remarkable metagenomic feat. By sequencing multiple human gut microbiomes, they managed to assemble the genome of a novel bacteriophage, named crAssphage (for cross-Assembly). They then went on to show that crAssphage is, by a wide margin, the most abundant virus associated with humans.
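The cross-assembly idea behind the name can be illustrated with a deliberately tiny Python sketch. Real studies use dedicated assemblers over millions of sequencing reads from many metagenomes; here the reads, samples, and overlap rule are all invented for illustration. Reads shared by every sample are pooled, then greedily merged wherever one read's suffix matches another's prefix:

```python
# Toy sketch of cross-assembly: sequences common to many samples
# (like an abundant phage) assemble into one contig.
# All sample data here are invented for illustration.

def assemble_greedy(reads, min_overlap=4):
    """Greedily merge reads that share a suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # dedupe, keep order
    merged = True
    while merged and len(contigs) > 1:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i == j:
                    continue
                a, b = contigs[i], contigs[j]
                # find the largest suffix of a matching a prefix of b
                for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                    if a.endswith(b[:k]):
                        contigs[i] = a + b[k:]
                        del contigs[j]
                        merged = True
                        break
                if merged:
                    break
            if merged:
                break
    return contigs

# Reads from three hypothetical gut samples; the shared phage sequence
# appears (in fragments) in all of them, host reads differ per sample.
samples = [
    ["ACGTACGG", "TACGGTTA", "GGCCAATT"],   # sample 1
    ["ACGTACGG", "TACGGTTA", "CCGGAATC"],   # sample 2
    ["ACGTACGG", "TACGGTTA", "TTGACCAG"],   # sample 3
]

# Cross-assembly pooling step: keep only reads seen in every sample.
pooled = set(samples[0])
for s in samples[1:]:
    pooled &= set(s)

contig = assemble_greedy(sorted(pooled))
print(contig)  # the two shared fragments merge into one contig
```

The real crAssphage assembly worked on the same principle at vastly larger scale: sequence present across many independent gut metagenomes co-assembled into a single ~97 kb genome.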

This discovery was both a sensation and a shock. We had been completely blind to one of the key inhabitants of our own bodies—apparently because the bacterial host of the crAssphage would not grow in culture. Thus, some of the most common microbes in our intestines, and their equally ubiquitous viruses, represent “dark matter” that presently can be studied only by metagenomics.

But the crAssphage genome was dark in more than one way.

Once sequenced, it looked like nothing else in the world. For most of its genes, researchers found no homologs in sequence databases, and even the few homologs identified shed little light on the biology of the phage. Furthermore, no links to other phages could be established, nor could anyone tell which proteins formed the crAssphage particle.
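At its crudest, homology detection asks how much of their sequence two proteins share. The sketch below, using invented protein fragments, scores similarity by shared k-mers (Jaccard index). Real searches, such as BLAST and the far more sensitive profile methods mentioned later in this story, go well beyond this, which is exactly why crAssphage's distant relatives initially went undetected:

```python
# Toy k-mer similarity, a crude stand-in for homology search.
# Sequences are invented protein fragments, for illustration only.

def kmer_set(seq, k=3):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=3):
    """Fraction of k-mers two sequences share (0 = none, 1 = identical)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / len(ka | kb)

query   = "MKTAYIAKQRQISFVK"   # hypothetical phage protein fragment
close   = "MKTAYLAKQRQISFVK"   # close homolog: one substitution
distant = "GWNDPLHSTCEGWNDP"   # unrelated sequence: no shared k-mers

print(jaccard(query, close))    # high: most k-mers shared
print(jaccard(query, distant))  # zero: nothing shared
```

Distant homologs sit between these extremes: too diverged for naive matching, detectable only when many related sequences are combined into statistical profiles.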

Such results understandably frustrate experimenters, but computational biologists see opportunity.

A few days after the crAssphage genome was published, Mart Krupovic of Institut Pasteur visited my lab, where we attempted to decipher the genealogies and functions of the crAssphage proteins using all computational tools available to us at the time. The result was sheer disappointment. We detected some additional homologies but could not shed much light on the phage evolutionary relationships or reproduction strategy.

We moved on. With so many other genomes to analyze, crAssphage dropped off our radar.

Then, in April 2017, Anca Segall, a sabbatical visitor in my lab, invited Rob Edwards to give a seminar at NCBI about crAssphage. After listening to Rob’s stimulating talk—and realizing that the genome of this remarkable virus remained a terra incognita—we could not resist going back to the crAssphage genome armed with some new computational approaches and, more importantly, vastly expanded genomic and metagenomic sequence databases.

This time we got better results.

After about eight weeks of intensive computational analysis by Natalya Yutin, Kira Makarova, and myself, we had fairly complete genomic maps for a vast new family of crAssphage-related bacteriophages. For all these phages, we predicted with good confidence the main structural proteins, along with those involved in genome replication and expression. Our work led to a paper we recently published in the journal Nature Microbiology. We hope and believe our findings provide a roadmap for experimental study of these undoubtedly important viruses.

Apart from the immediate importance of the crAss-like phages, this story delivers a broader lesson. Thanks to the explosive growth of metagenomic databases, the discovery of a new virus or microbe does not stop there. It brings with it an excellent chance to discover a new viral or microbial family. In addition, analyzing the gene sequences can yield interesting and tractable predictions of new biology. However, to take advantage of the metagenomic treasure trove, we must creatively apply the most powerful sequence analysis methods available, and novel ones may be required.

Put another way, if you know where and how to look, you have an excellent chance to see wonders.

As a result, I cannot help being unabashedly optimistic about the future of metagenomics. Fueled by the synergy between increasingly high-quality, low-cost sequencing, improved computational methods, and emerging high-throughput experimental approaches, the prospects appear boundless. There is a realistic chance we will know the true extent of the diversity of life on Earth and get unprecedented insights into its ecology and evolution within our lifetimes. This is something to work for.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Adventures of a Computational Biologist in the Genome Space

Guest post by Dr. Eugene Koonin, NLM National Center for Biotechnology Information.

More than 30 years ago, when I started my research in computational biology (yes, it has been a while), it was not at all clear that one could do biological research using computers alone. Indeed, the common perception was that real insights into how organisms function and evolve could only be gained in the lab or in the field.

As so often happens in the history of science, that all changed when a new type of data arrived on the scene. Genetic sequence information, the blueprint for building all organisms, gave computational biology a foothold it has never relinquished.

As early as the 1960s, some prescient researchers—first among them Margaret Dayhoff at Georgetown University—foresaw genetic sequences becoming a key source of biological information, but this was far from mainstream biology at the time. Through the 1980s, however, the trickle of sequences grew into a steady stream, and by the mid-1990s, the genomic revolution was upon us.

I still remember as if it were yesterday the excitement that overwhelmed me and my NCBI group in the waning days of 1995, when J. Craig Venter’s team released the first couple of complete bacterial genomes. Suddenly, the sequence analysis methods on which we and others had been working in relative obscurity had a role in trying to understand the genetic core of life. Soon after, my colleague, Arcady Mushegian, and I reconstructed a minimal cellular genome that attracted considerable attention, stimulating experiments that confirmed how accurate our purely computational effort had been.

Now, 22 years after the appearance of those first genomes, GenBank and related databases contain hundreds of thousands of genomic sequences encompassing millions of genes, and the utility and importance of computational biology are no longer a matter of debate. Indeed, biologists cannot possibly study even a sizable fraction of those genes experimentally, so, at the moment, computational analysis provides the only way to infer their biological functions.

Indeed, computational approaches have made possible many crucial biological discoveries. Two examples in which I and my NCBI colleagues have been actively involved are elucidating the architecture of the BRCA1 protein that, when impaired, can lead to breast cancer, and predicting the mode of action of CRISPR systems. Both findings sparked extensive experimentation in numerous laboratories all over the world. And, in the case of CRISPR, those experiments culminated in the development of a new generation of powerful genome-editing tools that have opened up unprecedented experimental opportunities and are likely to have major therapeutic potential.

But science does not stand still. Quite the contrary, it moves at an ever-accelerating pace and is prone to taking unexpected turns. Next week, I’ll explore one recent turn that has set us on a new path of discovery and understanding.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Calling on Librarians to Help Ensure the Credibility of Published Research Results

Guest post by Jennifer Marill, Kathryn Funk, and Jerry Sheehan.

The National Institutes of Health (NIH) took a simple but significant step Friday to protect the credibility of published findings from its funded research.

NIH Guide Notice OD-18-011 calls upon NIH stakeholders to help authors of scientific journal articles adhere to the principles of research integrity and publication ethics; identify journals that follow best practices promoted by professional scholarly publishing organizations; and avoid publishing in journals that do not have a clearly stated and rigorous peer review process. The notice identifies several resources authors can consult when considering publishing venues, including Think Check Submit, a publishing industry resource, and consumer information on predatory journals from the Federal Trade Commission.

Librarians have an especially important role to play in guiding researcher-authors to high-quality journals. Librarians regularly develop and apply rigorous collection criteria when selecting journals to include in their collections and make available to their constituents. Librarians promote high-quality journals of relevance to their local communities. As a result, librarians are extremely familiar with journal publishers and the journals their constituents use for research and publication.

The National Library of Medicine (NLM) is no exception. One of NLM’s important functions is to select journals for its collection. The journal guidelines in the NLM Collection Development Manual call for journals that demonstrate good editorial quality and elements that contribute to the objectivity, credibility, and scientific quality of their content. The Library expects journals and journal publishers to conform to guidelines and best practices promoted by professional scholarly publishing organizations, such as the recommendations of the International Committee of Medical Journal Editors and the joint statement of principles of the Committee on Publication Ethics, the Directory of Open Access Journals, the Open Access Scholarly Publishers Association, and the World Association of Medical Editors.

Criteria for accepting journals for MEDLINE or PubMed Central are even more selective, reflecting the considerable resources associated with indexing the literature and providing long-term preservation and public access to full-text literature. MEDLINE currently indexes some 5,600 journals; PubMed Central has about 2,000 journals that regularly submit their full content. PubMed Central is also the repository for the articles resulting from NIH-funded research.

For the most part, NIH-funded researchers do a good job of publishing in high-quality journals. More than 815,000 journal articles reporting on NIH-funded research have been made publicly accessible in PubMed Central since the NIH Public Access policy became mandatory in 2008. More than 90 percent of these articles are published in journals currently indexed in MEDLINE. The remainder are distributed across thousands of journals, some 3,000 of which have only a single article in PubMed Central. While many are quality journals with sound editorial practices, effective peer review, and scientific merit, it can often be difficult for a researcher-author to evaluate these factors.

That’s where local librarians can be of great assistance. And many already are—helping researchers at their local institutions select publishing venues.

If you have a good practice in your library, let us know about it so we can all learn how best to protect the credibility of published research results.

Jennifer Marill serves as chief of NLM’s Technical Services Division and the Library’s collection development officer. Kathryn Funk is a program manager and librarian for PubMed Central. And Jerry Sheehan is the Library’s deputy director.

Mining for Treasure, Discovering MEDLINE

Reusing a noteworthy dataset to great effect

Guest post by Joyce Backus and Kathel Dunn, both from NLM’s Division of Library Operations.

As shrinking budgets tighten belts at hospitals and academic institutions, medical libraries have come under scrutiny. In response, librarians have had to articulate the value they bring to the institution and to the customers—students, researchers, clinicians, or patients—they serve.

In 2011-2012, as such scrutiny swelled, Joanne Marshall and her team set out to study the very question these medical institutions faced: Do libraries add value? They collected 16,122 individual responses from health professionals at 118 hospitals served by 56 health libraries in the United States and Canada. The team sought to determine whether physicians, residents, and nurses perceived their libraries’ information resources as valuable and whether the information obtained impacted patient care.

The resulting article, “The Value of Library and Information Services in Patient Care,” published in 2013, gave medical librarians strong talking points, including the overall perceived value of libraries as time-savers that positively impact patient care.

Now the datasets from that study are being reused to great effect.

Over the last year we teamed up with Joanne Marshall and Amber Wells, both from the University of North Carolina at Chapel Hill, to dive into the data.

Our goal: to understand the value and impact of MEDLINE in medical libraries.

We re-discovered (as has been written about before) the value of MEDLINE in changing patient care. We also found its preeminent role shines even more brightly in a dataset like this one that includes other sources. We saw the significance of MEDLINE as a single source of information but also as a source used in combination with full-text journals, books, drug databases, websites, and colleague consultations.

We were reminded, too, of the importance of the National Network of Libraries of Medicine (NNLM) to our work; the trust in the NNLM; each library’s connectedness to the other; and how the everyday web of relationships prompts cooperation and collaboration, including the successful implementation of the value of libraries study itself.

For us this re-discovery comes at a key time, when we’re examining NLM products and services as part of the strategic planning process. We are actively identifying methodologies and tools to elevate all our collections—from datasets to incunabula—and make them greater engines of discovery in service of health.

But what about your library’s resources?

The data mining challenge we gave ourselves is our guide for medical librarians everywhere: look at your data, what’s in front of you, and then others’ data. What can they tell you about what’s happening now, what will likely happen in the future, what’s being used, and how it’s being used?

If you don’t know where to start, check out the Medical Library Association’s Research Training Institute, recommended research skills, and mentoring program. In addition, the NNLM’s site on program evaluation includes tools for determining cost benefit and return on investment.

Librarians positively impact health care and health care research. Now it’s time to have that same impact on our own profession. The data are there. It’s time we see what they have to tell us.

More information

Value of Library and Information Services in Patient Care Study


Lindberg DA, Siegel ER, Rapp BA, Wallingford KT, Wilson SR. Use of MEDLINE by physicians for clinical problem solving. JAMA. 1993; 269: 3124-9.

Demner-Fushman D, Hauser SE, Humphrey SM, Ford GM, Jacobs JL, Thoma GR. MEDLINE as a source of just-in-time answers to clinical questions. AMIA Annual Symposium Proceedings. 2006:190-4.

Sneiderman CA, Demner-Fushman D, Fiszman M, Ide NC, Rindflesch TC. Knowledge-based methods to help clinicians find answers in MEDLINE. Journal of the American Medical Informatics Association. 2007 Nov-Dec; 14(6):772-80.

Joyce Backus serves as the associate director for Library Operations at NLM. Kathel Dunn is the NLM Associate Fellowship coordinator.

Photo credit (ammonite, top): William Warby [Wikimedia Commons (CC BY 2.0)]

Addressing Health Disparities to the Benefit of All

Guest post by Lisa Lang, head of NLM’s National Information Center on Health Services Research and Health Care Technology

Singer-actress Selena Gomez shocked her fans this past September with the announcement that she had received a kidney transplant to combat organ damage caused by lupus.

Lupus, an autoimmune condition, strikes women far more often than men, with minority women especially vulnerable. Not only is lupus two to three times more common in African American women than in Caucasian women, but recent studies funded by the CDC suggest that, like Ms. Gomez, Hispanic and non-Hispanic Asian women are more likely to have lupus-related kidney disease (lupus nephritis)—a potentially fatal complication.

Documenting such health disparities is crucial to understanding and addressing them. Significantly, the studies mentioned above are the first registries in the United States with sufficient Asians and Hispanics involved to measure the number of people diagnosed with lupus within these populations.

Investment in research examining potential solutions for health care disparities is essential.

In 2014, The Lancet featured a study that examined patterns, gaps, and directions of health disparity and equity research. Jointly conducted by the Association of American Medical Colleges and AcademyHealth, a non-profit dedicated to enhancing and promoting health services research and a long-time NLM partner, the study examined changes in US investments in health equities and disparities research over time. Using abstracts in the NLM database HSRProj (Health Services Research Projects in Progress), the researchers found an overall shift in disparities-focused projects. From 2007 to 2011, health services research studies seeking to document specific disparities gave way to studies examining how best to alleviate such disparities. In fact, over half of the disparities-focused health services research funded in 2011 “aimed to reduce or eliminate a documented inequity.” The researchers also found significant differences in the attention given to particular conditions, groups, and outcomes. An update by AcademyHealth (publication forthcoming) found these differences continue in more recently funded HSR projects.

A more nuanced appreciation of affected groups is also critical to addressing health disparities. For example, the designation “Hispanic” is an over-simplification, an umbrella construct that obscures potentially important cultural, environmental, and even genetic differences we must acknowledge and appreciate if we are to maximize the benefits promised by personalized medicine. Reviews such as “Hispanic health in the USA: a scoping review of the literature” and “Controversies and evidence for cardiovascular disease in the diverse Hispanic population” highlight questions and conditions that would be informed by richer, more granular, data.

Lupus is one such condition. Research into this disease’s prevalence and impact among Hispanics is underway, but more attention may be warranted. There are almost 100 active clinical studies in the US targeting lupus currently listed in ClinicalTrials.gov and, of these, 15 address lupus nephritis. And while about 5% of ongoing or recently completed projects in the HSRProj database explicitly focus on Hispanic populations, only one, funded by the Patient-Centered Outcomes Research Institute, specifically addresses lupus. (You can see this study’s baseline measures and results on ClinicalTrials.gov.)

Perhaps a celebrity like Ms. Gomez publicly discussing her experience with lupus will spark more attention from both researchers and the public seeking to contribute to knowledge and cures.

After all, we are all both fundamentally unique and alike. Reducing—or better yet, eliminating—health disparities benefits us all.

Guest blogger Lisa Lang is Assistant Director for Health Services Research Information and also Head of NLM’s National Information Center on Health Services Research and Health Care Technology (NICHSR).

Photo credit (The Scales of Justice, top): Darius Norvilas [Flickr (CC BY-NC 2.0)] | altered background