Burn Away the Old, Make Room for the New?

Preserving ideas and how they’re presented

Did you know that, as part of their New Year’s celebration, Icelanders set off more fireworks per person—about 3 kilos (over 6.5 pounds)—than anywhere else in the world? It’s all part of burning away the old year to welcome in the new.

Another Icelandic tradition: Lighting candles on New Year’s Eve to help the hidden people (Huldufólk, i.e., elves) find their way to new homes, a journey encouraged with the gentle bidding, “Come, those who wish. Stay, those who wish. Go, those who wish, harmless to me and mine.”

Out with the old, in with the new works in many parts of life, including the turn of the calendar, but in libraries, we usually seek to retain the old while acquiring the new. Most libraries are, in fact, committed to preserving the knowledge and information that has gone before, and NLM’s enabling legislation establishes preservation as one of our key functions, explicitly stating that we are to “acquire and preserve books, periodicals, prints, films, recordings, and other library materials pertinent to medicine.”

I understand the basics of preservation—devising efficient and sustainable ways to stabilize and retain materials and to ensure permanent access to them—but I gained a deeper insight into the why of preservation during my holiday travels to Reykjavik, thanks to the visionary library staff at the Nordic House.

As I listened to Margrét Asgersdottir, the Nordic House librarian, I began to see preservation in a whole new light.

Two women stand next to a Christmas tree made from stacked books
Margrét Asgersdottir and I at the Nordic House Library in Reykjavik

Operated by the Nordic Council of Ministers, the Nordic House’s small lending library arose out of a multi-national effort to ensure the preservation of Nordic languages. As a result, their print collection comprises works in six of the seven Nordic languages—Danish, Faroese, Finnish, Norwegian, Sámi, and Swedish. The library holds no books in Icelandic—the seventh Nordic language—with Iceland’s public libraries responsible for collecting those materials.

Margrét spoke with great pride of their commitment to preserve not only the ideas that emerged from Nordic writers, but the very way they were created as well. To Margrét and the Nordic House Library, preservation includes ensuring the vernacular remains intact, protecting both the content and the manner of expression. As she explained, the complexity of human thought and communication requires that we consider the impact and insights gleaned from both what is said and how it is said. Moreover, preserving the how may have implications far beyond an enhanced understanding of the what. It may help keep a culture alive, connect ideas across generations and fields, and reveal nuances detectable only through specific vocabulary or sentence structure.

The idea that we must ensure the permanence of both thought and expression may be familiar to many librarians, but for me, it was an incredible and unexpected lesson.

This idea provides me, as the NLM Director, with solid justification to invest in the complex process of preservation and will help me better understand the issues at play regarding how and what to preserve. As my colleagues in our History of Medicine Division know well, it is sometimes necessary to preserve an artifact in its original form—be it the laboratory notebooks of Nobel laureate Marshall Nirenberg or hand-scripted Islamic medical manuscripts. There is meaning to be found in the union of thought and form.

As a result, I will never again walk through NLM’s incunabula room and simply marvel at the beautiful collection. Thanks to my time in Iceland, I bring a new and deeper commitment to preserving both the form and content of the world’s historical knowledge of health and medicine.

Photo credit (Fireworks Reykjavik 2013, top): Robert Parviainen [Flickr (CC BY-NC-SA 2.0) | cropped]

Exploring the Brave New World of Metagenomics

See last week’s post, “Adventures of a Computational Biologist in the Genome Space,” for Part 1 of Dr. Koonin’s musings on the importance of computational analysis in biomedical discovery.

While the genomic revolution rolls on, a new one has been quietly brewing over the last decade or so, only to take over the science of microbiology in the last couple of years.

The name of this new game is metagenomics.

Metagenomics is concerned with the complex communities of microbes.

Traditionally, microbes have been studied in isolation, but to do that, a microbe or virus has to be grown in a laboratory. While that might sound easy, only 0.1% of the world’s microbes will grow in artificial media, with the success rate for viruses even lower.

Furthermore, studying microbes in isolation can be somewhat misleading because they commonly thrive in nature as tightly knit communities.

Metagenomics addresses both problems by exhaustively sequencing all the microbial DNA or RNA from a given environment. This powerful, direct approach immensely expands the scope of biological diversity accessible to researchers.

But the impact of metagenomics is not just quantitative. Over and again, metagenomic studies—because they look at microbes in their natural communities and are not restricted by the necessity to grow them in culture—result in discoveries with major biological implications and open up fresh experimental directions.

In virology, metagenomics has already become the primary route to new virus discovery. In fact, in a dramatic break from tradition, such discoveries are now formally recognized by the International Committee on Taxonomy of Viruses. This decision all but officially ushers in a new era, I think.

Here is just one striking example that highlights the growing power of metagenomics.

In 2014, Rob Edwards and colleagues at San Diego State University achieved a remarkable metagenomic feat. By sequencing multiple human gut microbiomes, they managed to assemble the genome of a novel bacteriophage, named crAssphage (for cross-Assembly). They then went on to show that crAssphage is, by a wide margin, the most abundant virus associated with humans.

This discovery was both a sensation and a shock. We had been completely blind to one of the key inhabitants of our own bodies—apparently because the bacterial host of the crAssphage would not grow in culture. Thus, some of the most common microbes in our intestines, and their equally ubiquitous viruses, represent “dark matter” that presently can be studied only by metagenomics.

But the crAssphage genome was dark in more than one way.

Once sequenced, it looked like nothing in the world. For most of its genes, researchers found no homologs in sequence databases, and even those few homologs identified shed little light on the biology of the phage. Furthermore, we had been unable to establish any links to other phages, nor could we tell which proteins formed the crAssphage particle.

Such results understandably frustrate experimenters, but computational biologists see opportunity.

A few days after the crAssphage genome was published, Mart Krupovic of Institut Pasteur visited my lab, where we attempted to decipher the genealogies and functions of the crAssphage proteins using all computational tools available to us at the time. The result was sheer disappointment. We detected some additional homologies but could not shed much light on the phage evolutionary relationships or reproduction strategy.

We moved on. With so many other genomes to analyze, crAssphage dropped from our radar.

Then, in April 2017, Anca Segall, a sabbatical visitor in my lab, invited Rob Edwards to give a seminar at NCBI about crAssphage. After listening to Rob’s stimulating talk, and realizing that the genome of this remarkable virus remained a terra incognita, we could not resist going back to the crAssphage genome armed with some new computational approaches and, more importantly, vastly expanded genomic and metagenomic sequence databases.

This time we got better results.

After about eight weeks of intensive computational analysis by Natalya Yutin, Kira Makarova, and myself, we had fairly complete genomic maps for a vast new family of crAssphage-related bacteriophages. For all these phages, we predicted with good confidence the main structural proteins, along with those involved in genome replication and expression. Our work led to a paper we recently published in the journal Nature Microbiology. We hope and believe our findings provide a roadmap for experimental study of these undoubtedly important viruses.

Apart from the immediate importance of the crAss-like phages, this story delivers a broader lesson. Thanks to the explosive growth of metagenomic databases, the discovery of a new virus or microbe does not stop there. It brings with it an excellent chance to discover a new viral or microbial family. In addition, analyzing the gene sequences can yield interesting and tractable predictions of new biology. However, to take advantage of the metagenomic treasure trove, we must creatively apply the most powerful sequence analysis methods available, and novel ones may be required.

Put another way, if you know where and how to look, you have an excellent chance to see wonders.

As a result, I cannot help being unabashedly optimistic about the future of metagenomics. Fueled by the synergy between increasingly high-quality, low-cost sequencing, improved computational methods, and emerging high-throughput experimental approaches, the prospects appear boundless. There is a realistic chance we will know the true extent of the diversity of life on earth and get unprecedented insights into its ecology and evolution within our lifetimes. This is something to work for.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Adventures of a Computational Biologist in the Genome Space

Guest post by Dr. Eugene Koonin, NLM National Center for Biotechnology Information.

More than 30 years ago, when I started my research in computational biology (yes, it has been a while), it was not at all clear that one could do biological research using computers alone. Indeed, the common perception was that real insights into how organisms function and evolve could only be gained in the lab or in the field.

As so often happens in the history of science, that all changed when a new type of data arrived on the scene. Genetic sequence information, the blueprint for building all organisms, gave computational biology a foothold it has never relinquished.

As early as the 1960s, some prescient researchers, first among them Margaret Dayhoff at Georgetown University, foresaw genetic sequences becoming a key source of biological information, but this was far from mainstream biology at the time. Through the 1980s, however, the trickle of sequences grew into a steady stream, and by the mid-1990s, the genomic revolution was upon us.

I still remember as if it were yesterday the excitement that overwhelmed me and my NCBI group in the waning days of 1995, when J. Craig Venter’s team released the first couple of complete bacterial genomes. Suddenly, the sequence analysis methods on which we and others had been working in relative obscurity had a role in trying to understand the genetic core of life. Soon after, my colleague, Arcady Mushegian, and I reconstructed a minimal cellular genome that attracted considerable attention, stimulating experiments that confirmed how accurate our purely computational effort had been.

Now, 22 years after the appearance of those first genomes, GenBank and related databases contain hundreds of thousands of genomic sequences encompassing millions of genes, and the utility and importance of computational biology are no longer a matter of debate. Indeed, biologists cannot possibly study even a sizable fraction of those genes experimentally, so, at the moment, computational analysis provides the only way to infer their biological functions.

Computational approaches have made possible many crucial biological discoveries. Two examples in which I and my NCBI colleagues have been actively involved are elucidating the architecture of the BRCA1 protein that, when impaired, can lead to breast cancer, and predicting the mode of action of CRISPR systems. Both findings sparked extensive experimentation in numerous laboratories all over the world. And, in the case of CRISPR, those experiments culminated in the development of a new generation of powerful genome-editing tools that have opened up unprecedented experimental opportunities and are likely to have major therapeutic potential.

But science does not stand still. Quite the contrary, it moves at an ever-accelerating pace and is prone to taking unexpected turns. Next week, I’ll explore one recent turn that has set us on a new path of discovery and understanding.

Eugene Koonin, PhD, has served as a senior investigator at NLM’s National Center for Biotechnology Information since 1996, after working for five years as a visiting scientist. He has focused on the fields of computational biology and evolutionary genomics since 1984.

Have yourself a…

…wonderful holiday? …Merry Christmas? …Joyous Kwanzaa? …Happy New Year?

Regardless of which holiday you celebrate, I invite you to join me in extending heartfelt greetings to our families and friends, those with whom we work and those whom we serve.

I am mindful that my “Merry Christmas” might not evoke in others memories similar to my own, of childhood delights and family time. And I try to be especially aware of others this time of year by respecting their experiences and traditions.

In that spirit, I encouraged you last year to commit yourself, at least once in the next year, to learn of the traditions of one of your colleagues—this will extend our holiday greetings year round!

So, what did you learn?

Personally, I made it a point to learn about NLM’s people and divisions located off the main NIH campus, whether in Bethesda, Rockville, or Virginia. (Meeting those of you working from afar or on alternative schedules is next year’s challenge!)

On one recent off-campus visit, I spent time with the Extramural Programs (EP) Division, which administers our grants program, including the university-based biomedical informatics and data science research training programs.

They are a dedicated and creative bunch. Not only do they manage an annual budget of almost $75 million and review, in conjunction with expert panels, over 900 grant proposals each year, they can also seriously decorate for the holidays.

When I visited a few weeks ago, their office hallways were alight with bright Christmas greetings. The artful application of crepe paper, construction paper, and shiny ribbons gave me the sense of walking past a row of Christmas trees, though I’m sure the pine scent was all in my mind.

But these are grants folks. Competition is in their blood. So, many of EP’s 19 staff have turned holiday decor into sport, with a race to see whose decorations go up first and, of course, which are the most attractive. That’s a debate I’m staying out of, but I do appreciate how it adds to the sense of celebration and festivity in the workplace and what it tells me about them, their camaraderie, spirit, and good humor.

So, what did you learn this year? Did you discover a colleague who celebrates Diwali, the Hindu festival of lights? Did you sit down to a Passover seder or break the Yom Kippur fast with someone? Or did you learn about a co-worker’s non-religious traditions, like a family reunion, an annual camping trip, or their favorite Thanksgiving dishes?

Such moments, when we step outside work talk and learn about each other, help forge positive connections and mutual respect, which, ultimately, are true hallmarks of the season.

Wishing you all good things, and may a sense of connection, richness, and celebration stay with you through the coming year!

Happy One Billion, PubMed Central!

The odometer on PubMed Central® turned over a slew of zeroes in October, when someone somewhere retrieved the ONE BILLIONTH article in 2017 from this free, full-text archive.

That’s one billion articles retrieved in less than 10 months, a breakneck pace on par with the iPhone App Store’s one billionth download, which took 9 months and 12 days back in 2009.

Astounding!

What makes PubMed Central (PMC) so popular?

Quality and quantity at a great price—all brought to you by a powerhouse partnership with publishers and research funders dedicated to making science more open and accessible.

PMC provides free permanent electronic access to the full text of over 4.6 million peer-reviewed biomedical and life sciences journal articles. It’s a digital counterpart to NLM’s extensive print journal collection, with the added advantage of being available 24/7 from around the globe.

Current articles follow one of two paths to get into PMC: they are deposited either by the journal publishers or by the authors themselves.

The first path delivers the lion’s share of articles to PMC. Over 2,400 journals have signed agreements to deposit directly to PMC the final published versions of some or all of their articles.

Authors, on the other hand, are commonly driven to deposit their peer-reviewed manuscripts by their agencies’ public access policies, which call for making federally funded research freely available to the public, generally within 12 months of publication.

At this point, aside from NIH, we’ve got 11 other organizations whose funded authors contribute a range of scientific findings to PMC, from sister agencies within HHS (Administration for Community Living, Agency for Healthcare Research and Quality, CDC, FDA, and the Office of the Assistant Secretary for Preparedness and Response) to other federal bodies (EPA, NASA, NIST, and the VA) to private research funders committed to information sharing and transparency (Bill & Melinda Gates Foundation and the Howard Hughes Medical Institute). The Department of Homeland Security will join this list early next year. In addition, our partner across the pond, Europe PMC, delivers content from 28 international funders.

All of that recent journal content is enriched by a deep well of historical articles spanning 200 years of biomedical research. Funding by the Wellcome Trust has enabled us to scan thousands of complete back issues of historically significant biomedical journals and make them freely available through PMC. That translates to more than 1.26 million articles, with more to come.

The result is an impressive collection of biomedical knowledge, most peer-reviewed and all freely available (in some cases, even for text mining).

But as they say on TV, that’s not all.

As of October 2017, researchers funded by our partners can now deposit into PMC data and other supplementary files that support their published findings. It’s a move intended to nurture transparency, foster open science, and enhance reproducibility, while also facilitating data reuse—all key elements to the future of data-driven discovery we envision.

NLM is proud to work with the scientific community to bring this exciting scientific resource to the world.

So, congratulations, PubMed Central staff and every publisher and contributor who makes his or her work available this way! We couldn’t have reached this major milestone without you, and we look forward to reaching many more together.