Sharing Small Data Files for More Analysis Power

[Image: the double helix of DNA]

Guest post by Ashley Hintz, curator for NLM NCBI’s Sequence Read Archive.

Imagine the rows upon rows of books in the stacks of a research library. Then imagine how much information is stored in those books.

An organism’s entire genome is the equivalent of that large library with its multiple floors of loaded shelves. It is densely packed with information.

Just as reference librarians specialize in locating specific information within the library, bioinformaticians specialize in working through, analyzing, and understanding genomes.

And as librarians determine how best to search the online catalog to locate a specific item within the library, bioinformaticians write computer scripts to filter the massive amount of genomic data and identify the parts most relevant to particular research questions. Such filtering creates smaller data files that are easier to analyze, visualize, and share with colleagues.

Unfortunately, many researchers don’t start with the smaller, filtered genomic data sets.

These researchers are stuck believing they must always begin their analysis from the full genome, even though most of the genome’s elements have nothing to do with their research questions.

The belief that you must have the original raw sequence data to analyze it correctly means that large files (sometimes terabytes) get moved around on a regular basis, or, given their size, never move at all, leaving access to them limited.

It’s the equivalent of checking out the whole library when a couple books will answer your questions.

Genomes can be filtered down to SNPs (single nucleotide polymorphisms) in a VCF file, to collections of genes, or to read counts for gene expression data. All of these smaller data files have practical use cases and deliver what researchers are commonly interested in.
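To make the idea concrete, here is a minimal sketch of that kind of filtering script in Python, assuming a plain-text VCF and a hypothetical set of rsIDs of interest; a production pipeline would more likely reach for a tool like bcftools or a library such as pysam.

```python
# A minimal sketch of filtering a VCF down to a set of SNPs of interest.
# The file names and rsIDs are hypothetical placeholders for this example.

TARGET_SNPS = {"rs123", "rs456", "rs789"}  # hypothetical rsIDs

with open("genome.vcf") as full, open("filtered.vcf", "w") as subset:
    for line in full:
        if line.startswith("#"):
            subset.write(line)   # keep all header lines intact
            continue
        fields = line.split("\t")
        rsid = fields[2]         # column 3 of a VCF data row is the ID
        if rsid in TARGET_SNPS:
            subset.write(line)   # keep only the variants of interest
```

The output is a valid, much smaller VCF: the headers are preserved, so the filtered file can be loaded by the same tools that read the original.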

Current technology also makes sharing these filtered data files incredibly easy. Given their small size (mere gigabytes), it’s more on the scale of sharing pictures of your children or pets.

Sounds good, but what can one accomplish with these filtered genome datasets?

One example: perinatal screening for genes and/or SNPs that can potentially cause disease. Once the entire genome or exome is sequenced, the doctor can examine a VCF file of SNPs, filter that file further using data about the family’s medical history or the infant’s clinical symptoms, and then transfer the data files easily to a specialist for a second opinion. In these cases, the smaller data file becomes a powerful diagnostic tool.
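As a rough illustration of that filtering step, the sketch below narrows a VCF to variants falling within a gene panel suggested by the family’s history. The file names and the BED file of panel regions are assumptions made for this example, not part of any real screening workflow.

```python
# Narrow a VCF to variants inside a hypothetical gene panel.
# Panel regions come from a BED file (chrom, start, end), which uses
# 0-based half-open coordinates; VCF positions are 1-based.

panel = []  # list of (chrom, start, end) intervals
with open("family_history_panel.bed") as bed:
    for line in bed:
        chrom, start, end = line.split("\t")[:3]
        panel.append((chrom, int(start), int(end)))

def in_panel(chrom, pos):
    # convert the 1-based VCF position to 0-based for the BED comparison
    return any(c == chrom and s <= pos - 1 < e for c, s, e in panel)

with open("infant_exome.vcf") as full, open("panel_variants.vcf", "w") as out:
    for line in full:
        if line.startswith("#"):
            out.write(line)      # keep the header
        else:
            chrom, pos = line.split("\t")[:2]
            if in_panel(chrom, int(pos)):
                out.write(line)  # keep variants in the panel regions
```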

A second example: gene expression data used in research. Instead of each scientist recalculating read counts (such as FPKMs), one researcher can share smaller analysis files with others. The smaller files can then be used to identify and visualize meaningful differences in expression between genes or samples. This sharing is not only more efficient; it also facilitates the collaboration necessary to tackle such significant and complex medical issues as obesity, Alzheimer’s, and autoimmune diseases, all of which show differences in the expression levels of certain genes.
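For instance, a collaborator who receives a shared expression table, assumed here to be a hypothetical tab-separated file of FPKM values with one row per gene and one column per sample, can flag large differences with nothing more than the Python standard library:

```python
# A sketch of screening a shared FPKM table for large expression
# differences between two samples; file name and column names are
# hypothetical placeholders.

import csv
import math

with open("fpkm_table.tsv") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene = row["gene"]
        a = float(row["sample_A"])
        b = float(row["sample_B"])
        # log2 fold change between the two samples, with a pseudocount
        # of 1 so genes with zero reads don't divide by zero
        log2_fc = math.log2((a + 1.0) / (b + 1.0))
        if abs(log2_fc) >= 1.0:  # at least a two-fold difference
            print(f"{gene}\t{log2_fc:.2f}")
```

A file like this is megabytes rather than terabytes, which is precisely what makes the sharing described above practical.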

These examples just scratch the surface of the power behind filtering genome data for data analysis, visualization, and sharing.

With genome sequencing growing more affordable, the amount of sequence data is expected to grow more rapidly, so we need to change the mindset that we must start with the whole library to see results. Think “focused” instead of “complete,” or, as the librarians like to say, “precision over recall.”

It’s one more example of where less, most definitely, can be more.

Ashley Hintz

Curator, Sequence Read Archive, National Center for Biotechnology Information, NLM
