Comprehensive collections of genomes have been instrumental in sequencing-based studies of life; however, storing, transmitting, and analyzing them has become challenging as sequence data grow exponentially. This raises the question of how to design efficient computational solutions for storing and indexing large data sets, such as the recently created corpus of 661k bacterial genomes uniformly assembled from the European Nucleotide Archive. Here, we present a method for large-scale lossless compression and search of microbial collections that uses the Tree of Life as a biological prior on their redundancy structure. Using state-of-the-art tools and databases from population genomics and metagenomics, our method infers the geometrical structure of a given collection, which then guides data compression with standard approaches. We demonstrate the applicability of our approach to large collections of genome assemblies, de Bruijn graphs, and k-mer indexes, and show that it enables BLAST-like alignment against the 661k data set on a standard desktop computer within several hours. The optimization of data structures using the Tree of Life has broad applications across computational biology and provides a fundamental design principle for future genomics infrastructure.
Karel Břinda (Inria)
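
The idea of letting a tree guide a standard compressor can be illustrated with a toy experiment. The sketch below is not the pipeline described in the abstract: the simulated genomes, the hand-built `toy_tree`, the helper names, and the choice of `zlib` (which only finds matches within a 32 KiB window) are all assumptions made for illustration. It only shows that placing related sequences next to each other, following a left-to-right traversal of a tree, can make a general-purpose lossless compressor more effective.

```python
import random
import zlib

BASES = "ACGT"

def random_genome(rng, length):
    """Return a random DNA string (a stand-in for a clade ancestor)."""
    return "".join(rng.choice(BASES) for _ in range(length))

def mutate(rng, genome, rate):
    """Resample each base with probability `rate` (substitutions only;
    the resampled base may coincide with the original one)."""
    return "".join(rng.choice(BASES) if rng.random() < rate else base
                   for base in genome)

def leaf_order(tree):
    """Left-to-right leaf traversal; leaves are names, inner nodes are tuples."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaf_order(child)]

def compressed_size(names, genomes):
    """Concatenate genomes in the given order, compress, verify losslessness."""
    data = "\n".join(genomes[name] for name in names).encode("ascii")
    packed = zlib.compress(data, 9)
    assert zlib.decompress(packed) == data  # the round trip must be exact
    return len(packed)

rng = random.Random(0)
genomes = {}
clades = []
for clade_name in "ABCD":
    # Each hypothetical clade has its own 20 kb ancestor and four descendants.
    ancestor = random_genome(rng, 20_000)
    members = []
    for i in range(4):
        name = f"{clade_name}{i}"
        genomes[name] = mutate(rng, ancestor, 0.01)
        members.append(name)
    clades.append(tuple(members))
toy_tree = tuple(clades)  # assumed to be known; in practice it must be inferred

# Tree-guided order keeps relatives adjacent: A0, A1, A2, A3, B0, ...
tree_order = leaf_order(toy_tree)
# Interleaved order separates relatives by 80 kb: A0, B0, C0, D0, A1, ...
interleaved = [clade[i] for i in range(4) for clade in clades]

size_tree = compressed_size(tree_order, genomes)
size_interleaved = compressed_size(interleaved, genomes)
print(f"tree order:  {size_tree} bytes")
print(f"interleaved: {size_interleaved} bytes")
```

In the interleaved order, each genome's closest relative lies outside the compressor's match window, so almost no redundancy is exploited; in the tree order, relatives sit within the window. Compressors with larger windows are less sensitive at this toy scale, but the same effect reappears once the collection is much larger than the window.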