Lossless compression and its applications to large-scale genomic analyses

Seminar
Start date
End date
Location
IRISA Rennes
Room
Aurigny
Speaker
Diego Diaz (Helsinki University)

String compression is a central component in the analysis of massive genomic data. Numerous studies emphasize that we can drastically reduce costs if we factorize the redundant DNA patterns and operate in compressed space. In this sense, data structures and algorithms that rely on lossless compression to encode the strings offer the ideal solution, as they retain all the genomic information while using sublinear space. These attributes make it possible to implement a wide range of functionalities, maximizing the biological insights derived from the input. However, prevalent genomic tools often limit lossless compression to permanent storage, and much bioinformatics software opts for lossy data analysis techniques instead. These methods, which employ concepts such as k-mers, minimizers, sequence sketching, or de Bruijn graphs, are computationally inexpensive but entail information loss. The choice of lossy over lossless techniques in mainstream tools can be attributed partly to the fact that lossless data structures are expensive to construct, too static, and do not offer enough functionality for processing genomic data. Nevertheless, recent advances within the string algorithms community are addressing these challenges, and it is only a matter of time before lossless data structures become practical.

This presentation will provide an overview of lossless data compression methods in genomics. We will delve into classical statistical and dictionary-based techniques, exploring how researchers apply them in bioinformatics. We will also introduce recent compression-aware data structures and algorithms that balance compression ratio and scalability. Lastly, we will discuss the challenges of translating compression-aware data structures into practical software and explore the types of queries pertinent to genomic analyses.

Symbiose seminars: https://www.cesgo.org/symbiose/seminars/lossless-compression-and-its-ap…

For internal attendees