Computational methods for phylogenetic compression

Publié le ven 19/01/2024 - 19:39

Equipe

GENSCALE

Site web de l'équipe

https://team.inria.fr/genscale/

Date de début de thèse (si connue)

2024-09-01

Lieu

Rennes

Unité de recherche

IRISA - UMR 6074

Description du sujet de la thèse

The emergence of modern DNA sequencing methods revolutionized biology, epidemiology, and medicine. However, the rapid growth of sequencing data, surpassing Moore’s law, makes it increasingly difficult to search these data using the traditional methods, such as Basic Local Alignment Search Tool (BLAST) [1] and its successors. To keep up with the volumes of data being generated, we need novel compressive approaches that could store and analyze data directly in the compressed domain. However, while it has been shown that genomic data have particular geometrical properties that guarantee the existence of such efficient algorithms [2], [3], the compression of exponentially growing genome collections in a scalable manner remains an unsolved challenge.

A recent technique called phylogenetic compression provided a possible solution [4]. The technique builds upon the insight that the redundancies in genomic data have in fact a highly predictable structure, mirroring the underlying evolutionary processes. As the evolutionary history of genomes can be rapidly estimated, this can be embedded directly into the compression and search algorithms, in order to increase their efficiency and scalability. The preliminary results demonstrated remarkable improvements over the state-of-the-art compression methods, and also suggested its applicability to many diverse data structures across bioinformatics [4].

The goal of this PhD thesis will be developing novel computational methods and mathematical tools for phylogenetic compression of data structures. The student will combine skills in algorithm design and analysis, discrete mathematics and probability, with a taste for programming and experimentation with big genomic data. One focus of the thesis will be the development of a mathematical framework for formalizing phylogenetic compression as an optimization problem, with the use of data structure modeling using evolutionary models. Another direction will focus on the development of phylogenetically compressed variants of modern indexing structures, such as those based on de Bruijn graphs, Bloom filters, and sketches. Finally, an important aspect of the thesis will be efficient implementations of the proposed algorithms and experimental evaluations using some of the largest existing genome databases. The thesis is expected to provide novel computational solutions to the current challenges and its results may have a direct impact in a range of applications, starting from search engines for sequence data to infectious disease diagnostics in remote locations.

Bibliographie

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” J. Mol. Biol., vol. 215, no. 3, pp. 403–410, Oct. 1990.

[2] P.-R. Loh, M. Baym, and B. Berger, “Compressive genomics,” Nat. Biotechnol., vol. 30, no. 7, pp. 627–630, Jul. 2012.

[3] Y. W. Yu, N. M. Daniels, D. C. Danko, and B. Berger, “Entropy-scaling search of massive biological data,” Cell Syst, vol. 1, no. 2, pp. 130–140, Aug. 2015.

[4] K. Břinda et al., “Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression,” bioRxiv, Apr. 2023, doi: 10.1101/2023.04.15.536996.

Liste des encadrants et encadrantes de thèse

BRINDA, Karel

Type d'encadrement

2e co-directeur.trice (facultatif)

Unité de recherche

IRISA

Equipe

GENSCALE

PETERLONGO, Pierre

Type d'encadrement

Directeur.trice de thèse

Unité de recherche

IRISA

Equipe

GENSCALE

Contact·s

Nom

BRINDA, Karel

karel.brinda@inria.fr

Mots-clés

bioinformatics, data structures, indexing, compression, data search