Modular reference indexing with the de Bruijn graphs: overview and challenges

Séminaire

Date de début

jeu 20/10/2022 - 10:30

Date de fin

jeu 20/10/2022 - 12:00

Lieu

IRISA Rennes

Salle

Aurigny

Orateur

Giulio Pibiri (Università Ca' Foscari Venezia)

Département principal

D7 - Gestion des données et de la connaissance

The de Bruijn graph (dBG) has been an object of study in Bioinformatics for decades, and along with its many variants, has recently proven an effective tool in both reference-based and reference-free sequence indexing. In the former case, a fundamental operation is the “locate” query, where one seeks to recall the set of reference positions for each k-mer. Such indexes can serve as the basis for fast mapping/alignment algorithms. In the latter case, one can use the colored dBG as a representation of the k-mer membership among a collection of samples. The index can be used to answer the likely membership of a query string among the indexed samples — the so-called “experiment discovery” problem.

In this talk we overview a modular framework for constructing reference-based and reference-free dBG indexes.

We introduce the concept of reference tilings for reference-based indexes, which encode how “tiles” (e.g., k-mers, unitigs, or simplitigs) spell out the constituent references. We show how the index can be viewed as the composition of two separate mappings: (1) a k-mer-to-tile mapping and (2) a tile-to-reference mapping. We therefore foresee that any algorithmic effort for the reference indexing problem should be devoted to improving the efficiency of these two modular components.

While several efficient solutions are already available for the first mapping, options to represent the second mapping have still to be investigated in depth. This is a challenging data structuring problem since, as reference indexes grow in size, the tile-to-reference mapping tends to dominate the overall index size. Options to be explored might include: custom compression of reference positions, sampling schemes, and approximate schemes with false positives.