
Search engine for genomic sequencing data

Team and supervisors
Department / Team: 
Team website: 
Thesis supervisor
Pierre Peterlongo
Co-supervisor(s)
Thesis topic

The main objective is to produce a model and a prototype that let users query large collections of unassembled raw sequencing data directly and on the fly, in order to tap into the largest underexploited resource in the life sciences.

We are currently witnessing a profound knowledge revolution, driven by exponentially expanding sequence databases made possible by the continuously accelerating throughput of sequencing techniques. Sequencing data is accumulating faster than Moore’s Law, bringing fundamental new insights, conjectures, and understanding, with impacts in medicine, agronomy, and ecology. Today, the INSDC SRA raw data archive stores more than 10^16 nucleotides (about 10 petabases), in the form of short sequences (<1000 bp) that represent fragments from generally unknown genomic locations (the “reads”). However, the overwhelming majority of those sequences have only been analyzed within single projects, each addressing a small fraction of the total resource. It is therefore of primary importance to preserve this record of diversity for future studies and to develop technologies to interrogate these data. Moreover, providing fast access to the sum of all data would open the door to novel discoveries that a single read set, or a limited number of read sets, does not have the power to address.

Assignments:
The recruited person will design and propose new indexing schemes that scale to very large DNA collections (assembled or not) and allow input sequences of interest to be queried in real time. Existing methods, such as the Sequence Bloom Tree and the Bloom Filter Trie, index and compress such banks (losslessly or not). In this project, we will explore the novel idea of representing the bank as a global, incremental, compressed index, using a graph representation of all corrected reads from all read sets in the bank.
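To make the querying problem concrete, here is a minimal sketch of the simplest Bloom-filter-based approach: one filter of k-mers per read set, queried by the fraction of the query's k-mers it reports as present. This is neither the Sequence Bloom Tree nor the graph-based index proposed for the thesis. The names `BloomFilter`, `build_index`, and `query`, the sample names, and the parameter values (k = 5, 80% threshold, filter size) are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, no false negatives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a BLAKE2b digest.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every substring of length k of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(read_sets, k=5):
    """Build one Bloom filter of k-mers per read set (hypothetical layout)."""
    index = {}
    for name, reads in read_sets.items():
        bf = BloomFilter()
        for read in reads:
            for kmer in kmers(read, k):
                bf.add(kmer)
        index[name] = bf
    return index


def query(index, seq, k=5, threshold=0.8):
    """Return the read sets whose filter contains at least `threshold` of seq's k-mers."""
    query_kmers = list(kmers(seq, k))
    hits = []
    for name, bf in index.items():
        found = sum(kmer in bf for kmer in query_kmers)
        if query_kmers and found / len(query_kmers) >= threshold:
            hits.append(name)
    return hits


# Toy data: two read sets with made-up names and reads.
read_sets = {
    "sampleA": ["ACGTACGTGGA", "TTGACCA"],
    "sampleB": ["GGGGGCCCCC"],
}
index = build_index(read_sets)
hits = query(index, "ACGTACGT")
print(hits)
```

Because every 5-mer of the query occurs in a read of `sampleA`, that read set is always reported. Other read sets can only appear through Bloom filter false positives, which become rarer as the filters grow. A flat layout like this scans every filter for each query, which is the cost that tree-structured or graph-based indexes aim to reduce.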


Solomon, B., & Kingsford, C. (2016). Fast search of thousands of short-read sequencing experiments. Nature Biotechnology, (April 2015), 1–6. http://doi.org/10.1038/nbt.3442

Holley, G., Wittler, R., & Stoye, J. (2016). Bloom Filter Trie: An alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11(1). http://doi.org/10.1186/s13015-016-0066-8

Marchet, C., Lecompte, L., Limasset, A., Bittner, L., & Peterlongo, P. (2017). A resource-frugal probabilistic dictionary and applications in bioinformatics, 1–16. http://arxiv.org/abs/1605.08319

Bradley, P., Bakker, H. den, Rocha, E., McVean, G., & Iqbal, Z. (2017). Real-time search of all bacterial and viral genomic data. bioRxiv, 234955. http://doi.org/10.1101/234955


Keywords: 
indexing, string algorithms, data structures, NGS, TGS
IRISA - Campus universitaire de Beaulieu, Rennes