Search engine for genomic sequencing data

Team and supervisors
Department / Team: 
Team Web Site: 
https://team.inria.fr/genscale/
PhD Director
Peterlongo Pierre
Co-director(s), co-supervisor(s)
Contact(s)
PhD subject
Abstract

Recent technological revolutions in genome sequencing offer an unprecedented opportunity to access the genomes of living organisms and to understand life mechanisms in fine detail at the genomic level. These technologies generate terabytes of sequence data every day, which are ultimately stored in public data banks. The International Nucleotide Sequence Database Collaboration Sequence Read Archive (SRA) now contains 10 petabytes of sequence data, and this volume doubles every two years. Such data banks house genomic resources from all kinds of living organisms, from viruses to large mammalian species. Yet, although this information is invaluable, it is currently impossible to query these databases. Providing a means to efficiently query these sequencing data thus appears essential to understanding genomic life mechanisms from a global viewpoint.

The main scientific hypothesis relies on alignment-free approaches based on k-mer (word of length *k*) frequencies, which provide fast approximations of sequence similarity. The key strategy is to develop a novel data structure for indexing k-mers, designed for fast key lookup and an extremely low memory footprint. This goal will be achieved by making use of probabilistic data structures such as the Bloom filter [1]. One approach is to apply our expertise in highly efficient minimal perfect hash functions [2] and to factorize redundant information shared between indexed read sets, following ideas proposed in the SSBT [3] approach. Additionally, efforts will be devoted to filtering the indexed data.
During the PhD, we will aim to produce working prototypes suitable for large-scale deployment. These prototypes will be deployed in a production environment at EMBL-EBI. Validation and proofs of concept will be performed on the published Tara Oceans metagenomic and metatranscriptomic read sets, which represent more than 500 billion reads (50 Tb).
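
To make the k-mer indexing idea above concrete, here is a minimal C++ sketch of a Bloom filter [1] that stores the k-mers of a read set and reports which fraction of a query's k-mers it may contain. It is only an illustration under stated assumptions, not the data structure the PhD is expected to produce: the class name `KmerBloomFilter`, its methods, the FNV-1a hash and the double-hashing scheme are all hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch only: names and hashing choices are assumptions,
// not part of the PhD subject.

// 64-bit FNV-1a hash over `len` characters, with a seed mixed into the basis.
inline std::uint64_t fnv1a(const char* data, std::size_t len, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

class KmerBloomFilter {
public:
    KmerBloomFilter(std::size_t num_bits, std::size_t num_hashes, std::size_t k)
        : bits_(num_bits, false), num_hashes_(num_hashes), k_(k) {
        assert(num_bits > 0 && num_hashes > 0 && k > 0);
    }

    // Insert every k-mer (substring of length k) of a read.
    void add_read(const std::string& read) {
        for (std::size_t i = 0; i + k_ <= read.size(); ++i) {
            insert(read.data() + i);
        }
    }

    // True if the k-mer may be present (false positives are possible),
    // false if it is definitely absent (no false negatives).
    bool may_contain(const std::string& kmer) const {
        return kmer.size() == k_ && contains(kmer.data());
    }

    // Fraction of the query's k-mers that the filter reports as present.
    double query_fraction(const std::string& query) const {
        if (query.size() < k_) return 0.0;
        const std::size_t total = query.size() - k_ + 1;
        std::size_t hits = 0;
        for (std::size_t i = 0; i < total; ++i) {
            if (contains(query.data() + i)) ++hits;
        }
        return static_cast<double>(hits) / static_cast<double>(total);
    }

private:
    // i-th bit position for a k-mer, derived from two hashes (double hashing).
    std::size_t position(const char* kmer, std::size_t i) const {
        const std::uint64_t h1 = fnv1a(kmer, k_, 0);
        const std::uint64_t h2 = fnv1a(kmer, k_, 0x9e3779b97f4a7c15ULL) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    void insert(const char* kmer) {
        for (std::size_t i = 0; i < num_hashes_; ++i) bits_[position(kmer, i)] = true;
    }

    bool contains(const char* kmer) const {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            if (!bits_[position(kmer, i)]) return false;
        }
        return true;
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
    std::size_t k_;
};
```

In such a scheme, one filter would be built per read set with `add_read`, and a query sequence would be considered a hit for that read set when `query_fraction` exceeds a chosen threshold, in the spirit of the search strategy of [3]. Memory use is fixed by `num_bits` regardless of how many k-mers are inserted, at the price of a false-positive rate that grows as the filter fills.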

 

Bibliography

[1] Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.

[2] Limasset, A., et al. (2017). Fast and scalable minimal perfect hashing for massive key sets. SEA 2017.

[3] Solomon, B., & Kingsford, C. (2016). Fast search of thousands of short-read sequencing experiments. Nature biotechnology, 34(3), 300.

Work start date: 
01/10/2019
Keywords: 
Data structures, algorithms, index, genomic, metagenomic, C++
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes