|
|
|
Mardi 15 Février, 14h - Soutenance de Thèse de Matthias Gallé |
|
|
Written by Pierre PETERLONGO
|
Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem14h Amphithéatre, INRIA Rennes
Motivated by the goal of discovering hierarchical structures inside
DNA sequences, we address the Smallest Grammar Problem, the problem of
finding a smallest context-free grammar that generates exactly one
sequence. This NP-Hard problem has been widely studied for applications
like Data Compression, Structure Discovery and Algorithmic Information
Theory.
We propose a formalisation of the Smallest Grammar Problem based on
two complementary optimisation problems: the choice of constituents of
the final grammar and the choice of how to parse the sequence with these
constituents. We name this last problem the ``Minimal Grammar Parsing''
problem and give a polynomial solution for it. This decomposition
allows us to define a new complete and correct search space for the
Smallest Grammar Problem. Based on this search space, we propose new
algorithms able to return grammars 10% smaller than the state of the
art on complete genomes.
Special care has been taken about the efficiency of the algorithms by
considering different equivalence classes of repeats and by introducing
an efficient in-place schema to update the suffix array data structure
used to compute these words.
Regarding the mentioned applications, we consider the impact of the
non-uniqueness of smallest grammars on Structure Discovery. We prove
that the number of smallest grammars can be exponential in the size of
the sequence and then analyse the stability of the discovered structures
between minimal grammars for real-life examples. With respect to Data
Compression, we extend our algorithms to use rigid patterns as words and
achieve compression rate up to 25\ better compared to the previous
best DNA grammar-based coder.
Jury:
Alberto Apostolico (rapporteur)
Alexander Clark (rapporteur)
François Coste (examinateur)
Gabriel Infante-Lopez (directeur de thèse)
Jacques Nicolas (directeur de thèse)
Eric Rivals (examinateur)
|
|