
Mardi 15 Février, 14h  Soutenance de Thèse de Matthias Gallé 

Written by Pierre PETERLONGO

Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem14h Amphithéatre, INRIA Rennes
Motivated by the goal of discovering hierarchical structures inside
DNA sequences, we address the Smallest Grammar Problem, the problem of
finding a smallest contextfree grammar that generates exactly one
sequence. This NPHard problem has been widely studied for applications
like Data Compression, Structure Discovery and Algorithmic Information
Theory.
We propose a formalisation of the Smallest Grammar Problem based on
two complementary optimisation problems: the choice of constituents of
the final grammar and the choice of how to parse the sequence with these
constituents. We name this last problem the ``Minimal Grammar Parsing''
problem and give a polynomial solution for it. This decomposition
allows us to define a new complete and correct search space for the
Smallest Grammar Problem. Based on this search space, we propose new
algorithms able to return grammars 10% smaller than the state of the
art on complete genomes.
Special care has been taken about the efficiency of the algorithms by
considering different equivalence classes of repeats and by introducing
an efficient inplace schema to update the suffix array data structure
used to compute these words.
Regarding the mentioned applications, we consider the impact of the
nonuniqueness of smallest grammars on Structure Discovery. We prove
that the number of smallest grammars can be exponential in the size of
the sequence and then analyse the stability of the discovered structures
between minimal grammars for reallife examples. With respect to Data
Compression, we extend our algorithms to use rigid patterns as words and
achieve compression rate up to 25\ better compared to the previous
best DNA grammarbased coder.
Jury:
Alberto Apostolico (rapporteur)
Alexander Clark (rapporteur)
François Coste (examinateur)
Gabriel InfanteLopez (directeur de thèse)
Jacques Nicolas (directeur de thèse)
Eric Rivals (examinateur)

