Scientific Axes
Activity Report

Mardi 15 Février, 14h - Soutenance de Thèse de Matthias Gallé Print
Written by Pierre PETERLONGO   

Searching for Compact Hierarchical Structures in DNA  by means of the Smallest Grammar Problem

14h Amphithéatre, INRIA Rennes

Motivated by the goal of discovering hierarchical structures inside DNA sequences, we address the Smallest Grammar Problem, the problem of finding a smallest context-free grammar that generates exactly one sequence. This NP-Hard problem has been widely studied for applications like Data Compression, Structure Discovery and Algorithmic Information Theory.

We propose a formalisation of the Smallest Grammar Problem based on two  complementary optimisation problems: the choice of constituents of the final grammar and the choice of how to parse the sequence with these constituents. We name this last problem the ``Minimal Grammar Parsing'' problem and give a polynomial solution for it. This decomposition allows us to define a new complete and correct search space for the Smallest Grammar Problem. Based on this search space, we propose new algorithms able to return grammars 10% smaller than the state of the art on complete genomes.
Special care has been taken about the efficiency of the algorithms by considering different equivalence classes of repeats and by introducing an efficient in-place schema to update the suffix array data structure used to compute these words.

Regarding the mentioned applications, we consider the impact of the non-uniqueness of smallest grammars on Structure Discovery. We prove that the number of smallest grammars can be exponential in the size of the sequence and then analyse the stability of the discovered structures between minimal grammars for real-life examples. With respect to Data Compression, we extend our algorithms to use rigid patterns as words and achieve compression rate up to 25\ better compared to the previous best DNA grammar-based coder.

Alberto Apostolico (rapporteur)
Alexander Clark (rapporteur)
François Coste (examinateur)
Gabriel Infante-Lopez (directeur de thèse)
Jacques Nicolas (directeur de thèse)
Eric Rivals (examinateur)

< Prev   Next >

Symbiose Project Team - INRIA/Irisa © 2007 - 2008