Home
Scientific Axes
Members
Publications
Software
Collaborations
Activity Report
Seminars
Positions





Print

Modelling sequence/structure relationships


Two lines of research are carried out within the framework of linguistic analysis of sequences. The first one aims at the expression of complex models on sequences. This will serve biologists to both validate his/her model with respect to whole genomes or set of sequences and to find new candidates in sequence data banks.
The second complementary line of research aims at the automatic inference of such models from a set of sequences sharing a functional or structural property. In addition to finding new candidates, the goal is then to assist the biologist to gain new knowledge on the sequence family by the discovery of explicit models showing up the important parts of the sequences and their organization.

Finding modules in sequences


Participants: Rumen Andonov, François Coste, Jacques Nicolas, Dominique Lavenier, Israël-César Lerman, Anne Siegel, Basavanneppa Tallur, Sebastien Tempel, Phillipe Veber, Christine Rousseau, Grégory Ranchy.

J. Nicolas coordinates the national ANR project Modulome that aims at modelling the structure of genomes in terms of assembly of «modules» that may be copied and move inside or between genomes. This is supported by three applications on genomic mobile elements in cooperation with URGI/Inra Versailles, LME/Ifremer Brest and LEPG/CNRS Tours. We have worked this year on the characterization of two types of modules. Corresponding softwares have been transferred on the bioinformatics platform.
The first study concerned the analysis of the dynamics of transposable element Helitron, a recently discovered type of transposon most widely spreaded in the genome of Arabidopsis thaliana. It has been concluded by the defense of S. Tempel’s PhD. The thesis studies three relations between helitrons and their host genome: their mode of invasion, the modularity of their internal sequence and their impact on close genes. We have produced a formal grammatical model of these elements, made of two extremities separated by a highly variable sequence of bounded size that led to the proposition of a new nomenclature of helitrons in Arabidopsis based on their extremities instead of their global sequence. A method and a software, DomainOrganizer, allows to establish the domains’ composition. It detects first domains’ borders from a multiple alignment and provides a list of putative domains. From this list it searches an optimal coding of sequences in the family as concatenation of sequences extracted in the list. We formulate the coding of the family as a (discrete) optimization problem, and express it as a 0/1 linear program. Sequences are then clustered and visualized with respect to domains.

   Analysis of CRISPR.
The second study concerned the analysis of an intriguing genetic structure in bacteria and archaebacteria, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats). These are formed by a repetitive skeleton including genetic material that seems imported from viruses and plasmid. CRISPR are likely to represent a fundamental immune system in prokaryotes. However, a systematic study of this family was in need of relevant models. We have formalized an important and general notion in this respect, locality, which delimits with a single parameter the distribution of occurrences of repeated units in a sequence. We have also improved our genome sequence visualization method, Pygram, which is now able to display the hierarchical organization of repeated structures on multiple genomic sequences and to synchronize the search on these sequences.

Matching sequences with automata and logical grammars


Participants: François Coste, Jacques Nicolas, Catherine Belleannée, Pierre Peterlongo, Thibaut Henin, Goulven Kerbellec, Sébastien Tempel, Jessie Mahé.

    Fractal structures inside DNA.
The Chaos Game representation has been used by several laboratories to get a global view of repeats in genomes. Each DNA sequence generated nice pictures with different non random fractal structures. A fundamental question is to know "where do these structures come from?", and "how can we caracterize them?".
We have shown that observed structures come from the presence of regular languages in DNA sequences. We have also proposed a method to quantify this presence and to characterize these structures with probabilistic automata. We have shown that this structures are in fact sufficient to discriminate genomes. Finally, the deeper understanding of the relation between the image level and the language level led us to introduce some variants on different problems.

    Beyong regular patterns.
An ambitious platform to search for complex models within both DNA and protein sequences is under development. It is based on previous works made within the team in order to propose an expressive language (Stan and Wapam) that goes beyond pattern matching in biological sequences and study modelling needs of biologists at the level of whole genomes. Project’s concerns are expressiveness, efficiency and ergonomy. Wapam has been extended to handle the full expressiveness of automata. This new version, named Wascan will be accessible on the GenOuest bioinformatics platform. It can already be used through the Protomata-Scan interface on the GenOuest platform to scan databases with automata produced by Protomata-Learner.
Our main research focus this year was on the specification of the modelling language. The language, called Logol allows writing a particular form of Definite Clause Grammars, namely String Variable Grammars (SVG) in the line of Searls’ work. We did deeply revisit SVG modelling, notably including in our models the fact that observed sequences are variants that have evolved from a common ancestor. The difficulty of modelling and detecting such repeats is that the reference ancestor sequence may well not appear at all in the analysed sequence. (e.g. AATT may be an observed string made of two copies, AA and TT, of a same ancestor TA). Formally, Logol is a dedicated constrained string language that considers two levels for chains: a level which designates abstract chains, and a level which designate concrete subsequences.

Protein structure alignment: algorithms and applications


Participants: Rumen Andonov, Nicola Yanev, Guillaume Collet, Noël Malod-Dognin, Phillippe Veber.

Two typical problems related with protein structure are Fold Prediction/Recognition and Fold Comparison. By their nature, 3D computational problems are inherently more complex than the similar 1D ones for which efficient solutions have been yet developed. A theoretical basis that can provide rigourous support in understanding models for structure prediction and analysis is almost non-existent, as the problems are blend of continuous, geometric and combinatorial, discrete-mathematics. We focus notably on creating efficient exact algorithms for solving integer programming problems of proved interest: sequence/structure alignment (protein threading problem-PTP), and structure/structure comparison.

    Sequence/structure alignment.
We have been working on PTP since 2001 in a very narrow collaboration with two teams. The first one is MIG from INRA, Jouy-en-Josas, and the second one is the Operation Research Departement from the University of Sofia, Bulgaria whose leader, Prof. Nicola Yanev is regularly visiting Symbiose. MIG created FROST the only french software participating in the prestigious competition CAFASP (Critical Assessment of Fully Automated Structure Prediction) which holds every two years. The dedicated optimization algorithm in this tool is currently based on the Lagrangian approach developed by our group.
The weak point of FROST and other existing methods is that it permits only global alignment of a sequence with a structure, that is, allowing omissions of blocks during the alignment process. Designing algorithms for flexible structures alignment is our current focus of interest. Some ideas to tackle this problem were presented in [link], based on MIP models for semi-global and local sequence/structure alignment.

    Structure/structure comparison.
A multitude of measures have been proposed to quantify the similarity between protein 3-D structure. Among these measures, contact map overlap (CMO) maximization deserved sustained attention during past decade because it offers a fine estimation of the natural homology relation between proteins. Despite this large involvement of the bioinformatics and computer science community, the performance of known algorithms remains modest. Due to the complexity of the problem, they got stuck on relatively small instances and are not applicable for large scale comparison. Our results offer a clear improvement over past methods in this respect. We present a new integer programming model for CMO and propose an exact B&B algorithm with bounds computed by solving Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, and yet provides lower and upper bounds of better quality. Some hard CMO instances have been solved for the first time and within reasonable time limits. From the values of the running time and the relative gap (relative difference between upper and lower bounds), we obtained the right classification for this test. These encouraging result led us to design a harder benchmark to better assess the classication capability of our approach. We constructed a large scale set of 300 protein domains (a subset of ASTRAL database) that we have called Proteus_300. Using the relative gap of any of the 44850 couples as a similarity measure, we obtained a classication in very good agreement with SCOP. Our algorithm provides thus a powerful classication tool for large structure databases.

Learning automata and grammars on protein sequences


Participants: François Coste, Goulven Kerbellec, Matthias Gallé, Pierre Peterlongo.

This year, links between alignment and automata construction have been clarified. We formalized the partial alignment needed to build automata as an independant constrained matching optimization problem and we proposed a new heuristic algorithm to solve it. This new algorithm is faster than the previous heuristic and the semi-exact algorithm : it allows to handle efficiently classical real datasets. We proposed a Minimum Description Length measure to choose the best (non deterministic) automata with respect to the sequences sample. As a result, we are able to propose to the biologists Protomata-Learner, a program for the inference of automata on protein sequences that has been transferred in the GenOuest platform 5.4.
We began also this year a new line of research on learning more expressive grammars such as context-free grammars. Focusing rather on learning the structure than the language, we developed an approach based on recoding repeated elements. To handle large sequences such as genomes, efficient data structures and algorithms have to be used. To detect and score repeats, we used suffix arrays which needed to be regularly updated after each rewriting of repeat. We proposed and implemented efficient in-place algorithms to update suffix-arrays after each word recoding.


 

Symbiose Project Team - INRIA/Irisa © 2007 - 2008