François COSTE


Research scientist (CR1), Symbiose team

INRIA Rennes-Bretagne Atlantique

Address   : Symbiose, IRISA,
            Campus de Beaulieu,
            35042 Rennes Cedex,
Phone     : (33|0) 2 99-84-74-91
Secretary : (33|0) 2 99-84-73-34
Fax       : (33|0) 2 99-84-71-71

Main research topic:

Learning grammars and application to linguistic modelling of biological sequences

Keywords: Grammatical Inference, Machine Learning, Protein Structures and Functions, DNA...
I gave a tutorial on this subject at the tutorial day organized for the 10th anniversary of ICGI (ICGI'10). Here are the slides (6.3M) and the related bibliography.


  • Protomata Learner infers automata to model families of protein sequences. You can use it through a web interface on the Genouest Bioinformatics platform server. Here are some slides (4.4M) of its presentation at Gen2bio 2008.

    We are working on a new version, which will soon be available: stay tuned!

Grammatical Inference Benchmarks and Competitions

  • I am building a grammatical inference benchmarks repository (GIB): don't hesitate to contact me with your own data sets, especially real-world ones!
  • I maintain the Gowachin server, a continuation of the Abbadingo One DFA learning competition, which can generate parametrized problems. I also co-organized Omphalos, the competition on learning context-free languages, which is now over, but the data sets are still available.
    If you are interested in grammatical inference competitions, you should have a look at the 2010 competitions: Zulu and Stamina.


PhD Students

  • Gaelle Garet, Discovery of enzymatic functions in the framework of formal languages (with Jacques Nicolas).
Former PhD Students:


I am currently involved in the following projects:

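  • ANR LepidOLF: Microgénomique de la sensille phéromonale d’un lépidoptère : une approche novatrice pour comprendre les mécanismes olfactifs et leur modulation (Microgenomics of the pheromonal sensillum of a lepidopteran: an innovative approach to understanding olfactory mechanisms and their modulation)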
  • ANR Pelican: Competing for light in the ocean: An integrative genomic approach of the ecology, diversity and evolution of cyanobacterial pigment types in the marine environment
  • Collaboration MINCyT (ex SECyT) - INRIA with the "Grupo de Procesamiento de Lenguaje Natural" of Gabriel Infante-Lopez: Modélisation linguistique de séquences génomiques par apprentissage de grammaires (Linguistic modelling of genomic sequences by grammar learning)

Previous projects:

  • ANR Proteus: Reconnaissance de pli et repliement inverse : vers une prédiction à grande échelle des structures de protéines (Fold recognition and inverse folding: towards large-scale prediction of protein structures)
  • ANR Modulome: Deciphering and modelling the structural organization of genomes





Selected publications

(A more exhaustive list is available here)

  • Searching for Smallest Grammars on Large Sequences and Application to DNA,
    Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez,
    Journal of Discrete Algorithms, in press, available online, 2011

  • The Smallest Grammar Problem as Constituents Choice and Minimal Grammar Parsing
    Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez
    Algorithms, 4 (2011) 262-284

    extended and more formal version of the paper presented at LATA 2010: Choosing word occurrences for the smallest grammar problem

  • Modelling Biological Sequences by Grammatical Inference,
    François Coste,
    ICGI 2010 Tutorial Day
  • In place update of suffix array while recoding words,
    Matthias Gallé, Pierre Peterlongo and François Coste
    International Journal of Foundations of Computer Science, vol. 20, no. 6, 2009, pp. 1025-1045
    abstract, paper
    extended version of paper presented at PSC 2008 (abstract, paper, slides)
    supplementary material (code, data sets, experiments)
  • Learning Automata on Protein Sequences, François Coste and Goulven Kerbellec, JOBIM 2006.
    abstract, paper, slides (.pdf)
  • A Similar Fragments Merging Approach to Learn Automata on Proteins, François Coste and Goulven Kerbellec, ECML 2005.
    abstract, paper, extended version, data sets
    Some recent slides presenting this work and more at a grammatical inference workshop: slides, 4 per page for printing
  • Progressing the State-of-the-Art in Grammatical Inference by Competition, Brad Starkie, François Coste and Menno van Zaanen, AI Communications, vol. 18, no. 2, 2005, pp. 93-115.
    abstract, paper, slides (.ppt) presented at ICGI 2004
  • Introducing Domain and Typing Bias in Automata Inference, François Coste, Daniel Fredouille, Christopher Kermorvant and Colin de la Higuera. ICGI 2004.
    abstract, paper, slides (.ppt, 2.2MB)
  • Mutually compatible and incompatible merges for the search of the smallest consistent DFA, John Abela, François Coste and Sandro Spina. ICGI 2004.
    abstract, paper, slides (.ppt)
  • What is the Search Space for the Inference of Non Deterministic, Unambiguous and Deterministic Automata? François Coste, Daniel Fredouille, Techn. Report, RR-4907, 2003
  • Efficient ambiguity detection in C-NFA, a step toward inference of non deterministic automata, François Coste, Daniel Fredouille, ICGI 2000, Grammatical inference: algorithms and applications, Lisbon, September 2000, pp. 25-38
    paper (ps.gz) benchmark (.tar.gz).
    Classification ambiguity!

Ph.D. Thesis

Apprentissage d'automates classifieurs en inférence grammaticale (Learning classifier automata in grammatical inference), IRISA/Université de Rennes 1, 27 January 2000.
Advisor: Jacques Nicolas.
abstract (English and French), thesis (.ps.gz, .pdf, errata), slides (.ps.gz, .pdf).

