Grammatical Inference Benchmarks Repository

Some public positive and negative training samples for grammatical inference; let me know if you know of others...



NEW: Two new competitions! A competition on learning context-free languages for ICGI'04 and a competition on learning DFAs from noisy samples for GECCO 2004!

 Artificial data

Automata inference

      o     Let's begin with a bit of history: do you know of any GI benchmark prior to that of M. Tomita [T82]?
Examples are over a two-letter alphabet, and the target DFAs have from 1 to 4 states.
The DFAs were chosen to exhibit various configurations; the corresponding languages are:
  • L1 : a*
  • L2 : (ab)*
  • L3 : any sentence without an odd number of consecutive a's after an odd number of consecutive b's.
  • L4 : any sentence over the alphabet a,b without more than two consecutive a's.
  • L5 : any sentence with an even number of a's and an even number of b's.
  • L6 : any sentence such that the number of a's differs from the number of b's by 0 modulo 3.
  • L7 : a*b*a*b*
  • I found the sample sets in a not-so-old paper by A. Blair and J. Pollack [BP97] dealing with language induction with second-order recurrent neural networks... You can find these samples here, in the Abbadingo One format.
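    To make these definitions concrete, here is a small Python sketch of membership tests for L1 to L7. This is my own code, not Tomita's, and the wording of L3 is a little ambiguous, so I take the "adjacent runs" reading (an odd run of b's immediately followed by an odd run of a's is forbidden):

        import re
        from itertools import groupby

        def runs(s):
            # run-length encoding: "aabba" -> [("a", 2), ("b", 2), ("a", 1)]
            return [(c, len(list(g))) for c, g in groupby(s)]

        def l1(s):  # L1: a*
            return re.fullmatch(r"a*", s) is not None

        def l2(s):  # L2: (ab)*
            return re.fullmatch(r"(ab)*", s) is not None

        def l3(s):  # L3: no odd run of a's right after an odd run of b's
            rs = runs(s)
            return not any(c1 == "b" and n1 % 2 and c2 == "a" and n2 % 2
                           for (c1, n1), (c2, n2) in zip(rs, rs[1:]))

        def l4(s):  # L4: no more than two consecutive a's
            return "aaa" not in s

        def l5(s):  # L5: even number of a's and even number of b's
            return s.count("a") % 2 == 0 and s.count("b") % 2 == 0

        def l6(s):  # L6: #a's - #b's congruent to 0 modulo 3
            return (s.count("a") - s.count("b")) % 3 == 0

        def l7(s):  # L7: a*b*a*b*
            return re.fullmatch(r"a*b*a*b*", s) is not None

    Such predicates are handy for sanity-checking a downloaded sample against its advertised language.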
     
    o     This benchmark has been progressively extended by L. Miclet and C. de Gentile [MG94] and by P. Dupont [D94]. The new languages are:
  • L8 : a*b
  • L9 : (a* + c*)b      (3 letters!)
  • L10 : (aa)*(bbb)*
  • L11 : any sentence with an even number of a's and an odd number of b's.
  • L12 : a(aa)*b
  • L13 : any sentence over the alphabet a,b with an even number of a's.
  • L14 : (aa)*ba*
  • L15 : bc*b + ac*a      (3 letters!)
  • I got the training samples used in [D94]. They were generated by random walks in the canonical DFA of L and in that of its complement:
    (1) I+ consists of a set of strings randomly generated from the canonical DFA of L, with the property that I+ is structurally complete, i.e. every transition of the DFA is exercised and every accepting state is the end state of at least one string. Let |I+|c be the sample size required to randomly obtain such a sample. Depending on the value of a parameter denoted M, we continue the random generation until the sample size |I+| equals M · |I+|c.

    (2) I- is constructed in the same way, but from the complementary automaton, i.e. the automaton accepting Sigma* - L.

    For each language, 20 training sets have been randomly generated according to this protocol. The resulting samples range from 1 to 156 strings, while string lengths vary from 1 to 1017 letters. Samples 01 to 10 were generated with M = 1, whereas samples 11 to 20 were generated with M = 3. The first 10 samples are therefore smaller than the last 10, which are more likely to be characteristic... This is my favorite benchmark for debugging my programs (ooh, do I?). You can get the compressed version of all the samples ("targzipped") or pick individual samples.
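    For concreteness, here is a minimal Python sketch of the generation protocol of (1). It is my own reconstruction, not Dupont's actual generator: the toy target DFA, the stopping probability of the random walk, and the structural-completeness test are all my assumptions. I- would be drawn the same way from the complement automaton.

        import random

        # Toy stand-in for a target: the canonical DFA of L8 = a*b,
        # as (initial state, accepting states, transition map).
        INIT, FINALS = 0, {1}
        DELTA = {(0, "a"): 0, (0, "b"): 1}

        def random_walk(delta, init, finals, stop=0.3, rng=random):
            # One accepted string: walk the DFA, halting with probability
            # `stop` each time the current state is accepting.
            state, word = init, []
            while True:
                if state in finals and rng.random() < stop:
                    return "".join(word)
                moves = [(a, q) for (p, a), q in delta.items() if p == state]
                if not moves:               # accepting dead end: stop here
                    return "".join(word)
                a, state = rng.choice(moves)
                word.append(a)

        def structurally_complete(sample, delta, init, finals):
            # Every transition exercised and every accepting state used
            # as the end state of at least one string.
            used, ends = set(), set()
            for w in sample:
                state = init
                for a in w:
                    used.add((state, a))
                    state = delta[(state, a)]
                ends.add(state)
            return used == set(delta) and finals <= ends

        def draw_positive_sample(M, delta=DELTA, init=INIT, finals=FINALS):
            # Generate until structural completeness (this takes |I+|_c
            # strings), then keep going until |I+| = M * |I+|_c.
            sample = []
            while not structurally_complete(sample, delta, init, finals):
                sample.append(random_walk(delta, init, finals))
            size_c = len(sample)
            while len(sample) < M * size_c:
                sample.append(random_walk(delta, init, finals))
            return sample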
    o     You can also have a look at A. Oliveira and J. Silva's benchmark. Here is their presentation of the benchmark generated to evaluate the BIC algorithm [OS97]:
    115 finite state machines with binary inputs and outputs were randomly generated. These finite state machines were reduced and unreachable states were removed before the experiments were run. The size of the machines (after reduction) varied between 3 and 19 states. A total of 575 training sets were generated, with each training set containing twenty strings of length 30.
    The benchmark is available on A. Oliveira's home page (direct link to the targzipped file in Abbadingo format).
    If the previous link doesn't work, try this one.
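    As an illustration of this kind of generation (a toy sketch in the same spirit, under my own assumptions; not Oliveira and Silva's actual generator), one can draw a random Mealy machine with binary inputs and outputs and prune its unreachable states. The reduction step (merging equivalent states) is left out for brevity:

        import random

        def random_mealy(n, rng=random):
            # Random Mealy machine with binary input/output alphabets:
            # nxt[s][x] is the next state and out[s][x] the output bit
            # emitted when reading input x in state s.
            nxt = [[rng.randrange(n) for _ in (0, 1)] for _ in range(n)]
            out = [[rng.randrange(2) for _ in (0, 1)] for _ in range(n)]
            return nxt, out

        def reachable(nxt, start=0):
            # Depth-first search for the states reachable from `start`;
            # the rest are the "unreachable states" to be removed.
            seen, stack = {start}, [start]
            while stack:
                s = stack.pop()
                for x in (0, 1):
                    if nxt[s][x] not in seen:
                        seen.add(nxt[s][x])
                        stack.append(nxt[s][x])
            return seen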
    o     And don't forget the Abbadingo One problems. The benchmark is composed of 12 randomly generated DFAs whose sizes after reduction range from 63 to 519 states. The training sets, of 1,521 to 60,000 strings, were drawn without replacement from a uniform distribution over the set of all strings no longer than the target DFA depth plus 5. Challenging, isn't it? The competition is now over, but you can still download the challenge problems. The oracle is still answering your queries... And if you solve problem T, let us know about it :o)
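    As a reconstruction of that drawing procedure (mine, not the organizers'; I assume the usual two-letter alphabet and a standard length-then-lexicographic bijection between integers and strings), sampling a training set without replacement might look like this:

        import random

        def index_to_string(i):
            # Bijection between naturals and strings over {a, b} in
            # length-then-lexicographic order: 0 -> "", 1 -> "a",
            # 2 -> "b", 3 -> "aa", ... (binary of i+1 minus leading 1).
            bits, i = [], i + 1
            while i > 1:
                i, b = divmod(i, 2)
                bits.append("ab"[b])
            return "".join(reversed(bits))

        def draw_training_set(depth, n_strings, rng=random):
            # All strings no longer than depth + 5, sampled uniformly and
            # without replacement; labels would then be obtained by
            # running each string through the target DFA.
            universe = 2 ** (depth + 5 + 1) - 1   # how many such strings
            picks = rng.sample(range(universe), n_strings)
            return [index_to_string(i) for i in picks]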

    o     A follow-on to Abbadingo One is the on-demand problem server Gowachin. The same generation procedure as in Abbadingo is used, but you can specify your own target size and number of training sequences. Options to introduce noise on labels and symbols have also been added...

    o  A follow-on to Gowachin is the Learning DFA from noisy samples contest for GECCO 2004 (the closing date for this contest is Friday, May 28th, 2004). This contest focuses on a fairly small DFA (nominally 50 states) but with a high level of label noise (10%) and a moderate training sample size (5,000 strings). The programs in this contest will have to run on the same machine within the same time constraints: 10 minutes of CPU time (on a 2.4 GHz Pentium 4 running Windows or Linux)!
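    The noise process itself is simple to emulate; here is a sketch (my own, not the contest's generator) that flips each training label independently with the stated 10% probability:

        import random

        def add_label_noise(sample, rate=0.10, rng=random):
            # sample: list of (string, label) pairs with labels 0 or 1;
            # each label is flipped independently with probability `rate`.
            return [(w, 1 - y if rng.random() < rate else y)
                    for w, y in sample]

        # e.g. add_label_noise([("ab", 1), ("ba", 0), ("", 1)])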

    Context-free inference

    NEW o A new competition on learning context-free languages has been set up for the next ICGI, to be held in Athens in October 2004. Theoretical learning results are not encouraging, but what can be done in practice? More on the Omphalos site!


    Molecular Biology

    o  I've made a page on the Prediction of the Subcellular Location of Proteins task, based on A. Reinhardt and T. Hubbard's data; it contains a translation of the data into the Abbadingo format and gathers useful links on the subject... An easy way to test your favorite algorithm on a real task!
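    Since the Abbadingo format comes up several times on this page, here is a small Python reader for it, based on my understanding of the format (a header line giving the number of strings and the alphabet size, then one string per line as "label length sym_1 ... sym_length"); double-check against the files themselves:

        def read_abbadingo(path):
            # Returns (sample, alphabet_size) where sample is a list of
            # (symbols, label) pairs; symbols are kept as strings.
            with open(path) as f:
                n_strings, alphabet_size = map(int, f.readline().split())
                sample = []
                for _ in range(n_strings):
                    fields = f.readline().split()
                    label, length = int(fields[0]), int(fields[1])
                    sample.append((fields[2:2 + length], label))
            return sample, alphabet_size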

    o  3 databases at the famous UCI Repository of Machine Learning Databases :
    [DIR]  promoter-gene-sequences
    [DIR] protein-secondary-structure
    [DIR] splice-junction-gene-sequences


    References:

    [T82]
    M. Tomita, Dynamic Construction of Finite Automata from Examples Using Hill Climbing,
    Proc. of the 4th Annual Cognitive Science Conference, USA, pp. 105-108, 1982.
    [D94]
    P. Dupont, Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method,
    Lecture Notes in Artificial Intelligence, No. 862, Springer Verlag, Grammatical Inference and Applications, ICGI'94, pp. 236-245, 1994.
    [MG94]
    L. Miclet and C. de Gentile, Inférence Grammaticale à partir d'Exemples et de Contre-Exemples : deux Algorithmes Optimaux (BIG et RIG) et une Version Heuristique (BRIG) [Grammatical Inference from Examples and Counter-Examples: Two Optimal Algorithms (BIG and RIG) and a Heuristic Version (BRIG)],
    Actes des JFA-94, Strasbourg, France, pp. F1-F13, 1994.
    [BP97]
    A.D. Blair and J.B. Pollack, Analysis of Dynamical Recognizers,
    Neural Computation, 9(5), pp. 1127-1142, 1997.
    [OS97]
    A. Oliveira and J. Silva, Efficient Search Techniques for the Inference of Minimum Size Finite Automata,
    Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, at the Fourteenth International Conference on Machine Learning (ICML-97), July 12, 1997, Nashville, Tennessee.

    Comments and remarks are welcome
    Back to my home page