NEW: Two new competitions! A competition on learning context-free languages for ICGI'04 and a competition on learning DFAs from noisy samples for GECCO 2004!

Quick links:

Examples are on a two-letter alphabet, and the size of the target DFAs ranges from 1 to 4 states.

The DFAs have been chosen to exhibit various configurations; the corresponding languages are:

L1 : a*
L2 : (ab)*
L3 : any sentence without an odd number of consecutive a's after an odd number of consecutive b's.
L4 : any sentence over the alphabet a,b without more than two consecutive a's.
L5 : any sentence with an even number of a's and an even number of b's.
L6 : any sentence such that the number of a's differs from the number of b's by 0 modulo 3.
L7 : a*b*a*b*

I have found the sample sets in a not so old paper by A. Blair and J. Pollack [BP97] dealing with language induction with second-order recurrent neural networks... You can find these samples here in the Abbadingo One format.
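The Tomita languages are small enough that a membership test is a few lines of code. As an illustration only (not part of the benchmark), here is a minimal sketch of a recognizer for L5 (even number of a's and even number of b's), tracking the 4-state DFA whose state is the pair of parities; the function name is my own.

```python
# Hypothetical sketch: membership test for Tomita language L5
# (strings over {a, b} with an even number of a's and an even
# number of b's). The DFA state is the pair (parity of a's,
# parity of b's); the start state (0, 0) is the only accepting one.

def in_l5(s: str) -> bool:
    a_par, b_par = 0, 0
    for c in s:
        if c == 'a':
            a_par ^= 1          # flip parity of a's
        elif c == 'b':
            b_par ^= 1          # flip parity of b's
        else:
            raise ValueError(f"unexpected symbol: {c!r}")
    return a_par == 0 and b_par == 0
```

For example, "abab" is accepted (two a's, two b's) while "aab" is rejected.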

This benchmark has been progressively completed by L. Miclet and C. de Gentile [MG94] and by P. Dupont [D94]. The new languages are:

L8 : a*b
L9 : (a* + c*)b (3 letters!)
L10 : (aa)*(bbb)*
L11 : any sentence with an even number of a's and an odd number of b's.
L12 : a(aa)*b
L13 : any sentence over the alphabet a,b with an even number of a's.
L14 : (aa)*ba*
L15 : bc*b + ac*a (3 letters!)

I have got the set of training samples used in [D94]. They have been generated by random walks in the canonical DFA of L and its complement:

(1) I+ consists of a set of strings randomly generated from the canonical DFA of L, with the property that I+ is structurally complete. Let |I+|c be the sample size required to randomly get such a sample. Depending on the value of a parameter denoted M, we pursue the random generation until the sample size |I+| is equal to M . |I+|c.

(2) I- is constructed in the same way but from the complementary automaton, i.e. the automaton accepting Sigma* - L.
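The random-walk generation above can be sketched in a few lines. This is my own illustrative reconstruction, not the code actually used in [D94]: the DFA encoding (a transition dictionary), the stopping probability, and all names are assumptions, and the structural-completeness / M-parameter bookkeeping is left out.

```python
import random

# Hypothetical sketch of the random-walk protocol: generate positive
# strings by walking the canonical DFA of L from its start state,
# emitting the current string when an accepting state is reached and
# a biased coin says stop (or when there is no outgoing transition).

def random_accepted_string(delta, start, accepting, rng, stop_p=0.3):
    state, word = start, []
    while True:
        outs = sorted({c for (q, c) in delta if q == state})
        if state in accepting and (not outs or rng.random() < stop_p):
            return "".join(word)
        sym = rng.choice(outs)
        word.append(sym)
        state = delta[(state, sym)]

# Example: the canonical DFA of L8 = a*b
# (state 0 loops on 'a', reads 'b' to reach accepting state 1).
delta = {(0, "a"): 0, (0, "b"): 1}
rng = random.Random(0)
sample = {random_accepted_string(delta, 0, {1}, rng) for _ in range(20)}
```

Generating I- is the same walk run on the complementary automaton, as described in (2) above.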

For each language, 20 training sets have been randomly generated according to the protocol described above. The resulting samples contain from 1 to 156 strings, whereas their length varies from 1 to 1017 letters. Samples 01 to 10 have been generated for M = 1 whereas samples 11 to 20 have been generated for M = 3. Therefore, the first 10 samples are smaller than the last 10, which are more likely to be characteristic... This is my favorite benchmark to debug my programs (ooh, do I?). You can get the compressed ("targzipped") version of all the samples or choose between the samples.

You can also have a look at A. Oliveira and J. Silva's benchmark. Here is their presentation of the benchmark generated to evaluate the BIC algorithm [OS97]:

The benchmark is available on A. Oliveira's home page (direct link to the targzipped file in Abbadingo format). 115 finite state machines with binary inputs and outputs were randomly generated. These finite state machines were reduced and unreachable states were removed before the experiments were run. The size of the machines (after reduction) varied between 3 and 19 states. A total of 575 training sets were generated, with each training set containing twenty strings of length 30.

If the previous link doesn't work, try this one.

And don't forget the Abbadingo One problems. It is composed of 12 randomly generated DFAs whose size after reduction ranges between 63 and 519 states. The training sets, containing from 1,521 to 60,000 strings, were drawn without replacement from a uniform distribution over the set of all strings not longer than the target DFA depth plus 5. Challenging, isn't it? The competition is now over but you can still download the challenge problems. The oracle is still answering your queries... And if you solve problem T, let us know about it :o)

A follow-on to Abbadingo One is the on-demand problem server: Gowachin. The same generation procedure as in Abbadingo is used, but you can specify your own size and number-of-sequences parameters. Introduction of noise on labels and symbols has also been added...
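The uniform draw without replacement described above has a neat implementation trick: each binary string of length at most L corresponds to exactly one integer in [1, 2^(L+1) - 1] (write i in binary and drop the leading 1), so sampling integers without replacement gives the string sample directly. A minimal sketch, assuming a binary alphabet {0, 1} and my own function names, not the actual Abbadingo generator:

```python
import random

# Hypothetical sketch of the Abbadingo-style draw: sample n training
# strings without replacement, uniformly over all binary strings of
# length at most depth + 5.

def draw_sample(depth: int, n: int, seed: int = 0):
    max_len = depth + 5
    universe = 2 ** (max_len + 1) - 1   # count of strings of length <= max_len
    ids = random.Random(seed).sample(range(1, universe + 1), n)
    # bin(i) is '0b1…'; stripping the leading '0b1' maps each integer
    # to a unique binary string (i = 1 gives the empty string).
    return [bin(i)[3:] for i in ids]
```

Sampling the integer ids without replacement guarantees no duplicate strings, matching the protocol quoted above.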

A follow-on to Gowachin is the Learning DFA from noisy samples contest for GECCO 2004 (the closing date for this contest is Friday, May 28th, 2004). This contest focuses on a fairly small DFA (nominal 50 states), but with a high level of label noise (10%) and a moderate training sample size (5,000). The programs in this contest will have to run on the same machine within the same time constraints: 10 minutes CPU time (on a 2.4 GHz Pentium 4 running Windows or Linux)!
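The 10% label noise mentioned above amounts to flipping each training label independently with probability 0.1. A minimal sketch of that corruption step, assuming a list of (string, label) pairs with 0/1 labels; the sample format and function name are my own, not the contest's:

```python
import random

# Hypothetical sketch of GECCO-style label noise: flip each label
# independently with probability `noise`, leaving strings untouched.

def add_label_noise(sample, noise=0.10, seed=0):
    """sample: list of (string, label) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    return [(s, 1 - y if rng.random() < noise else y)
            for s, y in sample]
```

Note that this only corrupts labels; Gowachin's symbol noise (corrupting the strings themselves) is a separate mechanism.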

NEW: A new competition on learning context-free languages has been set for the next ICGI, to be held in Athens in October 2004. Theoretical learning results are not encouraging, but practically, what can be done? More on the Omphalos site!

I've made a page on the Prediction of the Subcellular Location of Proteins task based on A. Reinhardt and T. Hubbard's data, containing a translation of the data in Abbadingo format and gathering useful links on the subject... An easy way to test your favorite algorithm on a real task!

3 databases at the famous UCI Repository of Machine Learning Databases:

promoter-gene-sequences

protein-secondary-structure

splice-junction-gene-sequences

M. Tomita, Dynamic Construction of Finite Automata from Examples Using Hill Climbing, Proc. of the 4th Annual Cognitive Science Conference, USA, pp. 105-108, 1982.

[D94] P. Dupont, Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method, Lecture Notes in Artificial Intelligence, No. 862, Springer Verlag, Grammatical Inference and Applications, ICGI'94, pp. 236-245, 1994.

[MG94] L. Miclet and C. de Gentile, Inférence Grammaticale à partir d'Exemples et de Contre-Exemples : deux Algorithmes Optimaux (BIG et RIG) et une Version Heuristique (BRIG), Actes des JFA-94, Strasbourg, France, pp. F1-F13, 1994.

[BP97] A.D. Blair and J.B. Pollack, Analysis of Dynamical Recognizers, Neural Computation 9(5), pp. 1127-1142, 1997.

[OS97] A. Oliveira and J. Silva, Efficient Search Techniques for the Inference of Minimum Size Finite Automata, Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, The Fourteenth International Conference on Machine Learning (ICML-97), July 12, 1997, Nashville, Tennessee.

Comments and remarks are welcome

Back to my home page