I found this sample set in a not-so-old paper by A. Blair and J. Pollack [BP97] dealing with language induction by second-order recurrent neural networks. The examples are over a two-letter alphabet, and the target DFAs have from 1 to 4 states; you can find these samples here in the Abbadingo One format. The DFAs have been chosen to exhibit various configurations; the corresponding languages are:
L1 : a*
L2 : (ab)*
L3 : any sentence without an odd number of consecutive a's after an odd number of consecutive b's
L4 : any sentence over the alphabet {a, b} without more than two consecutive a's
L5 : any sentence with an even number of a's and an even number of b's
L6 : any sentence such that the number of a's differs from the number of b's by 0 modulo 3
L7 : a*b*a*b*
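Each target above is an ordinary DFA, so membership is just a matter of simulating a handful of states. As a minimal illustrative sketch (the function name is mine, not part of the benchmark), here is a check for L5, whose two-bit state tracks the parity of each letter:

```python
# Hypothetical sketch: membership test for L5 (even number of a's AND even
# number of b's). The DFA state is the pair (parity of a's, parity of b's);
# the start state (0, 0) is also the unique accepting state.
def in_L5(s: str) -> bool:
    """Accept strings over {a, b} with an even count of both letters."""
    parity_a = parity_b = 0
    for c in s:
        if c == 'a':
            parity_a ^= 1          # flip the a-parity bit
        elif c == 'b':
            parity_b ^= 1          # flip the b-parity bit
        else:
            return False           # symbol outside the alphabet
    return parity_a == 0 and parity_b == 0
```

For example, "aabb" and "abab" are accepted, while "aab" is rejected; the empty string is accepted as well.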
This benchmark has been progressively extended by L. Miclet and C. de Gentile [MG94] and by P. Dupont [D94]. I obtained the set of training samples used in [D94]; they were generated by random walks in the canonical DFA of L and of its complement. The new languages are:
L8 : a*b
L9 : (a* + c*)b (3 letters!)
L10 : (aa)*(bbb)*
L11 : any sentence with an even number of a's and an odd number of b's
L12 : a(aa)*b
L13 : any sentence over the alphabet {a, b} with an even number of a's
L14 : (aa)*ba*
L15 : bc*b + ac*a (3 letters!)

(1) I+ consists of a set of strings randomly generated from the canonical DFA of L, with the property that I+ is structurally complete. Let |I+|c denote the sample size required to randomly obtain such a sample. Depending on the value of a parameter denoted M, the random generation is pursued until the sample size satisfies |I+| = M * |I+|c.

For each language, 20 training sets have been randomly generated according to this protocol. The resulting samples range from 1 to 156 strings, and their lengths vary from 1 to 1017 letters. Samples 01 to 10 were generated with M = 1, and samples 11 to 20 with M = 3; the first 10 samples are therefore smaller than the last 10, which are more likely to be characteristic... This is my favorite benchmark for debugging my programs (ooh, do I?). You can get the compressed ("targzipped") version of all the samples, or choose among the individual samples.
(2) I- is constructed in the same way, but from the complementary automaton, i.e. the automaton accepting Sigma* - L.
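The generation protocol in (1) and (2) can be sketched as follows. This is a hypothetical illustration, not the authors' actual generator: the toy DFA below recognizes L2 = (ab)*, the stopping probability is an arbitrary choice, and structural completeness is checked on transition coverage only. Negative samples would be produced the same way from the complement automaton.

```python
import random

# Toy transition table for the canonical DFA of L2 = (ab)*:
# state 0 is both the start state and the only final state.
DFA = {(0, 'a'): 1, (1, 'b'): 0}
START, FINALS = 0, {0}

def random_positive(max_steps=20):
    """One random walk from START to a final state; returns the generated
    string and the set of transitions it exercised."""
    state, letters, used = START, [], set()
    while True:
        # At a final state, stop with some probability (arbitrary here),
        # or force a stop once the walk is getting long.
        if state in FINALS and (random.random() < 0.3 or len(letters) >= max_steps):
            return ''.join(letters), used
        moves = [(sym, nxt) for (src, sym), nxt in DFA.items() if src == state]
        sym, nxt = random.choice(moves)
        letters.append(sym)
        used.add((state, sym))
        state = nxt

def generate_I_plus():
    """Keep drawing random walks until every transition of the DFA has been
    used at least once (a simple form of structural completeness)."""
    sample, covered = [], set()
    while covered != set(DFA):
        s, used = random_positive()
        sample.append(s)
        covered |= used
    return sample
```

Every string the walk produces is accepted by the DFA by construction, and the stopping condition guarantees the positive sample exercises the whole machine.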
You can also have a look at A. Oliveira and J. Silva's benchmark. Here is their presentation of the benchmark generated to evaluate the BIC algorithm [OS97]:
115 finite state machines with binary inputs and outputs were randomly generated. These finite state machines were reduced and unreachable states were removed before the experiments were run. The size of the machines (after reduction) varied between 3 and 19 states. A total of 575 training sets were generated, each containing twenty strings of length 30. The benchmark is available on A. Oliveira's home page (direct link to the targzipped file in Abbadingo format).
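One of the preprocessing steps mentioned above, removing unreachable states, amounts to a graph search over the transition table. A minimal sketch, assuming a toy table of the form {(state, input): next_state} (the representation and names are mine):

```python
from collections import deque

def reachable_states(transitions, start):
    """BFS over the transition table from the start state."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for (src, _inp), dst in transitions.items():
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

def prune_unreachable(transitions, start):
    """Drop every transition that leaves a state the start state cannot reach."""
    keep = reachable_states(transitions, start)
    return {k: v for k, v in transitions.items() if k[0] in keep}
```

For instance, with transitions {(0, '0'): 1, (1, '1'): 0, (2, '0'): 2} and start state 0, state 2 is unreachable and its transition is pruned.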
If the previous link doesn't work, try this one.
And don't forget the Abbadingo One problems. The set is composed of 12 randomly generated DFAs whose sizes, after reduction, range between 63 and 519 states. The training sets, of 1,521 to 60,000 strings, were drawn without replacement from a uniform distribution over the set of all strings no longer than the target DFA depth plus 5. Challenging, isn't it? The competition is now over, but you can still download the challenge problems. The oracle is still answering your queries... And if you solve problem T, let us know about it :o)
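The Abbadingo draw can be sketched in a few lines. This is a hypothetical illustration with a toy maximum length, assuming a binary alphabet and a rank-based enumeration of strings (shortest first, then lexicographic); the function name and parameters are mine:

```python
import random

def draw_training_set(n_strings, max_len, seed=None):
    """Sample n_strings binary strings uniformly, WITHOUT replacement,
    from all strings of length <= max_len (= target DFA depth + 5 in
    Abbadingo; a small toy value is fine here)."""
    rng = random.Random(seed)
    total = 2 ** (max_len + 1) - 1        # count of binary strings of length <= max_len
    picks = rng.sample(range(total), n_strings)
    strings = []
    for k in picks:
        # Rank k enumerates strings as "", "0", "1", "00", "01", "10", ...
        length = (k + 1).bit_length() - 1  # 2^length - 1 strings are shorter
        offset = k - (2 ** length - 1)     # position among strings of that length
        strings.append(format(offset, 'b').zfill(length) if length else '')
    return strings
```

Sampling ranks without replacement rather than the strings themselves keeps the draw uniform and duplicate-free even when the string universe is huge.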
A follow-on to Abbadingo One is the on-demand problem server Gowachin. The same generation procedure as in Abbadingo is used, but you can specify your own size and number-of-sequences parameters. Noise on labels and on symbols has also been added...
A follow-on to Gowachin is the Learning DFA from noisy samples contest for GECCO 2004 (the closing date for this contest is Friday, May 28th, 2004). This contest focuses on a fairly small DFA (nominally 50 states), but with a high level of label noise (10%) and a moderate training sample size (5,000). The programs in this contest will have to run on the same machine under the same time constraints: 10 minutes of CPU time (on a 2.4 GHz Pentium 4 running Windows or Linux)!
M. Tomita, Dynamic Construction of Finite Automata from Examples Using Hill Climbing, Proc. of the 4th Annual Cognitive Science Conference, USA, pp. 105-108, 1982.

[D94] P. Dupont, Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method, Lecture Notes in Artificial Intelligence, No. 862, Springer Verlag, Grammatical Inference and Applications, ICGI'94, pp. 236-245, 1994.

[MG94] L. Miclet and C. de Gentile, Inférence Grammaticale à partir d'Exemples et de Contre-Exemples : deux Algorithmes Optimaux (BIG et RIG) et une Version Heuristique (BRIG), Actes des JFA-94, Strasbourg, France, pp. F1-F13, 1994.

[BP97] A.D. Blair and J.B. Pollack, Analysis of Dynamical Recognizers, Neural Computation 9(5), pp. 1127-1142, 1997.

[OS97] A. Oliveira and J. Silva, Efficient Search Techniques for the Inference of Minimum Size Finite Automata, Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, The Fourteenth International Conference on Machine Learning (ICML-97), July 12, 1997, Nashville, Tennessee.