Generation of the Grammars, Training and Testing Examples

This page is divided into the following sections:

  - Requirements
  - Background Theory
  - Measuring the complexity of the learning task
  - Creation of the Target Grammars
  - Construction of Training and Testing Sets
  - References

Requirements

The target grammars, training and testing sentences were created with the following objectives in mind:

  1. The learning task should be sufficiently difficult. Specifically, the task should be beyond the current state of the art, but not so difficult that it is extremely unlikely that a winner will be found.
  2. It should be provable that the training sentences are sufficient to identify the target language.

 

Background Theory

From (Gold 1967) it is known that it is impossible to learn a context-free grammar from positive examples alone without reference to a statistical distribution; it is, however, possible to identify the grammar in the limit when negative examples are also available.

 

It was therefore decided to include problems in which grammars must be learnt from positive examples only, as well as problems in which both positive and negative examples are available.

 

To determine whether or not requirement 1 was met, a measure of the complexity of the learning task was derived. The measure was derived by creating a model of the learning task based upon a brute-force search (the BruteForceLearner algorithm). This model was also used to determine whether the training data was sufficient to identify the target grammar. The model of learning, including the derivation of the complexity formulae, is defined here.
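As a rough illustration only (this is not the competition's BruteForceLearner, whose definition is in the linked derivation), a brute-force learner over small grammars in Chomsky normal form can be sketched as: enumerate candidate rule sets in order of size and return the first one consistent with the training data.

```python
from itertools import combinations

def cyk_accepts(rules, start, sentence):
    """CYK membership test for a grammar in Chomsky normal form.
    rules: collection of (A, rhs) where rhs is a terminal symbol or a pair (B, C)."""
    n = len(sentence)
    if n == 0:
        return False
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(sentence):
        table[i][i] = {A for A, rhs in rules if rhs == tok}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for A, rhs in rules:
                    if isinstance(rhs, tuple) and rhs[0] in table[i][k] and rhs[1] in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

def brute_force_learner(positives, negatives, nonterminals, terminals, max_rules):
    """Enumerate CNF rule sets by size; return the first one that accepts
    every positive example and rejects every negative example."""
    start = nonterminals[0]
    all_rhs = list(terminals) + [(B, C) for B in nonterminals for C in nonterminals]
    all_rules = [(A, rhs) for A in nonterminals for rhs in all_rhs]
    for size in range(1, max_rules + 1):
        for rules in combinations(all_rules, size):
            if all(cyk_accepts(rules, start, p) for p in positives) and \
               not any(cyk_accepts(rules, start, q) for q in negatives):
                return set(rules)
    return None
```

Even on this tiny example (learning something consistent with (ab)+ from a handful of strings) the search space grows combinatorially with the number of rules, which is exactly why the complexity measure below is exponential in the size of the characteristic set.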

 

The summary of these proofs is that the complexity of the learning task can be measured using Equation 1 below, and that a training set containing a characteristic set of the target grammar is sufficient to identify it.

Measuring the complexity of the learning task

For the reasons described in the previous section, the following equation was used as a measure of the complexity of the learning task. This measure was applied to a characteristic set of the target grammar.

Equation 1

$2^{\left(\sum_j \left(2|O_j| - 2\right) + 1\right)^{3} \,+\, \left(\sum_j \left(2|O_j| - 2\right) + 1\right) T(O)}$

where the sum runs over the sentences O_j of the characteristic set O.
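Reading the formula this way (the exact parenthesisation is an assumption, since the original rendering is ambiguous), the Log2(Complexity) values reported in the tables below can be computed directly from the lengths of the sentences in the characteristic set; a minimal sketch, with T(O) simply passed in as a number since its derivation lives in the linked proofs:

```python
def log2_complexity(sentence_lengths, t_of_o):
    """Log2 of Equation 1 for a characteristic set O.
    sentence_lengths: the lengths |Oj| of the sentences in O.
    t_of_o: the value of T(O), taken as given here."""
    # s = (sum over sentences Oj of 2*|Oj| - 2) + 1
    s = sum(2 * n - 2 for n in sentence_lengths) + 1
    # Equation 1 gives complexity = 2**(s**3 + s*T(O)); return the exponent,
    # i.e. Log2(Complexity) as reported in Tables 1 and 2
    return s ** 3 + s * t_of_o
```

For example, a characteristic set containing a single sentence of length 2 with T(O) = 0 gives s = 3 and a Log2(Complexity) of 27.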

 

Given this knowledge, the following technique was used to construct the training sets for the Omphalos competition:

 

The number of training examples was selected to be between 10 and 20 times the size of the characteristic set.

Creation of the Target Grammars

First, the literature was reviewed to identify pre-existing benchmark grammatical inference problems. The work of Stolcke (1993) and Nakamura and Matsumoto (2002) identified some benchmark problems from Cook, Rosenfield et al. (1976) and Hopcroft, Motwani et al. (2001).

Using Equation 1, the complexity of each of these grammars was measured. A description of these grammars and their complexity measures is given in Table 1 below. The complexity measure was applied to a characteristic set of each target grammar, rather than to the complete training sets.

 

Table 1 Complexity of Benchmark Inference Problems from Stolcke (1993) and Nakamura and Matsumoto (2002)

Grammar | Description                      | Example phrases                                         | Log2(Complexity) | Properties of Language
1       | (aa)*                            | aa, aaaa, aaaaaa                                        | 1.34E+03         | Regular
2       | (ab)*                            | ab, abab, ababab                                        | 2.22E+03         | Regular
3       | Operator precedence (small)      | (a + (b + a))                                           | 9.37E+03         | Not regular
4       | Parentheses                      | ( ), (( )), ()(), ()(())                                | 2.22E+03         | Not regular
5       | English verb agreement (small)   | that is a woman, i am there                             | 5.33E+05         | Finite
6       | English lzero grammar            | a circle touches a square, a square is below a triangle | 4.17E+06         | Finite
7       | English with clauses (small)     | a cat saw a mouse that saw a cat                        | 6.59E+05         | Not regular
8       | English conjugations (small)     | the big old cat heard the mouse                         | 9.13E+05         | Regular
9       | Regular expressions              | a b* (a)*                                               | 9.39E+03         | Not regular
10      | { w = w^R : w ∈ {a,b}{a,b}+ }    | a a a, b a                                              | 6.90E+04         | Not regular
11      | Number of a's = number of b's    | a a b b a a                                             | 4.29E+04         | Not regular
12      | Number of a's = 2 × number of b's| a a b, b a b a a a                                      | 3.01E+05         | Not regular
13      | { w w : w ∈ {a,b}+ }             | a b a, a a                                              | 9.12E+04         | Regular
14      | Palindrome with end delimiter    | a a b b $, a b $, b a a b $                             | 1.18E+05         | Not regular
15      | Palindrome with centre mark      | a c a, a b c b a                                        | 4.96E+03         | Not regular
16      | Even length palindrome           | a a, a b b a                                            | 9.30E+03         | Not regular
17      | Shape grammar                    | d a, b a d a                                            | 2.45E+04         | Not regular

 

Using the results of Table 1, the complexity of the target grammars was selected. The grammars were then created as follows.

  1. The number of non-terminals, terminals and rules was selected using the complexity of the grammars in Table 1 as a guide.
  2. A set of terminals and non-terminals was created. Rules were then created by randomly selecting terminals and non-terminals. A fixed number of rules were created to contain only terminal strings.
  3. Useless rules were then identified. If a non-terminal could not generate a terminal string, a terminal rule was added for it. If a non-terminal was not reachable from the start symbol, rules were added to make it reachable from the start symbol.
  4. Additional rules were added to ensure that the grammar did not represent a regular language.
  5. A characteristic set of sentences was generated for the grammar. If the complexity of the grammar was not in the desired range, the grammar was discarded.
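A minimal sketch of steps 2 and 3 above (random rule creation followed by productivity and reachability repair); the symbol sets and rule shapes here are hypothetical, and steps 4 and 5 (forcing non-regularity and filtering by complexity) are omitted:

```python
import random

NTS = ["S", "A", "B", "C"]      # hypothetical non-terminals; "S" is the start symbol
TERMINALS = ["a", "b"]          # hypothetical terminals

def reachable(rules, start, nts):
    """Non-terminals reachable from the start symbol."""
    reach, changed = {start}, True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in reach:
                for s in rhs:
                    if s in nts and s not in reach:
                        reach.add(s); changed = True
    return reach

def productive(rules, nts):
    """Non-terminals that can generate a terminal string."""
    prod, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in prod and all(s not in nts or s in prod for s in rhs):
                prod.add(lhs); changed = True
    return prod

def random_grammar(n_rules, n_terminal_rules, rng):
    """Steps 2-3: random rules, then repair of useless non-terminals."""
    nts = set(NTS)
    rules = []
    for _ in range(n_rules):                  # step 2: random mixed rules
        rhs = tuple(rng.choice(NTS + TERMINALS) for _ in range(rng.randint(1, 3)))
        rules.append((rng.choice(NTS), rhs))
    for _ in range(n_terminal_rules):         # step 2: purely terminal rules
        rhs = tuple(rng.choice(TERMINALS) for _ in range(rng.randint(1, 3)))
        rules.append((rng.choice(NTS), rhs))
    for nt in NTS:                            # step 3: ensure every NT is productive
        if nt not in productive(rules, nts):
            rules.append((nt, (rng.choice(TERMINALS),)))
    for nt in NTS:                            # step 3: ensure every NT is reachable
        if nt not in reachable(rules, "S", nts):
            rules.append(("S", (nt,)))
    return rules
```

After the two repair passes, every non-terminal both derives a terminal string and is reachable from the start symbol, so the grammar contains no useless symbols.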

 

Using this technique ten grammars were created, as listed below in Table 2. Tests were undertaken to ensure that grammars 1, 2, 3, 7 and 8 represented deterministic languages. To ensure that grammars 4, 5, 6, 9 and 10 represented non-deterministic languages, rules were added to the target grammars. It should be noted that these grammars are complex enough that they cannot be learnt using a brute-force technique in time to be entered into the Omphalos competition; even the smallest of the grammars could not be inferred using the BruteForceLearner.

 

Table 2 Complexities of Benchmark Inference Problems in the Omphalos Competition

Exercise | Training examples available | Properties of Language        | Log2(Complexity)
1        | Positive & negative         | Not regular, deterministic    | 1.10E+09
2        | Positive only               | Not regular, deterministic    | 7.12E+08
3        | Positive & negative         | Not regular, deterministic    | 1.65E+10
4        | Positive only               | Not regular, nondeterministic | 1.13E+10
5        | Positive & negative         | Not regular, nondeterministic | 5.46E+10
6        | Positive only               | Not regular, nondeterministic | 6.55E+10
7        | Positive & negative         | Not regular, deterministic    | 5.88E+11
8        | Positive only               | Not regular, deterministic    | 1.63E+11
9        | Positive & negative         | Not regular, nondeterministic | 1.08E+12
10       | Positive only               | Not regular, nondeterministic | 9.92E+11

 

The grammars listed in Table 2 represent two axes of difficulty in grammatical inference, specifically:

  - the complexity of the underlying problem
  - the amount of training data available.

The competition has adopted a linear ordering of the benchmark problems based upon the complexity of the problems and whether or not negative data is used to infer the grammar. For each level of complexity there are two problems. Solving a problem without using negative data is considered more difficult; therefore, if both problems are solved, the competitor who solved the problem without negative data will be declared the winner. In addition, it has been noted by Nakamura that the inference of nondeterministic languages is a more difficult problem, which is to be expected given that knowing the grammar is deterministic reduces the search space of the inference problem.

Construction of Training and Testing Sets

Once the target grammars were constructed, a characteristic set was constructed for each grammar. A set of positive examples was then created using the GenRGenS software (Denise, Sarron et al. 2001). Additional examples were then created by randomly generating sentences of length less than 5 plus the length of the longest sentence in the characteristic set. These sentences were then parsed using the target grammar and labelled as being either in or out of the target language. This set of sentences was then randomised and split into testing and training sets, in such a way as to ensure that the training set contained a characteristic set. For problems 2, 4 and 6 the training sets had all negative examples removed.
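The actual positive examples were generated with GenRGenS; purely as a toy sketch of the labelling-and-splitting step described above (the in_language predicate stands in for parsing with the target grammar, and all names are hypothetical):

```python
import random

def build_sets(characteristic, in_language, alphabet, n_extra, rng,
               train_frac=0.8, positive_only=False):
    """Randomly generate sentences shorter than 5 + the longest characteristic
    sentence, label them with the membership test, shuffle, and split so that
    the training set always contains the characteristic set."""
    max_len = max(len(s) for s in characteristic) + 4   # length < longest + 5
    labelled = []
    for _ in range(n_extra):
        s = tuple(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
        labelled.append((s, in_language(s)))            # label by "parsing"
    rng.shuffle(labelled)
    cut = int(train_frac * len(labelled))
    train = [(tuple(s), True) for s in characteristic] + labelled[:cut]
    test = labelled[cut:]
    if positive_only:   # for the positive-only problems, drop negative examples
        train = [(s, lab) for s, lab in train if lab]
    return train, test
```

Keeping the characteristic set inside the training portion is the property that, per the proofs linked above, makes the training data sufficient to identify the target grammar.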

References

 



Omphalos is being organized and administered by:

Brad Starkie , François Coste and Menno van Zaanen

You can contact them for comments and complaints at omphalos@irisa.fr

