2018-2019 Internship subject: Efficient learning of multiple context-free grammars by local substitutability

Printer-friendly version

Internship subject 2018-2019 : Efficient learning of multiple context-free grammars by local substitutability

Presentation slides

Keywords: Machine learning, Formal Grammars, Algorithmics, Learnability

Subject:

Learning grammars automatically was originally motivated by the problem of natural languages acquisition. It targets now a large range of applications requiring a structured view or characterization of sequences: e. g. genomics, text document processing, software engineering, robotics, protocol testing or intrusion detection...

Learning grammars is a difficult task that has been limited for long to regular grammars (or equivalently automata). The introduction of a new generalization scheme, based on the substitutability of strings appearing in common contexts, enables now to learn some context-free grammars [1]. Motivated by the application of this approach to characterize protein families, we have proposed a new generalization operator based on local contexts, rather than the classical global contexts, and an efficient learning algorithm, with promising experimental results on real protein data sets [2] and nice polynomial time and data learnability properties [3].

Yet, context-free grammars can only model nested dependencies (as in the well-formed parenthesis Dyck language). Learning more expressive grammars is thus needed to model the numerous crossing dependencies seen in protein structures. We propose here to focus on the classes of multiple context-free grammars (MCFG): they can represent such dependencies and have been shown theoretically learnable by substitutability [4,5,6], but not in the polynomial time and data framework usually targeted for practical applications.

In this context, the subject of the internship is to study how the idea of local substitutability could be used to 1) define a more efficiently learnable class of MCFGs to model crossing dependencies and 2) propose an efficient learning algorithm for this class. The interest of the new approach proposed will then be validated by theoretical results (learnability or complexity results) or practical experiments (testing its performances on artificial data or real protein families).

Duration: 5-6 months

Requirement: Computer Science PhD or Master degree study with  courses in formal language theory, algorithmics and machine learning.

Team: Dyliss, IRISA / Inria Rennes-Bretagne Atlantique

Advisor: François Coste, francois.coste@inria.fr, tel (33|0) 299 847 491

Bibliography:

  1. Polynomial identification in the limit of substitutable context-free languages, A Clark, R Eyraud, Journal of Machine Learning Research, 2007
  2. A bottom-up efficient algorithm learning substitutable languages from positive examples, F. Coste, G. Garet and J. Nicolas, International Conference on Grammatical Inference 2014, Kyoto.
  3. Learning local substitutable context-free languages from positive examples in polynomial time and data by reduction, F. Coste, J. Nicolas, International Conference on Grammatical Inference 2018, Wrocław
  4. Efficient learning of multiple context-free languages with multidimensional substitutability from positive data, R. Yoshinaka, Theoretical Computer Science, Volume 412, Issue 19, 2011
  5. Distributional learning of parallel multiple context-free grammars, A. Clark, R. Yoshinaka, Machine Learning, Volume 96, Issue 1–2, 2014
  6. Distributional Learning of Context-Free and Multiple Context-Free Grammars, A. Clark, R. Yoshinaka, Topics in Grammatical Inference, Chapter 6, 2016