Vous êtes ici

Semantic Lifting of Complex Data by Hypergraph Compression - Stylistics in the usage of computer languages

Equipe et encadrants
Département / Equipe: 
Site Web Equipe: 
https://www-semlis.irisa.fr
Directeur de thèse
Sébastien Ferré
Co-directeur(s), co-encadrant(s)
Peggy Cellier
Contact(s)
NomAdresse e-mailTéléphone
Peggy Cellier
peggy.cellier@irisa.fr
02 99 84 22 84
Sujet de thèse
Descriptif

Application

An important challenge in Big Data and more specifically in the Semantic Web is semantic lifting, i.e. finding ways to turn data into knowledge. A large part of those data are documents expressed in languages ranging from computer languages (e.g., source code, SQL queries) to natural languages (e..g, web pages, books). Analogously to stylistics that aims at analysing documents in natural language, it is interesting to analyse documents in computer language to discover design patterns and good practices from experts’ usage in order to help transfer their knowledge to beginners. However, since the size of the documents and their complex low-level representations (e.g., instructions, operators), it is a challenge for a human to gain knowledge by merely reading the raw documents. Patterns discovered in documents could be used to perform a semantic lifting by re-encoding the low-level data into a compressed higher-level knowledge representation.

 

Objectives

The aim of this PhD is to propose an approach of semantic lifting based on hypergraphs, i.e. graphs with n-ary edges. We propose to use the same structure for both low-level data (e.g., control-flow graph) and higher-level knowledge (e.g., design patterns, idiomatics, boilerplate code). The advantage of hypergraphs is that they are expressive and already used to represent knowledge in the Semantic Web. In addition, using the same structure for data and knowledge allows to iterate steps of semantic lifting, and therefore to tune the abstraction level of the compressed knowledge.

 

Technical context and challenge

Formal Concept Analysis (FCA) [1] is a theoretical framework that is used for unsupervised knowledge extraction, and has been applied to many domains (e.g., social sciences, software refactoring). An extension of FCA, Graph-FCA [2,3], has recently been defined in the SemLIS team, where the input is a hypergraph, i.e. a set of n-ary relationships instead of one binary relationship, thus allowing more expressive representations. Graph mining algorithms [4,5] and Relational Concept Analysis (RCA) [6] can also be used to generate graph patterns that they are generalized to hypergraphs. However, the number of concepts can get very large, as it is often the case in data mining. A key issue of this PhD is to select a small subset of the graph concepts in order to re-encode the data at a higher level [7]. The selection criteria will follow the MDL principle (Minimum Description Length) [8,9], which has already been applied to FCA-like data but not yet to graphs. The idea is to use graph concepts to re-encode the representation of the original graph in a compressed way by factorizing out repeated patterns. The "best" concepts are those that compress the most.

 

Candidate profile

The expected research work requires a taste for theory, algorithmics, implementation, and experiments.

Bibliographie

[1] Ganter, B., Wille, R.: Formal Concept Analysis – Mathematical Foundations. Springer, Heidelberg (1999)

[2] Ferré, S.: A proposal for extending Formal Concept Analysis to knowledge graphs. In: Int. Conf. Formal Concept Analysis (ICFCA). LNAI 9113, pp. 271-286. Springer (2015)

[3] Ferré, S., Cellier P.: Graph-FCA in Practice - Int. Conf. Conceptual Structures (ICCS). LNAI 9717, Springer (2016)

[4] Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. Int. Conf. Data Mining. IEEE (2003)

[5] Yan, X. and Han, J.: CloseGraph: mining closed frequent graph patterns. Int. Conf. Knowledge discovery and data mining (SIGKDD). ACM (2003)

[6] Dolques, X. and Huchard, M. and Nebut, C. and Reitz, P.: Learning transformation rules from transformation examples: An approach based on relational concept analysis. Enterprise Distributed Object Computing Conf. Work. (EDOCW). IEEE (2010)

[7] M. Klimushkin, S. Obiedkov, and C. Roth: Approaches to the selection of relevant concepts in the case of noisy data. International Conference on Formal Concept Analysis. Springer Berlin Heidelberg, 2010.

[8] J. Vreeken, M. Van Leeuwen, A. Siebes: Krimp: mining itemsets that compress - Data Mining and Knowledge. Springer, 2011.

[9] D.J. Cook, and L.B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research 1 (1994): 231-255.

Début des travaux: 
Septembre 2017
Mots clés: 
Semantic Web, semantic lifting, knowledge graphs, data mining, Formal Concept Analysis, KRIMP, Minimum Description Length, computer languages
Lieu: 
IRISA - Campus universitaire de Beaulieu, Rennes