Application of Concepts of Neighbours to Knowledge Graph Completion

Author: Sébastien Ferré (homepage)

This page provides information for reproducing our experiments on the use of Concepts of Nearest Neighbours (CNN) for the task of knowledge graph completion, also known as link prediction in knowledge graphs.

What are CNNs?

Given a knowledge graph whose nodes are entities and whose edges are labelled with relations, the CNNs of an entity e are clusters of entities similar to e, where the similarity of each cluster is formally represented by a Concept of Nearest Neighbours (CNN). Such a concept is composed of an intent and an extent. The intent is a conjunctive query Q (i.e. a graph pattern with a distinguished node) that characterizes what the entities in the cluster have in common with entity e. The extent is the set of answers of query Q, i.e. all entities that are at least as similar to e as the entities in the cluster. Such concepts define a symbolic form of distance between entities, hence the analogy with k-nearest neighbours: the larger the concept intent, or the smaller the concept extent, the closer (i.e. the more similar) to e the entities in the cluster are.
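To make the definition concrete, here is a minimal OCaml sketch on a toy graph, entirely independent of the SEWELIS implementation: the intent of a concept is encoded as a predicate over entities (standing for a conjunctive query with a distinguished node), and the extent is computed as the set of its answers. All entity and relation names are invented for the example.

(* Minimal illustration (not the SEWELIS code) of a concept of nearest
   neighbours on a toy knowledge graph. *)

type entity = string
type relation = string
type triple = entity * relation * entity

(* Toy knowledge graph: edges labelled with relations. *)
let graph : triple list = [
  ("France", "capital", "Paris");
  ("Germany", "capital", "Berlin");
  ("Paris", "locatedIn", "Europe");
  ("Berlin", "locatedIn", "Europe");
  ("Tokyo", "locatedIn", "Asia");
]

(* The intent of a concept: a conjunctive query with a distinguished node x,
   here encoded directly as a predicate. It expresses what Paris and Berlin
   have in common: "x is the capital of some country and x is located in
   Europe". *)
let intent (x : entity) : bool =
  List.exists (fun (_, r, o) -> r = "capital" && o = x) graph
  && List.mem (x, "locatedIn", "Europe") graph

(* The extent of the concept: the set of answers of the intent, i.e. all
   entities of the graph that satisfy the query. *)
let extent : entity list =
  graph
  |> List.concat_map (fun (s, _, o) -> [ s; o ])
  |> List.sort_uniq compare
  |> List.filter intent

let () =
  List.iter print_endline extent   (* prints: Berlin, then Paris *)

In the actual approach, such intents are not written by hand but discovered automatically by generalizing what other entities have in common with the query entity e.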

Technical details about the definition and computation of CNNs can be found in the following publications:

In this page, we consider the use of CNNs for the task of link prediction in knowledge graphs. The task consists in predicting the missing entity of a triple (head, relation, tail): either predicting the missing tail given the head and relation, or predicting the missing head given the tail and relation.
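As an illustration of the two prediction directions (the head direction being activated by the -inv flag described below), here is a small OCaml sketch of how a test triple yields one or two prediction tasks; it mirrors the evaluation protocol, not the actual cnn_expe code.

(* Sketch (not the cnn_expe code) of the prediction tasks derived from a
   test triple. *)

type entity = string
type relation = string
type triple = entity * relation * entity

(* A prediction task: the known part of the triple and the missing side. *)
type task =
  | Predict_tail of entity * relation   (* (head, relation, ?) *)
  | Predict_head of relation * entity   (* (?, relation, tail) *)

(* Heads are predicted in addition to tails only when inv is set,
   mirroring the -inv flag of the program. *)
let tasks_of_test_triple ~inv ((h, r, t) : triple) : task list =
  if inv then [ Predict_tail (h, r); Predict_head (r, t) ]
  else [ Predict_tail (h, r) ]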

Implementation and usage

The source code of our implementation is available as part of the BitBucket repository of SEWELIS. The main relevant code is found in branch dev, in the source files src/cnn.ml (computation of CNNs) and src/cnn_expe.ml (experiments based on CNNs). To make it easier to reuse the program on other datasets or with different parameters, we provide here an executable binary built on a Linux64 Fedora 25 system (it should work on any Linux64 distribution).

Usage of the program (run ./cnn_expe --help for more details):

./cnn_expe -train file.rdf [-valid file.rdf] -test file.ttl [-inv] [-maxprio integer] [-timeout float] [-v]
The train and valid RDF files are treated alike (there is no validation phase) and can be in various formats (RDF/XML, N-Triples, Turtle). The test file must be in Turtle format. The flag -inv asks to predict the heads of triples in addition to the tails. The -maxprio option specifies a maximum depth, which determines how far from the entity the knowledge graph is explored; it must be a positive integer (default: no limit). The timeout is in seconds (default: 1s) and is split in two between CNN computation and prediction. The flag -v enables verbose mode, which outputs explanations for the predictions.
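For example, a call using the options documented above might look as follows (the dataset file names are placeholders, and the parameter values are only indicative):

./cnn_expe -train train.rdf -valid valid.rdf -test test.ttl -inv -maxprio 3 -timeout 2.0 -v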

Datasets

We have experimented on five datasets: four classical link prediction benchmarks (WN18, WN18RR, FB15k, FB15k-237) and a new one (Mondial).

Results

The results are summarized in the following table, which provides the standard link prediction measures (Hits@1, Hits@3, Hits@10, MRR; see the sketch after the table for how they are computed from prediction ranks), as well as a link to the raw output log that provides additional details and explanations for the predictions. These results were obtained between 29 October 2019 and 29 December 2019.
dataset    depth  timeout(s)  Hits@1  Hits@3  Hits@10  MRR    output log
WN18       3      1+1         0.967   0.970   0.972    0.969  Download
WN18RR     3      1+1         0.444   0.481   0.519    0.469  Download
FB15k      3      1.5+0.5     0.827   0.861   0.890    0.849  Download
FB15k-237  3      1.5+0.5     0.222   0.322   0.446    0.296  Download
Mondial    3      1+1         0.409   0.511   0.645    0.485  Download
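As a reading aid, the measures in the table can be computed from the rank of the correct entity in each prediction, as in the following OCaml sketch (which is not the evaluation code of cnn_expe; the example ranks are made up):

(* Sketch of the standard link prediction measures, computed from the
   1-based rank of the correct entity in each prediction. *)

let hits_at k ranks =
  let hits = List.length (List.filter (fun r -> r <= k) ranks) in
  float_of_int hits /. float_of_int (List.length ranks)

let mrr ranks =
  List.fold_left (fun acc r -> acc +. 1. /. float_of_int r) 0. ranks
  /. float_of_int (List.length ranks)

let () =
  let ranks = [ 1; 3; 2; 15; 1 ] in   (* made-up example ranks *)
  Printf.printf "Hits@1=%.3f Hits@3=%.3f Hits@10=%.3f MRR=%.3f\n"
    (hits_at 1 ranks) (hits_at 3 ranks) (hits_at 10 ranks) (mrr ranks)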