Application of Concepts of Neighbours to Knowledge Graph Completion

Author: Sébastien Ferré (homepage)

This page provides information for reproducing our experiments on the use of Concepts of Nearest Neighbours (CNN) for the task of knowledge graph completion, also known as link prediction in knowledge graphs.

What are CNNs?

Given a knowledge graph whose nodes are entities and whose edges are labelled with relations, the CNNs of an entity e are clusters of entities similar to e, where the similarity of each cluster is formally represented by a Concept of Nearest Neighbours (CNN). Such a concept is composed of an intent and an extent. The intent is a conjunctive query Q (i.e. a graph pattern with a distinguished node) that characterizes what the entities in the cluster have in common with entity e. The extent is the set of answers of query Q, i.e. all entities that are at least as similar to e as the entities in the cluster. Such concepts define a symbolic form of distance between entities, hence the analogy with k-nearest neighbours: the larger the concept intent, or the smaller the concept extent, the closer (i.e. the more similar) to e the entities in the cluster are.
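To make the definition concrete, here is a minimal OCaml sketch on a toy graph, entirely independent of the SEWELIS implementation: the intent of a concept is encoded as a predicate over entities (standing for a conjunctive query with a distinguished node), and the extent is computed as the set of its answers. All entity and relation names are invented for the example.

(* Minimal illustration (not the SEWELIS code) of a concept of nearest
   neighbours on a toy knowledge graph. *)

type entity = string
type relation = string
type triple = entity * relation * entity

(* Toy knowledge graph: edges labelled with relations. *)
let graph : triple list = [
  ("France", "capital", "Paris");
  ("Germany", "capital", "Berlin");
  ("Paris", "locatedIn", "Europe");
  ("Berlin", "locatedIn", "Europe");
  ("Tokyo", "locatedIn", "Asia");
]

(* The intent of a concept: a conjunctive query with a distinguished node x,
   here encoded directly as a predicate. It expresses what Paris and Berlin
   have in common: "x is the capital of some country and x is located in
   Europe". *)
let intent (x : entity) : bool =
  List.exists (fun (_, r, o) -> r = "capital" && o = x) graph
  && List.mem (x, "locatedIn", "Europe") graph

(* The extent of the concept: the set of answers of the intent, i.e. all
   entities of the graph that satisfy the query. *)
let extent : entity list =
  graph
  |> List.concat_map (fun (s, _, o) -> [ s; o ])
  |> List.sort_uniq compare
  |> List.filter intent

let () =
  List.iter print_endline extent   (* prints: Berlin, then Paris *)

In the actual approach, such intents are not written by hand but discovered automatically by generalizing what other entities have in common with the query entity e.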

Technical details about the definition and computation of CNNs can be found in the following publications:

In this page, we consider the use of CNNs for the task of link prediction in knowledge graphs. The task consists in predicting the missing entity of a triple (head, relation, tail): either predicting the missing tail given the head and relation, or predicting the missing head given the tail and relation.
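As an illustration of the two prediction directions (the head direction being activated by the -inv flag described below), here is a small OCaml sketch of how a test triple yields one or two prediction tasks; it mirrors the evaluation protocol, not the actual cnn_expe code.

(* Sketch (not the cnn_expe code) of the prediction tasks derived from a
   test triple. *)

type entity = string
type relation = string
type triple = entity * relation * entity

(* A prediction task: the known part of the triple and the missing side. *)
type task =
  | Predict_tail of entity * relation   (* (head, relation, ?) *)
  | Predict_head of relation * entity   (* (?, relation, tail) *)

(* Heads are predicted in addition to tails only when inv is set,
   mirroring the -inv flag of the program. *)
let tasks_of_test_triple ~inv ((h, r, t) : triple) : task list =
  if inv then [ Predict_tail (h, r); Predict_head (r, t) ]
  else [ Predict_tail (h, r) ]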

Implementation and usage

The source code of our implementation is available as part of the BitBucket repository of SEWELIS. The main relevant code is found in branch dev, in the source files src/cnn.ml (computation of CNNs) and src/cnn_expe.ml (experiments based on CNNs). To make it easier to reuse the program on other datasets or with different parameters, we provide here an executable binary built on a Linux64 Fedora 25 system (it should work on any Linux64 distribution).

Usage of the program (run ./cnn_expe --help for more details):

./cnn_expe -train file.rdf [-valid file.rdf] -test file.ttl [-inv] [-maxprio integer] [-timeout float] [-v]
The train and valid RDF files are treated alike (there is no validation phase) and can be in various formats (RDF/XML, N-Triples, Turtle). The test file must be in Turtle format. The flag -inv asks to predict the heads of triples in addition to the tails. The -maxprio option specifies a maximum depth, which determines how far from the entity the knowledge graph is explored; it must be a positive integer (default: no limit). The timeout is in seconds (default: 1s) and is split in two between CNN computation and prediction. The flag -v enables verbose mode, which outputs explanations for the predictions.
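For example, a call using the options documented above might look as follows (the dataset file names are placeholders, and the parameter values are only indicative):

./cnn_expe -train train.rdf -valid valid.rdf -test test.ttl -inv -maxprio 3 -timeout 2.0 -v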

Datasets

We have experimented on five datasets: four classical link prediction benchmarks (WN18, WN18RR, FB15k, FB15k-237) and a new one (Mondial).

Results

The results are summarized in the following table, which provides the standard link prediction measures (Hits@1, Hits@3, Hits@10, MRR; see the sketch after the table for how they are computed from prediction ranks), as well as a link to the raw output log that provides additional details and explanations for the predictions. These results were obtained between 29 October 2019 and 29 December 2019.
dataset    depth  timeout(s)  Hits@1  Hits@3  Hits@10  MRR    output log
WN18       3      1+1         0.967   0.970   0.972    0.969  Download
WN18RR     3      1+1         0.444   0.481   0.519    0.469  Download
FB15k      3      1.5+0.5     0.827   0.861   0.890    0.849  Download
FB15k-237  3      1.5+0.5     0.222   0.322   0.446    0.296  Download
Mondial    3      1+1         0.409   0.511   0.645    0.485  Download
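As a reading aid, the measures in the table can be computed from the rank of the correct entity in each prediction, as in the following OCaml sketch (which is not the evaluation code of cnn_expe; the example ranks are made up):

(* Sketch of the standard link prediction measures, computed from the
   1-based rank of the correct entity in each prediction. *)

let hits_at k ranks =
  let hits = List.length (List.filter (fun r -> r <= k) ranks) in
  float_of_int hits /. float_of_int (List.length ranks)

let mrr ranks =
  List.fold_left (fun acc r -> acc +. 1. /. float_of_int r) 0. ranks
  /. float_of_int (List.length ranks)

let () =
  let ranks = [ 1; 3; 2; 15; 1 ] in   (* made-up example ranks *)
  Printf.printf "Hits@1=%.3f Hits@3=%.3f Hits@10=%.3f MRR=%.3f\n"
    (hits_at 1 ranks) (hits_at 3 ranks) (hits_at 10 ranks) (mrr ranks)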