Knowledge-driven dataset FAIRification: from workflow runs to domain-specific annotations

Thesis start date (if known)
October 2024
Location
IRISA Rennes + Institut du Thorax Nantes
Research unit
IRISA - UMR 6074
Description of the thesis subject

Context and problem statement

Intracranial aneurysm is a cerebral vascular anomaly affecting 3.2% of the French population. While its rupture can lead to death or severe disability, there is no diagnostic tool, and predicting stability or rupture remains a challenge in biomedical sciences. Investigating this pathology requires i) the acquisition of diverse datasets at various scales: genomic sequencing data, microscopy images of vascular tissues, neurovascular MRI images, and various clinical observations or measurements (e.g. familial history, life habits, hypertension, etc.), and ii) the development of modality-specific data analysis workflows. In the context of multidisciplinary and multi-site collaborations, there is an urgent need to produce trustworthy and reusable data analysis results at a limited data annotation cost.
These needs are not specific to intracranial aneurysm and have recently been identified in the context of machine learning (ML) workflows [1]. The resulting predictive models are tightly coupled to their training datasets. It is crucial to document the provenance of these models with metadata so that scientists can better find and reuse possibly pre-trained models, while being aware of their possible biases, the algorithms and optimization parameters used, and the evaluation of their predictive performance.
Currently, from a computer science perspective, we lack both i) a unified conceptual framework for representing domain-specific and technical annotations on data, analysis workflows, and scientific context, and ii) a reasoning framework for automating the inference of higher-level annotations that improve the reusability, the trustworthiness, or more generally the FAIRness of data analysis results.

 

Objective

This thesis aims to design a semi-automated dataset FAIRification method that will extend low-level metadata with higher-level descriptions inferred from the workflow specification and execution. These descriptions will provide a summary focusing on functional aspects, complementing the technical information provided by the metadata. This will be instrumental to workflow recommendation as well as to improved reusability of data analysis results.
To this end, we will leverage domain-specific knowledge associated with biomedical datasets, as well as fine-grained workflow execution provenance traces, so that data analysis results can be more easily understood, explained and shared, in line with open and reproducible science initiatives.

 

Approach
Multi-view metadata schema. The first contribution will consist in defining an adequate knowledge model by reusing, and possibly extending, reference ontologies for integrating the descriptions of biomedical data and their processing (specification and execution of tools and workflows). We will also benchmark alternative technical implementation solutions for the different aspects: RDF Named Graphs and RDF-Star for the architecture; ShEx and SHACL for integrity constraint validation.
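As a purely illustrative sketch of the kind of representation under consideration (all names and the `ex:` namespace are hypothetical), an RDF-Star statement-level annotation and a minimal SHACL shape over standard PROV terms might look like:

```turtle
@prefix ex:   <http://example.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# RDF-Star: annotate the provenance statement itself
<< ex:dataset42 prov:wasGeneratedBy ex:workflowRun7 >>
    ex:confidence  "0.95"^^xsd:decimal ;
    ex:annotatedBy ex:curatorA .

# SHACL: every dataset description must name its generating activity
ex:DatasetShape a sh:NodeShape ;
    sh:targetClass prov:Entity ;
    sh:property [
        sh:path prov:wasGeneratedBy ;
        sh:minCount 1 ;
    ] .
```

Whether such annotations live in named graphs or as RDF-Star triples is precisely one of the implementation alternatives to be benchmarked.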

Methods for enriching workflow run descriptions. The second contribution will consist in designing a method to document and generate high-level descriptions based on the integration of data, tool and workflow annotations. This may encompass multi-modal reasoning, ML-based link prediction methods, as well as interactions with large language models. Overall, we will compare reasoning strategies for pre-computing these descriptions versus generating them on the fly.

Experimental study. We will assess how the “human-oriented” and “machine-actionable” data analysis reports resulting from the second contribution can improve trustworthiness. This third contribution will consist in populating the case-study knowledge graph according to the model from the first contribution and enriching it with the methods from the second contribution. This will support a performance and usability study covering both data analysis and privacy.

 

Related works

The datasets harvested through provenance collection can be valuable for a number of tasks that go beyond interpreting the results. For them to be useful, they must be accompanied by semantic annotations that help users understand and exploit them. As manual annotation can be both tedious and time-consuming, means have been investigated for semi-automating this task (such as LabelFlow [2] and earlier works [3]) by exploiting the workflow description to propagate annotations among the datasets used and generated by the modules that compose the workflow. Such solutions are still limited: they assume that the processing modules are annotated, which is not always the case, and when annotations are provided, they may fail to identify concepts that are meaningful to life scientists.
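The propagation idea behind these approaches can be sketched in a few lines (a minimal illustration of the general principle, not the actual LabelFlow algorithm; the workflow, module names and annotation vocabulary are invented for the example):

```python
# Illustrative sketch of workflow-based annotation propagation: when a
# module carries a semantic annotation, its output datasets inherit it
# through the workflow description.

# Hypothetical workflow: module -> inputs, outputs, semantic annotations
WORKFLOW = {
    "variant_calling": {
        "inputs": ["reads.fastq"],
        "outputs": ["variants.vcf"],
        "annotations": {"produces": "GenomicVariant"},
    },
    "annotation_merge": {
        "inputs": ["variants.vcf", "clinical.csv"],
        "outputs": ["merged.tsv"],
        "annotations": {},  # unannotated module: nothing can be propagated
    },
}

def propagate_annotations(workflow):
    """Attach each module's 'produces' concept to its output datasets."""
    dataset_annotations = {}
    for module, spec in workflow.items():
        concept = spec["annotations"].get("produces")
        if concept is None:
            continue  # the limitation noted above: unannotated modules
        for out in spec["outputs"]:
            dataset_annotations.setdefault(out, set()).add(concept)
    return dataset_annotations

print(propagate_annotations(WORKFLOW))
# → {'variants.vcf': {'GenomicVariant'}}
```

The sketch makes the two limitations visible: `merged.tsv` receives no annotation because its module is unannotated, and the quality of what is propagated is entirely determined by the module-level annotations.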
Therefore, there is a need for new solutions that exploit other sources of information, in addition to the workflow description, to infer data annotations, which can take different forms (a concept, a term, a description) and accompany the datasets collected within the provenance traces. Several knowledge graph-based approaches follow this direction: [4], [5], OpenPREDICT [6], the Evidence Graph Ontology (EVI) [7] and FAIRSCAPE [8].

Working environment

The PhD will be co-supervised by Olivier Dameron (IRISA team DYLISS, Rennes) and Alban Gaignard (CNRS, Institut du Thorax, Nantes). The main location will be Rennes, with frequent work sessions in Nantes.
The PhD is part of the ShareFAIR project of the French “Digital health” Priority Research Program and Equipment plan (PEPR santé numérique). ShareFAIR gathers 9 partners in computer science, biology and bioinformatics, and develops innovative solutions for the annotation of biomedical and clinical datasets and the extraction of provenance. The PhD will involve collaborations with other partners, in particular Sarah Cohen (LISN) and Khalid Belhajjame (LAMSADE).

 

 

Bibliography

[1] Ian Walsh et al. DOME: recommendations for supervised machine learning validation in biology. Nature Methods, 18(10):1122–1127, October 2021.
[2] Pinar Alper et al. LabelFlow framework for annotating workflow provenance. Informatics, 5:11, 2018.
[3] Paolo Missier et al. Data lineage model for Taverna workflows with lightweight annotation requirements. In International Provenance and Annotation Workshop, 2008.
[4] Alban Gaignard et al. Findable and reusable workflow data products: A genomic workflow case study. Semantic Web, 11:751–763, 2020.
[5] Remzi Celebi et al. Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC bioinformatics, 20(1):726, 2019.
[6] Remzi Celebi et al. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Computer Science, 6, 2020.
[7] Sadnan Al Manir et al. Evidence graphs: Supporting transparent and FAIR computation, with defeasible reasoning on data, methods, and results. bioRxiv, 2021.
[8] Maxwell Adam Levinson et al. FAIRSCAPE: a framework for FAIR and reproducible biomedical analytics. Neuroinformatics, 20:187–202, 2022.

List of thesis supervisors

Last name, First name
DAMERON, Olivier
Type of supervision
Thesis director
Research unit
UMR 6074 IRISA
Team

Last name, First name
GAIGNARD, Alban
Type of supervision
Co-supervisor
Research unit
UMR 1087 Inst. Thorax Nantes
Contact(s)
Name
DAMERON, Olivier
Email
olivier.dameron@irisa.fr
Keywords
scientific workflows, FAIR