Miss2DA: Imputation of missing data in a domain adaptation context

Published on
Team
Thesis start date (if known)
Autumn 2024
Location
Rennes
Research unit
IRISA - UMR 6074
Description of the thesis topic

Context

AI methodologies typically depend on extensive datasets that may be tainted by noise or missing values, or that may be collected in heterogeneous yet related environments. Data with missing values are ubiquitous in many applications; they can result from equipment failure, incomplete information collection (e.g. clouds in remote sensing) or inadequate data entry, for instance. Nevertheless, conventional learning algorithms often assume that the data are complete and independent and identically distributed, that is to say, drawn randomly from a single distribution.

Data imputation aims at substituting missing data with plausible values [1], e.g. by filling them with the value of the nearest sample or by imputing with some relevant statistics. Imputation can have a high impact on the learning task at hand, leading to biased results or degraded performance [2, 3]. Most imputation methods rely on a missing completely at random or missing at random assumption [4], meaning that the missingness of the data is unrelated to any values, or that it can depend only on the observed values. More challenging scenarios deal with random block missing or blackout missing [5], in which blocks of information are missing and where the structure of block-wise missing data should be further taken into consideration.
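The two baseline strategies mentioned above (imputing with a statistic, or with the value of the nearest sample) can be sketched as follows. This is a minimal illustration with NumPy, not a method from the cited works: the function names are hypothetical, missing entries are assumed to be encoded as NaN, and the mask generator assumes the missing completely at random setting.

```python
import numpy as np

def mcar_mask(shape, missing_rate, rng):
    """Boolean mask (True = missing), drawn independently of the data values (MCAR)."""
    return rng.random(shape) < missing_rate

def impute_mean(X):
    """Replace each NaN by the mean of the observed values of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def impute_nearest(X):
    """Replace each NaN by the value taken by the nearest sample observing that feature.

    Distances are mean squared differences over the features observed in both samples.
    An entry stays NaN if no other sample can serve as a donor.
    """
    X_imp = X.copy()
    observed = ~np.isnan(X)
    n_samples = X.shape[0]
    for i in range(n_samples):
        for j in np.flatnonzero(~observed[i]):
            best, best_dist = None, np.inf
            for k in range(n_samples):
                if k == i or not observed[k, j]:
                    continue
                shared = observed[i] & observed[k]
                if not shared.any():
                    continue
                dist = np.mean((X[i, shared] - X[k, shared]) ** 2)
                if dist < best_dist:
                    best, best_dist = k, dist
            if best is not None:
                X_imp[i, j] = X[best, j]
    return X_imp

# Hypothetical toy data: the second sample misses its second feature.
X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 6.0]])
X_mean = impute_mean(X)        # fills the gap with the column mean
X_nearest = impute_nearest(X)  # fills the gap with the value of the closest sample
```

On this toy example the two strategies disagree: the column mean averages over both neighbours, whereas the nearest-sample rule copies the value of the first sample, which is much closer on the observed feature.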

On the other hand, the outcomes generated by AI play a crucial role in monitoring and comprehending environmental phenomena through the resolution of various tasks, including but not limited to:
- land cover and land use mapping, that can then be further used for urban planning, agriculture management, or identifying illegal land use activities for instance;
- crop yield prediction, in order to ensure food security, economic stability, and sustainable agricultural practices;
- wildlife conservation, in which wildlife habitats, migration patterns and population changes can be evaluated at a large scale;
- fisheries control, to identify, measure and check the aquatic resources that are harvested, aiming to protect over-exploited aquatic species.

In practice, the data are often collected on different yet related domains, offering the potential to enhance the generalization capability of the learning algorithm. For instance, in Earth observation, and especially for land cover mapping applications, the differences in weather, soil conditions or farmer practices between study sites are known to induce temporal shifts that can be corrected to enhance task performance. For predicting crop yield, the variability under changing climates and severe weather events has to be taken into account when using past data to predict the evolution of the yield.
Domain adaptation [6, 7] aims to transfer knowledge from one domain to another and has demonstrated significant enhancements in classification or clustering tasks when domain shifts are carefully managed.

 

Scientific objectives and expected achievements.

The aim of the PhD project is to devise a data imputation method within the context of domain adaptation. Existing approaches mostly tackle missing values within an inferential framework, wherein they are replaced with values derived from dataset statistics, relying on strong parametric assumptions. However, when a shift exists between the datasets, this strategy becomes inadequate. Instead, we propose to address imputation and learning tasks concurrently, introducing the additional complexity that the data may originate from different domains. The primary objectives of the PhD are as follows: i) theoretically analyze the impact of domain shifts on the learning task, akin to the framework established for domain adaptation in a classification context [7]; ii) introduce novel imputation schemes in heterogeneous environments by aligning distributions in a preliminary step and subsequently applying learning tasks (e.g., supervised learning); iii) propose an integrated framework for imputation and learning in heterogeneous environments. Special attention will be given to handling time series data, where the challenge is heightened due to the need to account for temporal correlations between observations. This will allow considering the blackout missing case as a special case of time series prediction.
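The last point, treating a blackout as a prediction problem, can be illustrated with a deliberately simple forecaster. The sketch below fits a linear autoregressive model by least squares on the values observed before the gap and rolls it forward over the missing block. The autoregressive model, the function names and the synthetic sinusoidal series are assumptions made for illustration only; they are not the approach targeted by the project.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares AR coefficients such that x_t ~ sum_k coef[k] * x_{t-1-k}."""
    rows = [series[t - order:t][::-1] for t in range(order, len(series))]
    A = np.array(rows)
    y = series[order:]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def fill_blackout(series, start, length, order=2):
    """Fill series[start:start + length] by forecasting from the values before the gap."""
    filled = series.copy()
    coef = fit_ar(series[:start], order)
    for t in range(start, start + length):
        # One-step-ahead prediction from the most recent values, fed back recursively.
        filled[t] = coef @ filled[t - order:t][::-1]
    return filled

# Hypothetical example: a sinusoid with a block of 20 consecutive missing values.
t = np.arange(200)
truth = np.sin(0.1 * t)
observed = truth.copy()
observed[150:170] = np.nan

filled = fill_blackout(observed, start=150, length=20)
```

A noise-free sinusoid satisfies an exact order-2 linear recurrence, so the gap is recovered almost perfectly here; real series with noise, trends or domain shifts would be much harder to fill.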

The initial research directions will explore optimal transport-based solutions, known for their success in imputing missing values [9] and aligning distributions in a domain adaptation context [10], especially when dealing with temporal data [11].
Additional avenues of investigation will be considered, such as domain-adversarial training [12] and other standard domain adaptation strategies.
From an application standpoint, a specific focus will be put on tackling environmental challenges using remote sensing data [13].
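As a rough illustration of how optimal transport can align two domains, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations and moves each source sample to the plan-weighted average of the target samples (a barycentric mapping). It is a generic, minimal NumPy sketch on synthetic data, not the algorithms of the cited works; the function names, the regularization value and the number of iterations are assumptions.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg, n_iter=500):
    """Entropy-regularized transport plan between weight vectors a and b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # rescale to match the target weights
        u = a / (K @ v)    # rescale to match the source weights
    return u[:, None] * K * v[None, :]

def barycentric_map(plan, X_target):
    """Send each source sample to the plan-weighted average of the target samples."""
    return (plan @ X_target) / plan.sum(axis=1, keepdims=True)

# Hypothetical domains: the target is a shifted version of the source distribution.
rng = np.random.default_rng(0)
X_source = rng.normal(size=(100, 2))
X_target = rng.normal(size=(120, 2)) + np.array([3.0, -2.0])

a = np.full(100, 1 / 100)  # uniform weights on source samples
b = np.full(120, 1 / 120)  # uniform weights on target samples
cost = ((X_source[:, None, :] - X_target[None, :, :]) ** 2).sum(axis=2)
cost /= cost.max()         # normalize the cost for numerical stability

plan = sinkhorn_plan(a, b, cost, reg=0.1)
X_source_aligned = barycentric_map(plan, X_target)
```

Once the source samples are mapped onto the target domain, imputation statistics or a learning model can be computed on the aligned data. Handling missing entries inside the cost itself is precisely one of the open questions of the project and is not addressed by this sketch.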

 

Bibliography

[1] A. R. T. Donders, G. J. Van Der Heijden, T. Stijnen, and K. G. Moons, “A gentle introduction to imputation of missing values,” Journal of Clinical Epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.

[2] T. Shadbahr, M. Roberts, J. Stanczuk, J. Gilbey, P. Teare, S. Dittmer, M. Thorpe, R. V. Torne, E. Sala, P. Lio et al., “Classification of datasets with imputed missing values: Does imputation quality matter?” arXiv preprint arXiv:2206.08478, 2022.

[3] Z. Zhang, X. Xiao, W. Zhou, D. Zhu, and C. I. Amos, “False positive findings during genome-wide association studies with imputation: influence of allele frequency and imputation accuracy,” Human Molecular Genetics, vol. 31, no. 1, pp. 146–155, 2022.

[4] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago, and O. Tabona, “A survey on missing data in machine learning,” Journal of Big Data, vol. 8, no. 1, pp. 1–37, 2021.

[5] F. Xue and A. Qu, “Integrating multisource block-wise missing data in model selection,” Journal of the American Statistical Association, vol. 116, no. 536, pp. 1914–1927, 2021.

[6] A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, “A brief review of domain adaptation,” in Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020, pp. 877–894, 2021.

[7] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani, Advances in Domain Adaptation Theory. Elsevier, 2019.

[8] ——, “A survey on domain adaptation theory: learning bounds and theoretical guarantees,” arXiv preprint arXiv:2004.11829, 2020.

[9] B. Muzellec, J. Josse, C. Boyer, and M. Cuturi, “Missing data imputation using optimal transport,” in International Conference on Machine Learning. PMLR, 2020, pp. 7130–7140.

[10] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy, “Joint distribution optimal transportation for domain adaptation,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[11] F. Painblanc, L. Chapel, N. Courty, C. Friguet, C. Pelletier, and R. Tavenard, “Match-and-deform: Time series domain adaptation through optimal transport and temporal alignment,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2023, pp. 341–356.

[12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.

[13] J. Nyborg, C. Pelletier, S. Lefèvre, and I. Assent, “Timematch: Unsupervised cross-region adaptation by temporal shift estimation,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 188, pp. 301–313, 2022.

Thesis supervisors

Last name, first name
CHAPEL, Laetitia
Type of supervision
Thesis director
Research unit
UMR 6074
Team

Last name, first name
Tavenard, Romain
Type of supervision
Second co-director (optional)
Research unit
LETG
Contacts
Name
CHAPEL, Laetitia
Email
laetitia.chapel@irisa.fr
Name
Tavenard, Romain
Email
romain.tavenard@univ-rennes2.fr
Keywords
Machine learning, unsupervised and supervised domain adaptation, missing data imputation, time series