Clustering aims to divide massive data without any prior information into groups with similar data structures or physical structures. To derive such a proposal, several different clustering strategies have been proposed, such as partition based methods, hierarchical based methods, distribution based methods, spectral based methods, den- sity based methods, grid based methods, machine learning based methods, and neural networks based methods. But most of the methods cannot characterize the uncertainty and imprecision between clusters at the same time. In recent years, evidential clustering (EC), based on the concept of credal partition, has received a lot of attention for its ability to deal with this problem. It inherits the advantages of the theory of be- lief functions (TBF) in reasoning about uncertain and imprecise knowledge. However, since EC is still in the development stage, there are still issues such as unclear under-lying concepts, high computational complexity, and inability todetect imbalanced or arbitrary clusters, limiting the applications of EC.
In this thesis, we work on proposing some alternative schemes to solve these issues. Specifically, four works are brought forward to handle them one by one. We start from a systematic study of EC. In this work, we present the concepts and definitions of data (inputs), methods (models), and clusters (results) of EC, as well as that of the other types (i.e. hard/fuzzy/possibilistic ones), based on different theories (i.e. probability theory, fuzzy sets theory, possibility theory, and the theory of belief functions). This is because EC is regarded as the evidential version of hard, fuzzy, and possibilistic clustering under the TBF. These concepts and definitions explain why EC can better characterize the uncertainty and imprecision between clusters. Moreover, we also study EC from the seminal to state-of-the-art methods in the context of data-based inputs, resulting in a coherent and comprehensive survey to analyze these methods. Based on the analysis of some representative techniques from different perspectives (e.g. center, measure type, complexity), we provide a guiding scheme to help users to choose appropriate evidential methods in their cases.
Afterward, we introduce a dynamic evidential clustering (DEC) algorithm for the issue of the high computational complexity of traditional EC when characterizing the uncertainty and imprecision between clusters. In DEC, most query objects are considered to have precise cluster information, so an FCM-like objective function is first employed and minimized to obtain the support levels of the real singletons (specific) clusters to which the query objects belong. Then the query object is initially adap- tively assigned to the outlier, precise or imprecise one via a new rule based on the conflicts between the different support levels. Each imprecise object is finally reassigned to the singleton clusters or related meta-cluster by partial credal redistribution with the corresponding dynamic edited framework to reduce the computational burden. The proposed DEC method can reduce the complexity to a level similar to that of fuzzy/possibilistic clustering, extending EC’s application in big data.
Following, we extend EC to detect imbalanced clusters by combining mean shift with traditional EC under the TBF, mainly containing two characteristics. First, the query object is preliminarily assigned as the noise, precise, or imprecise one based on the notion of “belief shift”. Second, partial credal redistribution with dynamic cluster centers, to avoid the “uniform effect” (imbalanced clusters), is established to reassign imprecise objects to the singleton cluster or related meta-cluster. Once an object is assigned to a meta-cluster, it indicates that the imbalanced singleton clusters involved in the meta-cluster cannot be distinguished because this object may be located in the overlapping or intermediate areas of these imbalanced singleton clusters. By doing this, the BSC can reasonably characterize the uncertainty and imprecision between imbalanced singleton clusters.
To avoid losing generality, we also investigate the representation of uncertainty and imprecision between clusters regardless of their shape, size, and dimensionality based on density peaks and the TBF. First, we consider that different neighbors can provide complementary evidence supporting the object as a cluster center and redefine a distance-based density function to obtain more robust cluster centers in the decision graph. Then, we present a new evidential convergence rule to assign the remaining objects to different clusters. Finally, similar to BSC, the objects located in the over- lapping or intermediate areas of different arbitrary singleton clusters are assigned to corresponding meta-clusters to characterize the uncertainty and imprecision between these arbitrary clusters.
The effectiveness of the proposed algorithms is estimated on different artificial and natural datasets. Experiments show that our proposed algorithms effectively improve the execution efficiency of traditional EC and detect imbalanced or arbitrary clusters, and characterizes the uncertainty and imprecision between these clusters.
- DENOEUX Thierry, Université Technologique de Compiègne
- VRAIN Christelle, Université d'Orléans
- HAN Deqiang, Xi’an Jiaotong University
- MARTIN Arnaud, Université de Rennes 1 IUT de Lannion , directeur de thèse
- ZHOU Kuang, Northwestern Polytechnical University
- LIU Zhunga, Northwestern Polytechnical University