Unfortunately, clustering very large collections of high dimensional points remains an open issue. Recently, however, two promising approaches have been proposed that may help solve this issue: the Random Forests (RF) approach, from the machine learning field, and the NV-Tree approach, coming from the related field of nearest neighbor retrieval.
Random Forests are known to be robust (supervised) classifiers achieving very good performance. Classifying a data collection with RF ask to build many decision trees, each tree being learned with randomly chosen subsets of the training data and attributes [Breiman01]. Overall, one data item to classify has to go down each trees in the forest. In turn, each tree votes for the class it believes the item belongs to. The forest eventually assigns to the item the class having the most votes over all trees.
Following the seminal work of Breiman03, RF can also be used to cluster data: the key idea is to consider the data to cluster as belonging to one class, to generate synthetic examples of a second class, and to learn RF for this 2-class problem. Then, a dissimilarity measure between the points from the data collection is built by studying their co-occurrences in the leaves of the trees. Finally, that measure is used to cluster data, grouping similar elements and separating dissimilar ones.
This measure is known to have interesting properties: it copes gracefully with mixed variable types, it is invariant to monotonic transformations of the input variables, it is robust to outlyers, it accommodates several strategies for dealing with missing data and easily deals with large number of variables due to its intrinsic random variable selection.
Coincidentally, we have designed a high dimensional index used in the context of content-based retrieval systems. That index, the NV-Tree, iteratively projects points from the data collection on random lines that subsequently get partitioned. Overall, the indexing strategy builds a tree where leaves tend to contain points that are likely to be quite close in the multidimensional data space [Lejsek09]. Since the index creation includes a lot of randomness, it is easy to create many different trees indexing the same collection of data. As for the RF, some hands on clustering seem possible by studying the co-occurrence of points in the leaves of the different NV-trees trees.
The goal of this internship is to:
Overall, the NV-Tree and the random forest are dual approaches: one separate things using lines (RF) while the other uses lines to group things (the NV-Tree). Clarifying which approach has more potential for subsequent clustering, and in which case, is the grand goal of that internship.
Different collections of experimental data are foreseen