Standard massive real and simulated datasets for large-scale machine learning [10, 11, 12] will serve as a testbed for the algorithms resulting from the thesis.
The thesis is expected to lead to the development and consolidation of robust know-how for parameterizing the proposed techniques, and to an empirical demonstration of their ability to provide privacy-aware, efficient learning.
Flexible sketching. The potential of sketched learning has already been demonstrated on GMM, clustering, and PCA. This opens new research opportunities, calling for revisiting machine learning tasks such as clustering, classification, regression, dictionary learning, spectral clustering, etc. A first objective of the thesis is to demonstrate this potential by proposing new sketching techniques for a selection of learning tasks relevant to CASA.
As a priority, we will extend sketched learning to address selected problems expressed in terms of matrix factorization (NMF, ICA, and sparse dictionary learning) and high-dimensional regression (GLLiM). An envisioned approach is to replace the shift-invariant random Fourier features (used for sketched GMM and sketched clustering) with random rectified linear unit (ReLU) features, a nonlinearity popularized by deep neural networks (DNNs), since it leads to the needed rotation invariance. This is expected to consolidate existing connections between sketching and DNNs. A more ambitious goal will be to propose new multiscale "hierarchical" sketching techniques able to efficiently address multi-layer inference in Bayesian networks.
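To make the contrast concrete, the two feature maps can be compared side by side. The sketch below is a minimal illustration, not the thesis's implementation: it averages random Fourier features (complex exponentials of Gaussian projections) or random ReLU features over a dataset; the dimensions, the helper names `sketch_rff`/`sketch_relu`, and the random bias term are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_rff(X, W):
    """Sketch via random Fourier features: empirical average of exp(i w^T x)."""
    return np.exp(1j * (X @ W.T)).mean(axis=0)   # complex vector of length m

def sketch_relu(X, W, b):
    """Sketch via random ReLU features: empirical average of max(0, w^T x + b)."""
    return np.maximum(0.0, X @ W.T + b).mean(axis=0)  # real vector of length m

d, m, n = 5, 64, 1000                  # ambient dim, sketch size, collection size
X = rng.standard_normal((n, d))        # toy data collection
W = rng.standard_normal((m, d))        # random projection directions
b = rng.standard_normal(m)             # random biases for the ReLU features

s_rff = sketch_rff(X, W)
s_relu = sketch_relu(X, W, b)
```

In both cases the sketch is a fixed-size vector of generalized moments; only the nonlinearity applied to the random projections changes.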
Scalable learning. Sketching is a highly promising avenue to design new scalable learning algorithms: once the sketch is computed, the cost of learning no longer depends on the size of the collection, leading to immediate computational benefits in the large-scale setting (both in terms of memory and computational resources). Besides, computing the sketch can intrinsically be performed in a distributed or online fashion.
To fully benefit from the scalability of sketched learning, we ideally target an implementation of the "learn from sketch" stage with cost (sub)linear in the complexity of the task. Yet, current implementations based on random Gaussian projection matrices still mainly scale quadratically with the ambient dimension. To address this computational bottleneck, an envisioned approach will combine: (i) the design of sketching mechanisms involving randomized fast transforms [8, 9] rather than random Gaussian matrices, which is expected to speed up both the sketching stage and the learning stage; (ii) the implementation of efficient aggregation mechanisms for the sketching stage through the use of modern infrastructures for distributed computing (Hadoop, Spark, ...); (iii) the analysis of the computational bottlenecks of the optimization algorithms used at the learning stage, to improve their resource-efficiency and scalability, e.g., using stochastic gradient methods.
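As an example of the kind of randomized fast transform involved, a subsampled randomized Hadamard transform replaces the dense Gaussian projection: one multiplication by a dense m x d matrix costs O(md) per item, whereas random signs followed by a fast Walsh-Hadamard transform cost O(d log d). The sketch below is a minimal illustration under stated assumptions (d a power of 2, uniform coordinate subsampling); it is not the specific construction of [8, 9].

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of 2."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

def srht(x, signs, idx):
    """Structured random projection: subsample m coordinates of H @ (D @ x)."""
    y = fwht(signs * x) / np.sqrt(len(x))  # random signs, then fast transform
    return y[idx]

rng = np.random.default_rng(2)
d, m = 8, 4
signs = rng.choice([-1.0, 1.0], size=d)           # random diagonal D
idx = rng.choice(d, size=m, replace=False)        # random subsampling
x = rng.standard_normal(d)
y = srht(x, signs, idx)   # m-dimensional projection without storing an m x d matrix
```

The projection matrix is never materialized: only the sign pattern and the subsampled indices are stored, which also reduces memory at the sketching stage.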
Privacy preservation has been identified as a strong side-effect of sketched learning, since sketches consist of aggregated information from all items of the training collection / stream, in stark contrast with traditional learning mechanisms, where a potentially adversarial learner is given full access to each item of the training collection. We expect to demonstrate empirically that sketched learning remains efficient when combined with standard privacy mechanisms such as the addition of "Laplacian noise" to the sketches of the individual elements.
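A schematic version of such a mechanism can be sketched as follows: the per-item features are bounded, so perturbing the aggregated sketch with Laplace noise masks any single item's contribution. This is an illustrative sketch, not a calibrated differential-privacy mechanism: the sensitivity bound used below is a crude worst-case estimate, and the function name `noisy_sketch` and the cosine feature map are assumptions.

```python
import numpy as np

def noisy_sketch(X, W, b, epsilon, rng):
    """Average-of-features sketch with Laplace noise (schematic privacy mechanism)."""
    phi = np.cos(X @ W.T + b)      # per-item features, each bounded in [-1, 1]
    m = W.shape[0]
    # Crude L1 sensitivity bound on the summed sketch: replacing one item
    # changes each of the m coordinates by at most 2.
    sensitivity = 2.0 * m
    noise = rng.laplace(scale=sensitivity / epsilon, size=m)
    return (phi.sum(axis=0) + noise) / len(X)

rng = np.random.default_rng(3)
d, m, n = 4, 32, 5000
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
b = rng.uniform(0, 2 * np.pi, size=m)

s_priv = noisy_sketch(X, W, b, epsilon=1.0, rng=rng)
```

Note that the noise is added once to the aggregate and its relative impact shrinks as the collection grows (the noise term is divided by n), which is the intuition behind the expected compatibility of sketching with privacy mechanisms.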