Historical Pandemic Musing on Big Data Summarization with Novel Applications.

IRISA Rennes
Room Petri-Turing

Amr El Abbadi

Department of Computer Science

University of California at Santa Barbara

Santa Barbara, CA 93106


During the past two decades we have seen an unprecedented increase in the amount of data generated by internet-scale applications. As hundreds of millions to billions of users interact with these applications, the internet companies hosting them collect a continuous flow of interaction or log data. Before this data can be modeled and analyzed, it is often necessary to obtain summary statistics such as the number of unique visitors, frequency counts of users from different states or countries, and, more generally, the quantiles and median of the dataset. Efficient algorithms exist for computing these statistics exactly. Unfortunately, these algorithms either require a considerable amount of time, scanning the data multiple times, or require additional storage that is linear in the size of the dataset itself. Approximation methods with guaranteed error bounds, developed in the context of streaming data, are extremely effective at extracting useful and relatively accurate knowledge from big data. In this talk, we will review the recent, and not so recent, advances in big data summarization. The main objective of this tutorial-style talk is to demonstrate the strong relationship between the mathematics of big data and the management of big data. We will also present some of our recent results and show how some of these approaches have diverse applications, even in system design.
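
As a rough illustration of the kind of streaming approximation with a guaranteed error bound that the abstract refers to, the sketch below implements the classic Misra-Gries frequency summary. It is not taken from the talk: the function name `misra_gries`, the `capacity` parameter, and the sample stream of country codes are all assumptions chosen for illustration. With `capacity` counters and a stream of n items, each reported count underestimates the true count by at most n / (capacity + 1), using a single pass and constant extra storage.

```python
from collections import Counter


def misra_gries(stream, capacity):
    """One-pass approximate frequency counts using at most `capacity` counters.

    Each returned count c satisfies: true - n / (capacity + 1) <= c <= true,
    where n is the stream length. Items absent from the result have estimate 0.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical log of visitor country codes (illustrative data only).
stream = ["US"] * 50 + ["FR"] * 30 + ["DE", "JP", "BR", "IN"] * 5
capacity = 3
summary = misra_gries(stream, capacity)
exact = Counter(stream)
bound = len(stream) / (capacity + 1)

for country in exact:
    estimate = summary.get(country, 0)
    print(country, "exact:", exact[country], "estimate:", estimate)
```

Any item whose true frequency exceeds the bound is guaranteed to survive in the summary, which is why heavy hitters such as the most common countries can be found without storing a counter per distinct value.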