Profiling and Visualizing Android Malware Datasets

Defense type: Thesis
Location: Other
Room: téléTD room (next to the library), CentraleSupélec, Rennes campus, avenue de la Boulais, 35510 Cesson-Sévigné
Speaker: Tomas CONCEPCION MIRANDA / CIDRE

Mr. Tomas CONCEPCION MIRANDA, of the CIDRE team, will publicly defend his thesis, entitled:

"Profiling and Visualizing Android Malware Datasets"

Supervised by Mr. Jean-François LALANDE

Defense scheduled for Tuesday, November 29, 2022 at 8:00 a.m.

Location: CentraleSupélec, Rennes campus, avenue de la Boulais, 35510 Cesson-Sévigné

Room: téléTD (next to the library)

Partial video conference: https://youtu.be/6tfetOwdrNM

Keywords: Malware, Datasets, Bias, Visualization

Abstract: Mobile devices are ubiquitous: nowadays most people own a mobile phone. This makes these devices a target of interest for attackers. Researchers in malware analysis work to recognize malicious programs before they are installed on a user's device. To do so, they run experiments on automatic malware detection, for example with machine learning, using sets of already known malware and goodware. Depending on the choice of datasets, the evaluation of these experiments can yield acceptable results, or outstanding but overestimated results. Consequently, datasets of malware and benign samples are important elements to consider when designing an experiment.

This thesis first presents a method to evaluate the quality of datasets, based on a statistical test that compares a crafted dataset against a large set of applications, such as an app market. We show that historical datasets from the literature are of low quality, which justifies the need to create new, up-to-date datasets. Second, we introduce an algorithm that updates low-quality mixed malware/goodware datasets so that they resemble a target dataset that cannot be used directly, e.g. a market. We evaluate the updated mixed datasets using a machine learning algorithm and show that detecting malware in our up-to-date dataset becomes a more difficult problem to solve. Lastly, we introduce DaViz, a dataset visualization tool for exploring and comparing Android malware datasets, which enables researchers to visualize the biases in datasets from the literature and to obtain useful information from them.

Composition of the jury
- HUDELOT Céline, CentraleSupélec
- ROCA Vincent, Inria
- LALANDE Jean-François, CentraleSupélec
- VIET TRIEM TONG Valérie, CentraleSupélec
- KLEIN Jacques, Université du Luxembourg
- CAVALLARO Lorenzo, University College London