DRUID: Declarative & Reliable management of Uncertain, user-generated Interlinked Data
Interest in data management methods has grown in recent years. Statistical machine learning techniques, empowered by pay-as-you-go distributed computing power, are able to extract useful information from certain data. The international press, specialized or not, has hailed these remarkable results as a new spring for Artificial Intelligence in a broad sense. Data is sometimes even referred to as the “gold of the 21st century”. In every area of business and science, organizations try to build huge datasets in order to profit from the Artificial Intelligence revolution.
However, when datasets contain personal data, their collection and usage may lead to undesirable practices. In particular, interest in privacy is growing, mirroring the still-growing interest in analytics over personal data. Machine learning and privacy can indeed be seen as two sides of the same coin: machine learning tries to extract relevant information from data, while privacy tends to blur information in order to hide identifying or sensitive individual information. In addition to the protection of the personal data fed into machine learning algorithms, guaranteed by privacy models and privacy-preserving algorithms, the fairness of the output is critical for mitigating discrimination in automatic or “semi-automatic” high-stakes decisions about individuals (e.g. law, social rights, policing).
Unfortunately, these two needs, seamless machine learning and privacy, are not yet supported elegantly in the data management dogma. For example, machine learning operators are currently treated as external procedures outside the query language, barely accounted for by the optimizer. Moreover, knowledge extraction tasks are hard to design without understanding the available data, so knowledge extraction should be considered an interactive process in which users influence the process. Privacy-preserving algorithms often make extensive use of cryptography, incurring prohibitive costs at the data volumes typical of data management use cases. Additionally, the choice of a privacy model and of its parameters, among a large number of possible models, is barely understandable for non-expert database administrators. Finally, privacy and fairness are usually considered separately, without analyzing their mutual impacts.
These observations lay the groundwork for the goals of the DRUID team:
- Propose mechanisms to better integrate Machine Learning methods with the database logic and engines
- Propose interactive, human-in-the-loop data analysis and knowledge extraction methods even with uncertain data
- Make privacy-preserving techniques meet real-life constraints within data-centered systems, with a special focus on performance and intelligibility
- Design data-centered systems that are both private and fair