
Universal speech synthesis through embeddings of massive heterogeneous data

Team and supervisors

Thesis director: Laurent Amsaleg
Co-supervisors: Gwénolé Lecorvé, Damien Lolive

Contacts (name, phone):
Laurent Amsaleg: 02 99 84 74 44
Gwénolé Lecorvé: 02 96 46 90 64
Damien Lolive: 02 96 46 91 65
Thesis subject

The principle of unit selection speech synthesis is to concatenate segments of actual speech so that they match the input utterance as closely as possible [1, 2]. To do so, the system relies on a database of speech segments which are described with information about their pronunciation (phonemes, articulatory traits...), position (in the syllable, in the word...), prosody (fundamental frequency, speaking rate...), or even more abstract levels of the language (morphosyntax, semantics, emotions...). Given an analogous description of the user query, the speech synthesis engine must then be able to find the most relevant speech segments accurately and quickly.
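To make the mechanism concrete, here is a toy sketch of unit selection. The descriptor set, cost functions, and weights are all hypothetical illustrations, not those of the actual engine: each candidate segment is scored with a target cost against the requested description, plus a join cost measuring how smoothly it concatenates with the previously chosen segment.

```python
# Toy unit-selection sketch (hypothetical descriptors and costs).
from dataclasses import dataclass

@dataclass
class Segment:
    phoneme: str        # pronunciation descriptor
    syllable_pos: int   # position in the syllable
    f0: float           # prosody: mean fundamental frequency (Hz)

def target_cost(cand: Segment, target: Segment) -> float:
    # Penalize mismatching symbolic descriptors, plus prosodic distance.
    cost = 0.0
    if cand.phoneme != target.phoneme:
        cost += 10.0                      # hard mismatch
    cost += abs(cand.syllable_pos - target.syllable_pos)
    cost += abs(cand.f0 - target.f0) / 50.0
    return cost

def join_cost(prev: Segment, cand: Segment) -> float:
    # Smoothness of the concatenation: here, only F0 continuity.
    return abs(prev.f0 - cand.f0) / 50.0

def select(targets, database):
    """Greedy left-to-right selection (real systems use a Viterbi-style search)."""
    chosen = []
    for t in targets:
        prev = chosen[-1] if chosen else None
        best = min(database,
                   key=lambda c: target_cost(c, t)
                   + (join_cost(prev, c) if prev else 0.0))
        chosen.append(best)
    return chosen

db = [Segment("a", 0, 120.0), Segment("a", 1, 180.0), Segment("b", 0, 130.0)]
targets = [Segment("a", 0, 125.0), Segment("b", 0, 128.0)]
print([s.phoneme for s in select(targets, db)])  # -> ['a', 'b']
```

In a real system the database holds hours of annotated speech, the costs combine dozens of descriptors, and the search is done over a lattice of candidates rather than greedily.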


However, no natural distance yet exists to compare vectors of these segment descriptors, i.e. there is no reliable computational way to determine which segments are perceptually close and which are not. To sidestep this problem, current systems are restricted to very specific conditions: a given language, a single speaker, and a predetermined interaction context. As a consequence, these systems rely on small speech databases (less than 10 hours) with a reduced set of descriptors (around ten), as well as on linguistic expertise and engineering tricks.
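A toy illustration of the distance problem, with assumed descriptor values: symbolic codes and continuous values live on incomparable scales, so a naive Euclidean distance is dominated by whichever dimension happens to have the largest numeric range, regardless of perceptual relevance.

```python
# Why raw descriptor vectors have no natural distance (toy example).
import math

# Hypothetical descriptor vectors: (phoneme id, syllable position, mean F0 in Hz)
seg_a = (3, 0, 120.0)   # phoneme id 3, syllable-initial, low pitch
seg_b = (3, 0, 190.0)   # same phoneme and position, higher pitch
seg_c = (7, 0, 121.0)   # different phoneme, almost identical pitch

def euclid(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# The F0 axis (tens of Hz) dwarfs the phoneme axis (arbitrary ids):
print(euclid(seg_a, seg_b))  # 70.0 -> rated "far", despite identical phonemes
print(euclid(seg_a, seg_c))  # ~4.12 -> rated "near", despite differing phonemes
```

Rescaling the axes does not solve the deeper issue: phoneme ids are categorical, so any numeric gap between them is arbitrary to begin with.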


The objective of this PhD thesis is to remove these restrictions in order to make speech synthesis systems as universal as possible. This will consist in proposing and studying methods that transform the classical descriptors into new high-dimensional continuous representations on which standard vector space distances can be applied. The work will draw on recent advances in deep neural networks for multimedia and natural language processing [3, 4, 5, 6], and on modern indexing techniques for very large multidimensional databases [7, 8]. During the thesis, the developed methods will be integrated into the team's speech synthesis engine [2] and tested on massive heterogeneous speech corpora.
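The two ingredients can be sketched together in a minimal pipeline. Here a fixed random projection stands in for a learned neural embedding, and random-hyperplane locality-sensitive hashing (in the spirit of [7]) indexes the embeddings for fast approximate nearest-neighbor lookup; all dimensions, bit counts, and the embedding itself are assumptions for illustration.

```python
# Sketch: embed descriptor vectors, then index them with random-hyperplane LSH.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

D_IN, D_EMB, N_BITS = 20, 8, 6
W = rng.normal(size=(D_IN, D_EMB))         # stand-in for a trained embedder
H = rng.normal(size=(D_EMB, N_BITS))       # LSH hyperplanes

def embed(x):
    return np.tanh(x @ W)                  # continuous representation

def lsh_key(e):
    return tuple((e @ H > 0).astype(int))  # sign pattern = hash bucket

# Index a synthetic "database" of segment descriptor vectors.
database = rng.normal(size=(10_000, D_IN))
index = defaultdict(list)
for i, x in enumerate(database):
    index[lsh_key(embed(x))].append(i)

# Query: only candidates hashed to the same bucket are scored exactly,
# so the search touches a small fraction of the database.
query = database[42]
q = embed(query)
cands = index[lsh_key(q)]
best = min(cands, key=lambda i: np.linalg.norm(embed(database[i]) - q))
print(best)  # -> 42
```

With 10,000 vectors spread over at most 64 buckets, each query compares against roughly 1–2% of the database; real systems use multiple hash tables to trade recall against speed [7, 8].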


Given the current research focus in the speech community, this proposal is particularly timely, and its results will have a major impact in terms of visibility. By allowing synthetic speech to integrate much greater variability, the proposed solutions will enable new applications for speech synthesis in domains such as advertising, virtual reality, and robotics.


[1] Jacob Benesty, M. Mohan Sondhi and Yiteng Huang. Springer Handbook of Speech Processing. Springer Science & Business Media, 2007.


[2] Pierre Alain, Jonathan Chevelu, David Guennec, Gwénolé Lecorvé, Damien Lolive. The IRISA Text-To-Speech System for the Blizzard Challenge 2015. In Proceedings of the Blizzard Challenge 2015 Workshop, 2015.


[3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Neural Information Processing Systems Conference, 2012.


[4] Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.


[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.


[6] Leonardo Badino. Phonetic Context Embeddings for DNN-HMM Phone Recognition. In Proceedings of Interspeech, 2016.


[7] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), 2006.


[8] Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11), 2014.

Start of work:
October 1, 2017
Keywords:
Natural language processing, Speech synthesis, Deep neural networks, Embeddings, Indexing
IRISA - Campus de Lannion, 6, rue de Kerampont, Lannion