PhD: Probabilistic object-based representation of audio signals, applied to high-level music description and classification
Supervisors
Emmanuel Vincent emmanuel.vincent@irisa.fr
Frédéric Bimbot frederic.bimbot@irisa.fr
Description of the project
Audio signals usually consist of several sound sources (speakers, musical instruments, natural sounds), which convey a large amount of information (e.g. speaker identity, musical genre, recording environment). This information is often not available and must be automatically estimated from the signals based on relevant features. It can then be used to generate textual or visual descriptions, answer classification queries or retrieve signals by similarity.
Existing information retrieval algorithms for audio signals mostly rely on low-level features, such as Mel-Frequency Cepstral Coefficients (MFCCs), which model all sound sources as a whole and exploit short-term dependencies only [1]. These algorithms have been shown experimentally to reach performance ceilings on a range of classification tasks [2]. Representing the signals in terms of their constituent sound objects appears necessary to derive higher-level features and achieve better classification results, similar to those obtained from symbolic audio data [3]. However, "ideal" object-based representations consisting of speech phonemes and musical notes remain very difficult to estimate robustly, due in particular to the variety of possible sources and the masking of low-energy sources by higher-energy ones.
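To make this limitation concrete, here is a minimal sketch of a bag-of-frames, song-level feature in the spirit of [1]: the mean and covariance of the frame-wise feature vectors. The function name song_level_features and the random stand-in data are assumptions made for illustration; real MFCCs would be computed from audio beforehand. Shuffling the frames leaves the feature unchanged, which shows that such a summary ignores temporal structure and cannot separate the contributions of individual sources.

```python
import numpy as np

def song_level_features(frames):
    """Summarize a (n_frames, n_coeffs) array by its mean and covariance."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    # The covariance is symmetric, so keep only its upper triangle.
    return np.concatenate([mean, cov[np.triu_indices_from(cov)]])

rng = np.random.default_rng(0)
# Hypothetical stand-in for 500 frames of 13 MFCCs (not real audio data).
frames = rng.normal(size=(500, 13))
# Same frames with their temporal order destroyed.
shuffled = rng.permutation(frames)

features = song_level_features(frames)
features_shuffled = song_level_features(shuffled)
print(features.shape, np.allclose(features, features_shuffled))
```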
The goal of this PhD project is to propose alternative object-based representations of audio signals that are applicable to a wide range of sources and lead to robust high-level features. One approach could consist of recasting existing sparse models of the source short-term spectra [4,5] into a probabilistic framework. This would allow the use of probabilistic priors on the model parameters, thus making the parameter estimates more robust to masking phenomena and providing an estimate of their variance (or uncertainty). Variance values could then be incorporated into the definition of high-level features.
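As one possible illustration of such a probabilistic recasting, the sketch below estimates the activations H of a nonnegative model V ≈ WH by maximum a posteriori (MAP) estimation. It assumes a Poisson observation model (equivalent to minimizing the generalized Kullback-Leibler divergence) and an exponential prior on H, which acts as a sparsity penalty. The spectral templates W are assumed known and fixed to keep the example short; learning W as well would require an additional normalization constraint. The function names, the toy data, the sparsity weight and the crude diagonal Laplace approximation of the variance are all assumptions made for illustration, not the method of [4,5] or of this project.

```python
import numpy as np

def neg_log_posterior(V, W, H, sparsity, eps=1e-12):
    """Generalized KL divergence plus exponential-prior penalty (up to constants)."""
    V_hat = W @ H
    kl = np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)
    return float(kl + sparsity * H.sum())

def map_activations(V, W, sparsity=0.05, n_iter=500, eps=1e-12):
    """Multiplicative updates for the MAP activations with W held fixed."""
    H = np.ones((W.shape[1], V.shape[1]))
    history = []
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        # The prior weight enters the denominator and shrinks H towards zero.
        H *= (W.T @ ratio) / (W.sum(axis=0)[:, None] + sparsity)
        history.append(neg_log_posterior(V, W, H, sparsity))
    return H, history

def laplace_variance(V, W, H, eps=1e-12):
    """Crude variance estimate: inverse of the diagonal posterior curvature.

    Off-diagonal terms are ignored, so the values tend to be optimistic.
    """
    V_hat = W @ H + eps
    curvature = (W.T ** 2) @ (V / V_hat ** 2)
    return 1.0 / (curvature + eps)

# Toy data (hypothetical): two overlapping spectral templates, four frames.
W = np.array([[1.0, 0.2], [0.6, 0.1], [0.3, 0.9], [0.1, 0.7]])
H_true = np.array([[2.0, 0.0, 1.0, 0.0], [0.0, 3.0, 1.0, 0.5]])
V = W @ H_true

H, history = map_activations(V, W)
H_var = laplace_variance(V, W, H)
print(np.round(H, 2))
print(np.round(H_var, 2))
```

The variance array has the same shape as H, so each activation comes with its own uncertainty, which a downstream high-level feature could use as a weight.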
The proposed representations will be primarily applied to the description and classification of musical audio within large databases, using high-level features inspired by symbolic musicological features [3]. Popular classification tasks such as genre and singer identification will be considered, as well as more advanced tasks including composer or multiple instrument identification. Depending on the background of the applicant, additional tasks such as temporal decomposition of speech or analysis of natural sound scenes will also be considered.
References
[1] M.I. Mandel and D.P.W. Ellis, "Song-level features and SVMs for music classification," in Proc. Int. Conf. on Music Information Retrieval, pp. 594-599, 2005.
[2] J.-J. Aucouturier, B. Defreville and F. Pachet, "The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music," to appear in Journal of the Acoustical Society of America, 2008.
[3] C. McKay, "Automatic genre classification of MIDI recordings," M.A. Thesis, McGill University, Canada, 2004.
[4] T. Virtanen, "Unsupervised learning methods for source separation," in "Signal Processing Methods for Music Transcription", eds. A. Klapuri, M. Davy, Springer-Verlag, 2006.
[5] E. Vincent, N. Bertin and R. Badeau, "Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription," to appear in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2008.
Candidate profile
Prospective candidates should have a background in pattern recognition, machine learning, applied statistics or signal processing. Additional expertise in audio and music is welcome. Proficiency in Matlab or C programming would be an asset.