Alexey Ozerov
Institut TELECOM; TELECOM ParisTech; CNRS LTCI - Signal and Image Processing Department
alexey.ozerov@telecom-paristech.fr

A. Ozerov (Stereo GMM)

ALGORITHM:

We use a hybrid approach based on Gaussian Mixture Models (GMMs), in which each stereo source image is modeled by one of the following two model types:

- "directional GMM": the source image is modeled as an instantaneous point source image, with the source itself modeled by a GMM, as in [1],
- "non-directional GMM": the left and right channel Short Time Fourier Transforms (STFTs) of the stereo source image, each of size say [F x N], are concatenated to form a single stereo STFT of size [(2*F) x N]. This stereo STFT is then modeled by a GMM, as described in [2]. In other words, in contrast to the "directional GMM" case, it is assumed that, conditionally on the GMM state, there is no correlation between the left and right channel STFTs.

This approach is applied in the following setting:

1. Models
   a) for "Tamy - Que pena tanto faz":
      1. "vocals" are modeled by an 8-state "directional GMM"
      2. "guitar" is modeled by an 8-state "non-directional GMM"
   b) for "Bearlin - Roads" (for this song we always consider a two-source separation problem, where a desired source is separated from its background, i.e., "everything_else - source"):
      1. "bass" is modeled by an 8-state "directional GMM"
         "bass_background" is modeled by an 8-state "non-directional GMM"
      2. "vocals" are modeled by an 8-state "directional GMM"
         "vocals_background" is modeled by an 8-state "non-directional GMM"
      3. "piano" is modeled by an 8-state "non-directional GMM"
         "piano_background" is modeled by an 8-state "non-directional GMM"
2. All GMMs (and the directions, in the case of "directional GMMs") are learned from the development data using the standard Expectation-Maximization (EM) algorithm (see, e.g., [2]).
3. Given the GMMs and directions, the sources are recovered via Wiener filtering, as described in [1].

COMPUTATIONAL TIME

Our Matlab implementation, running on a 2.2 GHz CPU, takes
a. 120 seconds for "Tamy - Que pena tanto faz"
b. 390 seconds for "Bearlin - Roads"

REFERENCES:

[1] S. Arberet, A. Ozerov, R. Gribonval, F. Bimbot, "Blind Spectral-GMM Estimation for Underdetermined Instantaneous Audio Source Separation", ICA'09, 2009, (submitted).
[2] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single channel source separation and its application to voice / music separation in popular songs," IEEE Trans. on Audio, Speech and Lang. Proc., special issue on Blind Signal Proc. for Speech and Audio Applications, vol. 15, no. 5, pp. 1564-1578, July 2007.
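
ILLUSTRATIVE SKETCH (not part of the submitted Matlab implementation):

The channel stacking used by the "non-directional GMM" and the Wiener filtering step of the ALGORITHM section can be sketched as follows, in Python, for a simplified case. The sketch assumes that both sources in a two-source problem are described by zero-mean complex Gaussian mixtures with diagonal covariances over the stacked [(2*F) x N] STFT, which matches the stated assumption of no left/right correlation given the GMM state. The "directional GMM" case of [1] is not covered. All function names and the (weights, variances) layout are hypothetical choices made for this illustration, and the GMM parameters are taken as given rather than learned by EM.

```python
import math

def stack_channels(left, right):
    """Concatenate left and right STFTs (lists of F rows, N columns each)
    into one stacked STFT with 2*F rows and N columns."""
    return left + right

def frame(stft, n):
    """Return column n of an STFT as a flat list (one time frame)."""
    return [row[n] for row in stft]

def wiener_separate_frame(x, gmm_target, gmm_other):
    """Posterior-weighted Wiener estimate of the target source for one frame.

    x          : list of complex mixture coefficients (one stacked frame)
    gmm_target : (weights, variances), variances[k][f] > 0 for state k, bin f
    gmm_other  : same layout, for the competing source (assumed format)
    """
    w1, v1 = gmm_target
    w2, v2 = gmm_other

    # Log-posterior (up to a constant) of each state pair, given the frame.
    # The mixture is zero-mean complex Gaussian with variance v1 + v2 per bin.
    log_post = {}
    for k1 in range(len(w1)):
        for k2 in range(len(w2)):
            ll = math.log(w1[k1]) + math.log(w2[k2])
            for f, xf in enumerate(x):
                v = v1[k1][f] + v2[k2][f]
                ll -= math.log(math.pi * v) + abs(xf) ** 2 / v
            log_post[(k1, k2)] = ll

    # Normalize in a numerically stable way (subtract the maximum first).
    m = max(log_post.values())
    post = {pair: math.exp(ll - m) for pair, ll in log_post.items()}
    z = sum(post.values())

    # Average the per-state-pair Wiener gains, weighted by the posterior.
    estimate = [0j] * len(x)
    for (k1, k2), p in post.items():
        for f, xf in enumerate(x):
            gain = v1[k1][f] / (v1[k1][f] + v2[k2][f])
            estimate[f] += (p / z) * gain * xf
    return estimate
```

In this sketch, calling wiener_separate_frame with the two GMMs swapped yields the estimate of the other source, and the two estimates sum to the mixture frame, because the per-bin gains of the two sources add up to one for every state pair.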