Alexey
Ozerov
Institut TELECOM; TELECOM ParisTech; CNRS LTCI - Signal and Image Processing Department
alexey.ozerov@telecom-paristech.fr
A. Ozerov (Stereo GMM)

ALGORITHM:

We are using a hybrid approach with Gaussian Mixture Models (GMMs), where
each stereo source image is modeled by one of the following two model types:
   - "directional GMM": the source image is modeled as an instantaneous point source
     image, with the source itself modeled by a GMM, as in [1],
   - "non-directional GMM": the left and right Short Time Fourier Transforms (STFTs)
     of the stereo source image, each of size, say, [F x N], are concatenated to form
     a single [(2*F) x N] stereo STFT. This stereo STFT is then modeled by a GMM, as
     described in [2]. In other words, in contrast to the "directional GMM" case, it is
     assumed that, conditionally on the GMM state, there is no correlation between the
     left and right channel STFTs.
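
The two source image constructions above can be illustrated with a minimal sketch.
It assumes that an STFT is stored as a list of F rows of N complex values and that a
direction is a pair of real channel gains; the function names point_source_image and
stack_stereo_stft and the toy values are hypothetical and are not taken from the
Matlab implementation.

```python
def point_source_image(source_stft, direction):
    """Stereo image of an instantaneous point source: each channel is the
    source STFT scaled by a real gain (direction = (gain_left, gain_right))."""
    gain_left, gain_right = direction
    left = [[gain_left * s for s in row] for row in source_stft]
    right = [[gain_right * s for s in row] for row in source_stft]
    return left, right


def stack_stereo_stft(left, right):
    """Concatenate two [F x N] STFTs along frequency into one [(2*F) x N] STFT."""
    if len(left) != len(right) or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("left and right STFTs must have the same shape")
    return [list(row) for row in left] + [list(row) for row in right]


# Toy source STFT with F = 2 frequency bins and N = 3 frames (made-up values).
source_stft = [[1 + 1j, 2 + 0j, 0 + 1j],
               [0 + 0j, 1 - 1j, 3 + 0j]]
left, right = point_source_image(source_stft, direction=(0.6, 0.8))
stacked = stack_stereo_stft(left, right)  # 4 rows of 3 values each
```

In the "directional" case only the source STFT would be modeled by a GMM, whereas in
the "non-directional" case the GMM would be fitted to the columns of the stacked STFT.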

This approach is applied in the following setting:

1. Models
     a) for "Tamy - Que pena tanto faz":
         1. "vocals" are modeled by an 8-state "directional GMM"
         2. "guitar" is modeled by an 8-state "non-directional GMM"
     b) for "Bearlin - Roads" (for this song we always consider a two-source
        separation problem, where a desired source is separated from its background, i.e.,
        "everything_else - source")
         1. "bass" is modeled by an 8-state "directional GMM"
            "bass_background" is modeled by an 8-state "non-directional GMM"
         2. "vocals" are modeled by an 8-state "directional GMM"
            "vocals_background" is modeled by an 8-state "non-directional GMM"
         3. "piano" is modeled by an 8-state "non-directional GMM"
            "piano_background" is modeled by an 8-state "non-directional GMM"

2. All GMMs (and directions in the case of "directional GMMs") are learned from the development data
   using the standard Expectation-Maximization (EM) algorithm (see, e.g., [2]).

3. Given GMMs and directions, the sources are recovered via Wiener filtering, as described in [1].
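
Steps 2 and 3 can be sketched in simplified form as follows. This is not the Matlab
implementation: it assumes that each GMM state is a zero-mean complex Gaussian with a
diagonal covariance, it omits the learning of directions, and it replaces the filter
of [1] with a plain single-channel, per-bin Wiener gain. The function names
em_zero_mean_gmm and wiener_separate and all toy values are hypothetical.

```python
import math


def em_zero_mean_gmm(frames, init_variances, init_weights, n_iter=20, floor=1e-8):
    """EM for a GMM whose states are zero-mean complex Gaussians with
    diagonal covariances (an assumption of this sketch).

    frames:         N frames, each a list of D complex STFT coefficients
    init_variances: K lists of D positive floats (one spectrum per state)
    init_weights:   K positive floats summing to 1
    """
    power = [[abs(x) ** 2 for x in frame] for frame in frames]
    n_frames, dim, n_states = len(power), len(power[0]), len(init_weights)
    weights = list(init_weights)
    variances = [list(v) for v in init_variances]
    for _ in range(n_iter):
        # E-step: posterior probability of each state given each frame,
        # computed in the log domain for numerical stability.
        posteriors = []
        for p in power:
            log_joint = [
                math.log(weights[k])
                - sum(math.log(math.pi * variances[k][d]) + p[d] / variances[k][d]
                      for d in range(dim))
                for k in range(n_states)
            ]
            top = max(log_joint)
            unnorm = [math.exp(v - top) for v in log_joint]
            total = sum(unnorm)
            posteriors.append([u / total for u in unnorm])
        # M-step: weights are average posteriors; variances are
        # posterior-weighted averages of the observed power.
        for k in range(n_states):
            occupancy = max(sum(post[k] for post in posteriors), floor)
            weights[k] = occupancy / n_frames
            variances[k] = [
                max(sum(post[k] * p[d] for post, p in zip(posteriors, power)) / occupancy,
                    floor)
                for d in range(dim)
            ]
    return weights, variances


def wiener_separate(mixture, var_source, var_background, eps=1e-12):
    """Split mixture STFT coefficients into source and background estimates
    using the per-bin Wiener gain var_source / (var_source + var_background)."""
    source = [vs / (vs + vb + eps) * x
              for x, vs, vb in zip(mixture, var_source, var_background)]
    background = [x - s for x, s in zip(mixture, source)]
    return source, background


# Toy training data (made up): two quiet frames and two loud frames, D = 2 bins.
frames = [[0.1 + 0.1j, 0.1j], [0.1, -0.1], [3 + 3j, 4j], [-4, 3 - 3j]]
weights, variances = em_zero_mean_gmm(
    frames, init_variances=[[0.5, 0.5], [5.0, 5.0]], init_weights=[0.5, 0.5])

# Toy separation of one mixture frame with hand-picked variances.
mixture = [2 + 2j, 1 - 1j]
source, background = wiener_separate(mixture, [4.0, 1.0], [1.0, 1.0])
```

In the setting described above, the variances passed to the Wiener step would come
from the learned GMM states of the source and of its background rather than from
hand-picked values, and the two estimates always sum back to the mixture.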

COMPUTATIONAL TIME:

Our Matlab implementation, run on a 2.2 GHz CPU, takes
  a. 120 seconds for "Tamy - Que pena tanto faz"
  b. 390 seconds for "Bearlin - Roads"

REFERENCES: 

[1] S. Arberet, A. Ozerov, R. Gribonval, and F. Bimbot, "Blind Spectral-GMM Estimation for Underdetermined Instantaneous Audio Source Separation," ICA'09, 2009 (submitted).

[2] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single channel source separation and its application to voice / music separation in popular songs," IEEE Trans. on Audio, Speech and Lang. Proc., special issue on Blind Signal Proc. for Speech and Audio Applications, vol. 15, no. 5, pp. 1564-1578, July 2007.