We are witnessing a massive deployment of microphones in our daily lives and in a number of emerging technologies such as smart homes and humanoid robots. These systems underpin exciting applications such as human-machine interaction in natural language or automatic auditory scene understanding. This multiplication of sensors raises important scientific questions: How can we process signals recorded “out of the lab”, i.e., in unconstrained environments rather than in controlled settings? How can we efficiently exploit signals recorded by many microphones?
Classical methods rely strongly on a good knowledge of the geometry of the audio scene, i.e., the positions of the sources and sensors, and how sound propagates between them. A good geometrical model is easily obtained when the microphone configuration is perfectly known and sound propagates in a straight line from each source to each sensor. The task becomes much more difficult in realistic scenarios, where the environment may be unknown, cluttered, dynamic, and include multiple sources, diffuse sounds, noise and/or reverberation.
Recently, two interesting directions have emerged and have been investigated in our team. The first one is physics-driven. This approach explicitly solves the wave propagation equation in a simplified yet realistic environment, assuming that only a few sound sources are present, in order to recover the positions of sources and sensors, or even some of the wall absorption properties. Encouraging results were obtained in simulated settings, including “hearing behind walls”. However, these methods rely on approximate models and on partial knowledge of the system (e.g., room dimensions), which has so far limited their real-world applicability.
The second direction is data-driven. It uses machine learning to bypass the physical model altogether, directly estimating a mapping from acoustic features to source positions from training data recorded in a real room [3, 7]. These methods can in principle work in arbitrarily complex environments, but they require carefully annotated training datasets. Since obtaining such data is time-consuming, these methods usually work well for one specific room and setup, and generalize poorly in practice. While massive annotated datasets exist for speech recognition, this is not yet the case for lower-level auditory scene analysis tasks.
Building on these recent ideas, the candidate will explore the novel concept of virtual acoustic space learning [4,5], taking the best of both the physics-driven and data-driven worlds. The central question of this thesis is: “Can a ‘virtual agent’ gather training data from simulated audio scenes, learn models from them, and use these models to recover geometrical information from real-world recordings?” The candidate will develop, implement and test methods trained on simulated data corresponding to different source and sensor positions within rooms of different shapes, sizes and properties. The initial task will be sound source localization in simple rooms, but the possibility of generating millions of simulated audio scenes will make it possible to tackle more challenging and less studied problems. These may include the estimation of source distances, orientations, trajectories, diffusivity, number of sources, reverberation level, or even the presence of interfering walls or objects. Real-world test platforms may include the humanoid robot NAO, equipped with 4 microphones and the ability to walk and speak.
The methodological grounds of the thesis will be articulated in three work packages addressed incrementally and in parallel.
A. Audio Scene Geometry Learning. To learn a mapping from high-dimensional acoustic features to low-dimensional scene properties, the candidate will start by building on a general probabilistic regression model developed in the team, referred to as Gaussian Locally-Linear Mapping (GLLiM). This framework has already been used successfully for virtually-supervised and real-world binaural source localization [3,4,5]. Extensions of the learning method will be required to scale it to larger input/output dimensions, larger datasets and spherical spaces. Recent ideas in manifold-based sound source localization will also be investigated, in order to adapt virtually-learned models to partially-annotated real-world data. Dynamic scenes involving moving sources, moving sensors or environmental changes may be studied as well, using tracking, model selection or classification.
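To fix ideas, the locally-linear inverse-regression principle behind GLLiM can be illustrated with a minimal sketch. The code below is not the GLLiM algorithm itself: it uses scikit-learn's generic GaussianMixture fitted on the joint (position, feature) variable as a stand-in for GLLiM's dedicated EM algorithm, and the toy data and all names are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in data: a 1-D "source position" x and a nonlinear,
# higher-dimensional "acoustic feature" vector y = f(x) + noise.
n, d_y = 2000, 8
x = rng.uniform(-1.0, 1.0, size=(n, 1))
freqs = np.arange(1, d_y + 1)
y = np.sin(x * freqs) + 0.05 * rng.standard_normal((n, d_y))

# Fit a Gaussian mixture on the JOINT variable z = [x, y]: each
# component induces one locally-affine mapping between x and y.
K = 10
gmm = GaussianMixture(n_components=K, covariance_type="full",
                      random_state=0).fit(np.hstack([x, y]))

def predict_position(y_new, d_x=1):
    """Posterior mean of x given y: a weighted sum of K affine predictions."""
    log_w, preds = [], []
    for k in range(K):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:d_x], mu[d_x:]
        S_xy, S_yy = S[:d_x, d_x:], S[d_x:, d_x:]
        # Conditional mean E[x | y, component k] (affine in y)
        preds.append(mu_x + S_xy @ np.linalg.solve(S_yy, y_new - mu_y))
        # Responsibility of component k given the observed y alone
        log_w.append(np.log(gmm.weights_[k]) +
                     multivariate_normal.logpdf(y_new, mu_y, S_yy))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return float(w @ np.array(preds).ravel() / w.sum())

x_est = predict_position(np.sin(0.3 * freqs))   # noiseless query at x = 0.3
print(f"estimated position: {x_est:.2f}")
```

The inverse (feature-to-position) direction is what makes such mixtures attractive here: the forward physics can be arbitrarily nonlinear, yet each mixture component only has to capture it locally.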
B. Extensive Evaluation on Simulated and Real Data. The estimation of audio scene geometry will be tackled incrementally, with increasing difficulty, first on simulated and then on real data. The candidate will benefit from existing software for audio scene simulation. The first task will be to localize a sound source in a simple room under varying microphone and wall configurations. The ability to perfectly control simulated environments will make it possible to move progressively to more and more challenging tasks, and ultimately to apply the framework to real-world audio scene analysis with the humanoid robot NAO (4 microphones).
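The simplest version of this first task can be prototyped in a few lines. The sketch below is purely illustrative: it assumes an anechoic (free-field) scene rather than the team's room simulator, synthesizes two microphone signals from a known source position, and recovers the direction of arrival with the classical GCC-PHAT estimator.

```python
import numpy as np

fs, c = 16000, 343.0        # sampling rate (Hz), speed of sound (m/s)
rng = np.random.default_rng(1)

# Two microphones 20 cm apart on the x-axis; one source 2 m away at 60 deg.
mics = np.array([[-0.1, 0.0], [0.1, 0.0]])
src = 2.0 * np.array([np.cos(np.deg2rad(60.0)), np.sin(np.deg2rad(60.0))])
sig = rng.standard_normal(4096)            # white-noise source signal

def delayed(signal, tau):
    """Delay a signal by tau seconds with an FFT-domain phase shift."""
    f = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.fft.irfft(np.fft.rfft(signal) * np.exp(-2j * np.pi * f * tau),
                        n=len(signal))

# Free-field propagation: each mic receives a delayed copy of the source.
obs = [delayed(sig, np.linalg.norm(src - m) / c) for m in mics]

# GCC-PHAT: estimate the inter-microphone time difference of arrival.
X = np.fft.rfft(obs[0]) * np.conj(np.fft.rfft(obs[1]))
cc = np.fft.irfft(X / np.abs(X), n=len(sig))
lag = int(np.argmax(cc))
if lag > len(sig) // 2:                    # map circular lag to [-N/2, N/2)
    lag -= len(sig)
tdoa = lag / fs

# Far-field geometry: tdoa = d * cos(theta) / c for mic spacing d.
d = np.linalg.norm(mics[1] - mics[0])
theta_est = np.degrees(np.arccos(np.clip(c * tdoa / d, -1.0, 1.0)))
print(f"true DOA = 60.0 deg, estimated DOA = {theta_est:.1f} deg")
```

Adding walls (image sources), noise and more microphones to such a pipeline, and replacing the hand-crafted estimator with a learned one, is precisely the progression this work package envisions.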
C. Designing Relevant Acoustic Features. An important question will be which audio features to use. The features most widely used in binaural sound source localization, namely inter-microphone phase and level differences, will serve as a starting point. It has recently been suggested that more robust features can be built when more sensors and sources are present [8,9]. Designing appropriate features for other properties, such as diffuseness or room reverberation, remains an open question. The possibility of learning such features with deep neural networks, or of designing them from appropriate physical models, will be explored.
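As an illustration of this starting point, the sketch below (toy signals, illustrative function names, minimal STFT) computes per-frequency-bin level and phase differences between two channels; a pure inter-channel delay should yield near-zero level differences and a phase difference that grows linearly with frequency.

```python
import numpy as np

fs = 16000  # sampling rate (Hz)

def stft(x, win=512, hop=256):
    """Minimal complex STFT with a Hann window: (frames, freq bins)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def binaural_features(left, right, eps=1e-12):
    """Per-bin inter-microphone level (dB) and phase differences."""
    L, R = stft(left), stft(right)
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))   # wrapped to (-pi, pi]
    return ild, ipd

# Toy check: a 500 Hz tone delayed by 4 samples in the right channel.
t = np.arange(fs) / fs
left = np.sin(2 * np.pi * 500 * t)
right = np.roll(left, 4)
ild, ipd = binaural_features(left, right)
# At the 500 Hz bin (index 500 * 512 / fs = 16), the expected IPD is
# 2 * pi * 500 * (4 / fs) = pi / 4 radians.
print(np.median(ipd[:, 16]))
```

Stacking such per-bin differences across frequencies (and, with more microphones, across sensor pairs) yields exactly the kind of high-dimensional feature vector that the learning methods of work package A take as input.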