Journal articles

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi 1, 2 Xavier Alameda-Pineda 1, 3
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we address unsupervised speech enhancement, in which the noise type is unknown, using latent variable generative models. We propose to learn a generative model of clean speech spectrograms based on a variational autoencoder (VAE), where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network in which audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, which provides a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
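The idea of inferring the latent variables through a mixture of an audio network and a visual network can be illustrated with a toy sketch. This is not the authors' implementation: the two encoder functions are hypothetical linear stand-ins for trained networks, the mixing weight `pi_audio` and all function names are assumptions introduced here, and the latent dimension is simply set equal to the input dimension for brevity. The sketch only shows how a sample could be drawn from a two-component Gaussian mixture posterior, and how a visual-only estimate could serve as the initial latent value at enhancement time.

```python
import math
import random


def audio_encoder(s):
    # Hypothetical stand-in for a trained audio inference network.
    # Returns the mean and log-variance of a diagonal Gaussian over z.
    mean = [0.5 * x for x in s]
    log_var = [-1.0 for _ in s]
    return mean, log_var


def visual_encoder(v):
    # Hypothetical stand-in for a trained visual inference network.
    mean = [0.1 * x for x in v]
    log_var = [0.0 for _ in v]
    return mean, log_var


def sample_gaussian(mean, log_var, rng):
    # Reparameterized draw: z = mean + std * eps, with eps ~ N(0, 1).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def sample_mixture_posterior(s, v, pi_audio, rng):
    # Pick the audio component with probability pi_audio, otherwise the
    # visual component, then sample z from the selected Gaussian.
    use_audio = rng.random() < pi_audio
    mean, log_var = audio_encoder(s) if use_audio else visual_encoder(v)
    return sample_gaussian(mean, log_var, rng), use_audio


def init_latent_from_visual(v):
    # Initialization from visual data only: use the visual posterior mean,
    # which does not depend on the noisy audio.
    mean, _ = visual_encoder(v)
    return mean
```

For example, `sample_mixture_posterior([1.0, 2.0], [3.0, 4.0], 0.5, random.Random(0))` returns a latent sample together with a flag saying which component produced it, while `init_latent_from_visual([3.0, 4.0])` gives a starting point that is unaffected by acoustic noise.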

Contributor : Xavier Alameda-Pineda
Submitted on : Wednesday, January 26, 2022 - 11:41:26 AM
Last modification on : Wednesday, May 4, 2022 - 11:58:03 AM





Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172v2⟩


