Author
Michelsanti D.
PhD Thesis
PhD Series - Technical Faculty of IT and Design, Aalborg University
Abstract
Speech communication is often challenged by several sources of disturbance surrounding a speaker and a listener involved in a conversation. One example is a cocktail party, in which a listener is immersed in an acoustically noisy environment, generally consisting of a target speaker, competing speakers, reverberation, and background noise. In this situation, two phenomena usually occur. On the one hand, the target speaker clearly changes their way of speaking to keep their speech intelligible, a tendency known as the Lombard effect. On the other hand, the listener focuses their auditory attention on the speech of interest while filtering out the other sounds, a phenomenon called the cocktail party effect.
When the background noise level is sufficiently high and/or the listener is hearing impaired, the two aforementioned mechanisms do not guarantee effective communication. The listener may benefit from hearing aids, devices that, besides amplification, perform speech enhancement, which consists of extracting the speech of interest from a given degraded speech signal. Speech enhancement is traditionally addressed with techniques that consider only acoustic signals. However, important information can be extracted from the lip movements and facial expressions of the target speaker, which are reliable cues even in the presence of high levels of background noise. Therefore, speech enhancement systems that use both acoustic and visual information are able to outperform audio-only approaches.
In this thesis, we study the problem of audio-visual speech enhancement based on deep learning using one microphone and one camera. In particular, we propose a new taxonomy and perform an experimental analysis of training targets and objective functions for audio-visual speech enhancement systems based on deep learning. Furthermore, we investigate the impact of the Lombard effect on a deep-learning-based speech enhancement approach from both an acoustic and a visual perspective. Additionally, we propose a new algorithm to reconstruct the speech of interest from the silent video of the target speaker. Finally, we provide a systematic survey of audio-visual speech datasets, evaluation methods, and audio-visual speech enhancement systems.