Robust speech recognition on intelligent mobile devices with dual-microphone

López Espejo, Iván

Supervisors:

Antonio Miguel Peinado Herreros (Supervisor)
Ángel Manuel Gómez García (Co-supervisor)

University of defense: Universidad de Granada

Date of defense: 22 September 2017

Examination committee:

José Luis Pérez Córdoba (Chair)
Isaac Manuel Álvarez Ruiz (Secretary)
P. Vera-Candeas (Member)
Ning Ma (Member)
Alberto Abad Gareta (Member)

Type: Thesis

Teseo: 507961 DIALNET DIGIBUG editor

Abstract

Despite the outstanding progress made in automatic speech recognition (ASR) over the last decades, noise-robust ASR remains a challenge. Dealing with acoustic noise in ASR systems is more important than ever for two reasons: 1) ASR technology has begun to be extensively integrated into intelligent mobile devices (IMDs) such as smartphones to easily accomplish different tasks (e.g. search-by-voice), and 2) IMDs can be used anywhere at any time, that is, under many different acoustic (noisy) conditions. At the same time, with the aim of enhancing noisy speech, IMDs have begun to embed small microphone arrays, i.e. arrays comprising a few sensors close to each other. These multi-sensor IMDs often embed one microphone (usually at the rear) intended to capture the acoustic environment rather than the speaker’s voice. This is the so-called secondary microphone. While classical microphone array processing (also known as beamforming) may be used for noise-robust ASR, the literature reports that its performance is quite limited when only a few closely spaced sensors are available, one of them being a secondary microphone. As a result, the main goal of this Thesis is to explore a new series of dual-channel algorithms that exploit a secondary sensor to improve ASR accuracy on IMDs used in everyday noisy environments. First, three dual-channel power spectrum enhancement methods are developed to circumvent the limitations of related single-channel feature enhancement methods when applied to such a dual-microphone set-up. These proposals are referred to as DCSS (Dual-Channel Spectral Subtraction), P-MVDR (Power-Minimum Variance Distortionless Response) and DSW (Dual-channel Spectral Weighting, based on Wiener filtering). In particular, DSW starts from a simple formulation which assumes that the secondary microphone captures only noise and that the noise field is homogeneous.
Since it is known that neither assumption is accurate, the Wiener filter (WF)-based weighting is modified through 1) a bias correction term (to rectify the resulting spectral weights when a non-negligible speech component is present at the secondary channel), and 2) a noise equalization (inspired by MVDR beamforming) applied to the secondary channel before the spectral weights are computed. All of these techniques require knowledge of the relative speech gain (RSG), which relates the clean speech power spectra at the two channels. To obtain the RSG, a two-channel minimum mean square error (MMSE)-based estimator is also developed in this Thesis. In addition, the vector Taylor series (VTS) approach for noise-robust ASR has been widely and successfully applied over the last two decades. VTS feature compensation is therefore extended to a dual-channel framework, in a similar fashion to the aforementioned power spectrum enhancement methods. The central element of this dual-channel VTS method is the stacked formulation. From this, an MMSE-based estimator of the log-Mel clean speech features, which relies on a VTS expansion of a dual-channel speech distortion model, is developed. The superiority of our dual-channel approach over the single-channel one is also shown in this Thesis. Finally, two dual-channel deep learning-based contributions are presented to address two tasks of a noise-robust ASR system that are complex from an analytical point of view. These tasks are missing-data mask estimation and noise estimation, which are tackled by taking advantage of the powerful modeling capabilities of deep neural networks (DNNs). More specifically, these DNNs exploit the power level difference (PLD) between the two available channels to efficiently obtain the corresponding estimates with good generalization ability.
While missing-data masks and noise estimates can be employed in various ways for noise-robust ASR, in this Thesis they are applied to spectral reconstruction and feature compensation, respectively. It should be highlighted that our contributions broadly showed outstanding performance at low signal-to-noise ratios (SNRs), which makes them promising techniques for highly noisy environments such as those where IMDs might be used.
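
As an illustration only, the simple starting formulation mentioned for DSW can be sketched in a few lines. If the secondary microphone captured only noise and the noise field were homogeneous, the secondary power spectrum could stand in for the noise power at the primary channel, and a Wiener-like weight per frequency bin would be 1 minus the ratio of secondary to primary power. The function name, the `floor` parameter and the clipping step are assumptions made for this sketch, not taken from the Thesis; the bias correction, noise equalization and RSG estimation described above are not included.

```python
import numpy as np

def simple_dual_channel_weights(p1, p2, floor=1e-2):
    """Wiener-like spectral weights from two noisy power spectra.

    p1: noisy power spectrum at the primary microphone.
    p2: noisy power spectrum at the secondary microphone.
    Assumption (simple formulation only): p2 contains only noise and the
    noise field is homogeneous, so p2 approximates the noise power in p1.
    floor: hypothetical lower bound on the weights, chosen for this sketch.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Wiener gain S / (S + N) rewritten as 1 - N / P1, with N replaced by p2.
    weights = 1.0 - p2 / np.maximum(p1, 1e-12)
    # Keep the weights in [floor, 1]; bins where p2 >= p1 fall to the floor.
    return np.clip(weights, floor, 1.0)

# Toy example with three frequency bins.
primary = np.array([4.0, 1.0, 2.0])
secondary = np.array([1.0, 1.0, 4.0])
print(simple_dual_channel_weights(primary, secondary))
```

Bins in which the secondary channel carries as much power as the primary one, or more, are driven to the floor. When speech also leaks into the secondary channel, this sketch overestimates the noise, which is the situation the bias correction term is meant to handle.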