Sparse imputation for noise robust speech recognition using soft masks

2009-Sparse imputation for noise robust speech recognition using soft masks

Using soft masks in the sparse imputation approach yields a substantial increase in recognition accuracy, most notably at low SNRs.

The general idea behind missing data techniques (MDT) is that it is possible to estimate, prior to decoding, which spectro-temporal elements of the acoustic representation are reliable (i.e. dominated by speech) and which are unreliable (i.e. dominated by background noise). These reliability estimates, referred to as a spectrographic mask, can then be used to treat reliable and unreliable features differently, for instance by replacing the unreliable features with clean speech estimates (i.e. missing data imputation).

The speech representation is a power spectrogram. In experiments with artificially added noise, the so-called oracle mask can be computed directly using

M(k,t) = 1 if S(k,t)/N(k,t) > \theta, and M(k,t) = 0 otherwise


where S(k,t) is the speech spectrogram for frequency band k at time frame t, and N(k,t) is the noise spectrogram. If log-spectral features are used, reliable noisy speech coefficients can be used directly as estimates of the clean speech features, since log(|S(k,t)+N(k,t)|) = log(|S(k,t)(1+N(k,t)/S(k,t))|) = log(|S(k,t)|) + log(|1+N(k,t)/S(k,t)|) ≈ log(|S(k,t)|) when speech dominates the noise, i.e. when N(k,t)/S(k,t) is small. Other mask estimation techniques can be found in [Cerisara, Christophe, Sébastien Demange, and Jean-Paul Haton. “On noise masking for automatic missing data speech recognition: A survey and discussion.” Computer Speech & Language 21.3 (2007): 443-457.]
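A minimal sketch of the oracle mask and of the log-domain approximation may help here. It assumes that a cell counts as reliable when S(k,t)/N(k,t) exceeds \theta, with \theta given in dB through 20log_{10}(\theta). The function names `oracle_mask` and `log_feature_error` are hypothetical, and the spectrograms are plain nested lists rather than anything taken from the paper.

```python
import math

def oracle_mask(speech, noise, threshold_db=-3.0):
    """Hard oracle mask: 1 where speech dominates noise, else 0.

    speech, noise: spectrograms as lists of rows (one row per band k,
    one column per frame t). Assumption: threshold_db = 20*log10(theta).
    """
    theta = 10.0 ** (threshold_db / 20.0)
    # Compare s > theta * n instead of s / n > theta to avoid dividing by zero.
    return [[1 if s > theta * n else 0 for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech, noise)]

def log_feature_error(s, n):
    """Error made by using log(s + n) as an estimate of log(s)."""
    return math.log(s + n) - math.log(s)

speech = [[10.0, 1.0], [0.5, 4.0]]
noise = [[1.0, 10.0], [1.0, 4.0]]
mask = oracle_mask(speech, noise)
```

For a reliable cell the error term log(1 + N/S) is small, which is why the noisy log-spectral coefficient can stand in for the clean one; for example `log_feature_error(100.0, 1.0)` equals log(1.01).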

The authors represented both the speech and the mask using sparse vectors. This resembles network models that reconstruct their inputs from hidden factors, provided those factors are also sparse. The two approaches are closely related, and the network models are a more integrated way of doing the same thing.
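To make the role of the mask in a sparse reconstruction concrete, here is a rough sketch under stated assumptions. It weights both the observation and a small dictionary of clean exemplars by the mask, finds a sparse combination with a few steps of plain matching pursuit (a greedy solver swapped in for simplicity, not necessarily what the paper uses), and then keeps reliable features while filling unreliable ones from the reconstruction. The names `impute` and `atoms` and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def impute(y, mask, atoms, n_iter=2):
    """Mask-weighted sparse imputation sketch (matching pursuit).

    y: noisy feature vector; mask: reliabilities in [0, 1];
    atoms: list of clean exemplar vectors (hypothetical dictionary).
    """
    # Weight observation and atoms by the mask so unreliable features
    # contribute little (or nothing) to the fit.
    residual = [m * v for m, v in zip(mask, y)]
    w_atoms = [[m * v for m, v in zip(mask, atom)] for atom in atoms]
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        best, best_score = None, 0.0
        for j, wa in enumerate(w_atoms):
            norm2 = dot(wa, wa)
            if norm2 == 0.0:
                continue
            score = abs(dot(residual, wa)) / math.sqrt(norm2)
            if score > best_score:
                best, best_score = j, score
        if best is None:  # residual already explained
            break
        wa = w_atoms[best]
        step = dot(residual, wa) / dot(wa, wa)
        coeffs[best] += step
        residual = [r - step * a for r, a in zip(residual, wa)]
    # Reconstruct with the unweighted atoms, then blend by reliability.
    recon = [sum(c * atom[i] for c, atom in zip(coeffs, atoms))
             for i in range(len(y))]
    return [m * v + (1.0 - m) * r for m, v, r in zip(mask, y, recon)]

atoms = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
noisy = [1.0, 2.0, 9.0, 9.0]      # last two features corrupted
restored = impute(noisy, [1.0, 1.0, 0.0, 0.0], atoms)
```

With a hard mask the unreliable features are ignored during the fit; with a soft mask the same code lets partially reliable features contribute in proportion to their mask value.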

As an alternative to hard (binary) masking decisions, soft masks can be used. The simplest soft mask applies a sigmoid function to the hard formulation above, so that each mask value lies between 0 and 1 instead of being exactly 0 or 1.
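One plausible form of such a sigmoid soft mask is sketched below. The exact parameterisation used in the paper is not given in this post, so the `slope` parameter, the use of the local SNR in dB as the sigmoid input, and the function name `soft_mask` are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_mask(speech, noise, threshold_db=-3.0, slope=1.0):
    """Soft mask in [0, 1]: sigmoid of (local SNR in dB - threshold in dB).

    Assumption: SNR in dB is taken as 20*log10(S/N), matching the
    convention 20*log10(theta) used for the hard threshold.
    """
    mask = []
    for s_row, n_row in zip(speech, noise):
        row = []
        for s, n in zip(s_row, n_row):
            if n <= 0.0:
                row.append(1.0)      # no noise: treat as fully reliable
            elif s <= 0.0:
                row.append(0.0)      # no speech: treat as fully unreliable
            else:
                snr_db = 20.0 * math.log10(s / n)
                row.append(sigmoid(slope * (snr_db - threshold_db)))
        mask.append(row)
    return mask

theta = 10.0 ** (-3.0 / 20.0)
m = soft_mask([[theta, 100.0, 0.01]], [[1.0, 1.0, 1.0]])
```

A cell whose ratio sits exactly at the threshold gets a mask value of 0.5, while cells far above or below it approach the hard decisions 1 and 0. A larger `slope` makes the soft mask behave more like the hard one.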


The authors converted the acoustic feature representations to a time-normalized representation (a fixed number of acoustic feature frames) using spline interpolation. They chose 20log_{10}(\theta) = -3 dB for the oracle mask.
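The idea of time normalization can be illustrated with a short sketch. Note that it uses linear interpolation as a simpler stand-in for the spline interpolation used by the authors, so it shows the resampling to a fixed frame count rather than their exact procedure. The function name `time_normalize` and the toy input are hypothetical.

```python
def time_normalize(frames, n_out):
    """Resample a sequence of feature frames to exactly n_out frames.

    frames: list of frames, each a list of floats (one value per band).
    Stand-in: linear interpolation along time, not a spline.
    """
    n_in = len(frames)
    if n_in == 0 or n_out <= 0:
        raise ValueError("need at least one input and one output frame")
    if n_in == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for j in range(n_out):
        # Position of output frame j on the input time axis.
        pos = j * (n_in - 1) / (n_out - 1)
        lo = min(int(pos), n_in - 2)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out

stretched = time_normalize([[0.0], [2.0], [4.0]], 5)
```

Every utterance, whatever its duration, ends up with the same number of frames, which lets a whole digit be treated as one fixed-length vector.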

The features used are mel-frequency log power spectra (23 bands with center frequencies starting at 100 Hz). After imputation of the missing static acoustic features, delta and delta-delta coefficients were calculated on these individual digits.
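For readers unfamiliar with delta features, the sketch below computes them with the common regression formula over a window of neighbouring frames, repeating the edge frames as padding. The post does not say which formula or window the authors used, so both are assumptions, as is the function name `deltas`.

```python
def deltas(frames, window=2):
    """Delta coefficients via the standard regression formula.

    frames: list of frames, each a list of floats.
    Assumptions: window of +/-2 frames, edges padded by repetition.
    """
    n_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

static = [[0.0], [1.0], [2.0], [3.0], [4.0]]   # imputed static features
delta = deltas(static)
delta_delta = deltas(delta)                     # deltas of the deltas
```

Because the dynamic features are derived from the static ones, computing them after imputation means they inherit the cleaned-up values instead of the noisy ones.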


