Incorporating mask modeling for noise robust automatic speech recognition (2009)

Overview of robustness techniques:

There have been several different approaches to improving noise robustness. A feature representation of the speech signal that is more robust to the effects of noise can be sought. The speech signal can be enhanced prior to recognition by techniques such as spectral subtraction, Wiener filtering, or MAP-based enhancement. Assuming some knowledge about the noise is available, noise-compensation techniques can be applied in the feature or model domain to reduce the mismatch between training and testing data. When only information on the location of noise-corrupted spectro-temporal elements is available (referred to as a mask), missing feature theory (MFT) can be employed to improve noise robustness by marginalizing these elements in the observation probability calculation [Lippmann, Richard P., and Beth A. Carlson. “Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise.” Proc. Eurospeech, 1997.] [Cooke, Martin, et al. “Robust automatic speech recognition with missing and unreliable acoustic data.” Speech Communication 34.3 (2001): 267-285.].
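To make the MFT marginalization idea concrete, here is a minimal sketch (not the paper's implementation; function names, shapes, and the diagonal-covariance GMM observation model are my assumptions). For a diagonal Gaussian, marginalizing a dimension over its full range simply drops that dimension's factor from the likelihood, so unreliable elements (mask == 0) are just excluded from the per-dimension sum:

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Per-dimension log N(x; mean, var) for a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def mft_log_likelihood(x, mask, weights, means, vars_):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM,
    using only the dimensions marked reliable (mask == 1); unreliable
    dimensions are marginalized out, i.e. their factors are dropped."""
    reliable = np.asarray(mask, dtype=bool)
    # Sum per-dimension log-likelihoods over reliable dimensions only.
    comp_ll = log_gauss_diag(x[None][:, reliable],
                             means[:, reliable],
                             vars_[:, reliable]).sum(axis=1)
    # Log-sum-exp over mixture components, weighted by mixture weights.
    a = np.log(weights) + comp_ll
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```

Marginalizing a dimension is equivalent to evaluating a GMM defined only on the reliable dimensions, which is what makes this usable inside a standard HMM observation-probability computation.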

Spectral peaks are more robust to broad-band noise than spectral valleys or harmonicity information.

Experimental results show that the models incorporating mask modeling achieve significant improvements in recognition performance under strongly noisy conditions.

This paper investigates the incorporation of mask modeling into an HMM-based automatic speech recognition system in noisy conditions. The problem is mathematically well formulated. In the evaluation, the oracle mask has been shown to provide significant recognition accuracy improvements in all noisy conditions. As the focus of the paper is mask integration, a simple mask estimation based on a noise estimate and sub-band voicing information [Jancovic, Peter, and M. Kokuer. “Estimation of voicing-character of speech spectra based on spectral shape.” IEEE Signal Processing Letters 14.1 (2007): 66-69.] is used, and only moderate improvements are obtained.

The features used are log filter-bank features. An open question is how the oracle mask was derived.
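For reference, a common way to construct an oracle mask (not necessarily this paper's exact recipe, and the threshold value is an assumption) is to threshold the local SNR computed from parallel clean and noise filter-bank energies, marking a spectro-temporal element reliable when clean speech energy dominates:

```python
import numpy as np

def oracle_mask(clean_energy, noise_energy, snr_threshold_db=0.0):
    """Binary mask: 1 where the local SNR (in dB) exceeds the threshold.
    clean_energy, noise_energy: linear filter-bank energies, shape (T, B).
    snr_threshold_db=0.0 is an illustrative choice, not from the paper."""
    eps = 1e-12  # avoid log of / division by zero
    local_snr_db = 10.0 * np.log10((clean_energy + eps) / (noise_energy + eps))
    return (local_snr_db > snr_threshold_db).astype(np.uint8)
```

Such a mask requires access to the clean and noise signals separately, which is why it is only usable as an oracle upper bound, not at test time.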

This work is carried out on Aurora 2; the authors have previously evaluated the approach on the Consonant (VCV) Challenge [Jancovic, P., and K. Münevver. “On the mask modeling and feature representation in the missing-feature ASR: evaluation on the consonant challenge.” Proceedings of Interspeech, 2008.].

