It integrates simultaneous masking, temporal masking, and cepstral mean and variance normalization (CMVN) into the ordinary mel-frequency cepstral coefficient (MFCC) feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain.
One of the biggest obstacles to the widespread use of automatic speech recognition technology is a lack of robustness to mismatches between training data and testing data, which include environmental noise, channel distortion, and speaker variability. The human auditory system, by contrast, copes extremely well with all of these situations. [Akbari Azirani, A., R. le Bouquin Jeannes, and G. Faucon. “Optimizing speech enhancement by exploiting masking properties of the human ear.” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.] [Hermansky, Hynek. “Should recognizers have ears?” Speech Communication 25.1 (1998): 3-27.]
In this paper, simultaneous masking is applied to the power spectral features obtained from the FFT.
In temporal masking, forward masking is much more effective than backward masking.
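As a rough illustration of forward masking on a spectrogram, the sketch below lets each frame raise a masking threshold that decays over the following frames and subtracts that threshold from later energies. The function name `forward_mask`, the `decay` parameter, and the simple geometric decay are assumptions made for illustration; the paper's exact masking curve is not reproduced here.

```python
import numpy as np

def forward_mask(spec, decay=0.6):
    """Suppress energy that follows a stronger earlier frame (illustrative only).

    spec  : array of shape (frames, bands) with non-negative energies.
    decay : assumed per-frame decay of the masking threshold, 0 <= decay < 1.
    """
    spec = np.asarray(spec, dtype=float)
    out = np.empty_like(spec)
    mask = np.zeros(spec.shape[1])  # no masker before the first frame
    for t, frame in enumerate(spec):
        # Energy below the current threshold is treated as masked.
        out[t] = np.maximum(frame - mask, 0.0)
        # The threshold for the next frame follows the louder of the
        # decayed previous threshold and the current frame.
        mask = decay * np.maximum(mask, frame)
    return out
```

With a loud frame followed by quiet ones, the quiet frames are attenuated until the threshold has decayed below their level.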
The proposed front-end feature extraction modifies the MFCC model provided by Voicebox by integrating lateral inhibition masking, temporal spectral averaging, and forward masking. Its stages are described below:
Lateral inhibition is applied on the mel-frequency scale. A combination of the original signal and the lateral inhibition output is used as the input to the recognizer.
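A minimal sketch of this step, assuming a simple three-tap inhibition across neighbouring mel bands and a linear mix with the original energies, might look as follows. The function name `lateral_inhibition` and the parameters `inhibit` and `mix` are hypothetical; the paper's actual inhibition kernel and combination weights are not given here.

```python
import numpy as np

def lateral_inhibition(mel_spec, inhibit=0.5, mix=0.5):
    """Sharpen spectral peaks across mel bands (illustrative only).

    mel_spec : array of shape (frames, bands) with non-negative energies.
    inhibit  : assumed strength with which neighbouring bands suppress a band.
    mix      : assumed weight of the inhibited output in the final combination.
    """
    mel_spec = np.asarray(mel_spec, dtype=float)
    # Repeat the edge bands so every band has two neighbours.
    padded = np.pad(mel_spec, ((0, 0), (1, 1)), mode="edge")
    neighbours = 0.5 * (padded[:, :-2] + padded[:, 2:])
    # Each band is reduced by the average of its neighbours, floored at zero.
    inhibited = np.maximum(mel_spec - inhibit * neighbours, 0.0)
    # Combine the original energies with the lateral inhibition output.
    return (1.0 - mix) * mel_spec + mix * inhibited
```

Applied to a frame with one prominent band, the peak-to-neighbour ratio increases, which is the sharpening effect in the frequency domain.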
Lateral inhibition is very sensitive to spectral changes and enhances the peaks in both speech and noise. Hence, temporal spectral averaging is applied in the spectral domain.
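Temporal spectral averaging can be sketched as a moving average over consecutive frames within each band. The function name `temporal_average`, the window `width`, and the uniform weights below are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np

def temporal_average(spec, width=3):
    """Smooth each band over time with a centred moving average (illustrative).

    spec  : array of shape (frames, bands).
    width : assumed odd number of frames in the averaging window.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window is centred")
    spec = np.asarray(spec, dtype=float)
    half = width // 2
    # Repeat the first and last frames so the output keeps the same length.
    padded = np.pad(spec, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(width) / width
    bands = [np.convolve(padded[:, b], kernel, mode="valid")
             for b in range(spec.shape[1])]
    return np.stack(bands, axis=1)
```

An isolated one-frame spike is spread over the window, so short-lived peaks are attenuated relative to energy that persists over several frames.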
CMVN is very effective in practice because it compensates for long-term spectral effects such as communication channel distortion.
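The idea behind CMVN can be sketched as per-utterance normalization of each cepstral coefficient to zero mean and unit variance. The function name `cmvn` and the small constant `eps` are illustrative choices; whether the paper normalizes per utterance or over a sliding window is not stated here.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalize each coefficient to zero mean and unit variance (per utterance).

    features : array of shape (frames, coefficients).
    eps      : small constant guarding against division by zero.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```

Because a stationary channel adds a roughly constant offset in the cepstral domain, subtracting the per-coefficient mean removes it: `cmvn(x + c)` matches `cmvn(x)` for any constant offset `c`.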
The paper also used 24 mel filter coefficients.