In short, this paper combines pHEQ with noise masking for robust speech recognition. First, noise masking is used to set the SNR to a fixed value; then pHEQ is used to reduce the difference between the cumulative distribution functions of the training and test observations.
Mismatch between training and test conditions degrades the performance of speech recognizers.
Many methods have been proposed to improve the robustness of speech recognition systems by trying to invert the signal transformations caused by channel differences between training and test conditions. Cepstral mean normalization (CMN) and mean and variance normalization (MVN) are two examples. CMN removes convolutional distortions by subtracting the cepstral mean from the cepstral feature vectors [Huang, Xuedong, Alejandro Acero, and Hsiao-Wuen Hon. Spoken language processing. Vol. 15. New Jersey: Prentice Hall PTR, 2001.]. MVN extends CMN by also normalizing the variance of the acoustic feature vectors [Viikki, Olli, and Kari Laurila. “Cepstral domain segmental feature vector normalization for noise robust speech recognition.” Speech Communication 25.1 (1998): 133-147.]. Although CMN and MVN can compensate for linear transformations, they are less effective against non-linear transformations, such as those caused by additive noise in the channel.
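The two normalizations can be sketched in a few lines. This is a minimal per-utterance illustration, not the implementation used in the cited works; the function names `cmn` and `mvn`, the `(frames, coefficients)` array layout, and the small `eps` guard are assumptions made for the example.

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization: subtract each coefficient's
    per-utterance mean. `features` has shape (frames, coefficients)."""
    return features - features.mean(axis=0, keepdims=True)

def mvn(features, eps=1e-8):
    """Mean and variance normalization: CMN followed by scaling each
    coefficient to unit standard deviation. `eps` avoids division by zero."""
    std = features.std(axis=0, keepdims=True)
    return cmn(features) / (std + eps)

# A fixed channel appears as a constant offset per cepstral coefficient,
# so CMN maps the clean and the channel-distorted utterance to the same features.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_offset = np.linspace(-1.0, 1.0, 13)   # hypothetical channel
distorted = clean + channel_offset
print(np.allclose(cmn(clean), cmn(distorted)))
```

Both functions undo only an offset and a scale per coefficient, which is why they cannot invert the non-linear distortion that additive noise produces in the cepstral domain.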
Noise masking increases the accuracy of speech recognition systems in the presence of noise by masking out low-energy events. In [Van Compernolle, Dirk. “Noise adaptation in a hidden Markov model speech recognition system.” Computer Speech & Language 3.2 (1989): 151-167.] for example, this is achieved by adding small amounts of artificial noise to the clean speech signal in order to increase the noise immunity of the system.
Two noise power spectrum estimation algorithms:
1) Improved Minima Controlled Recursive Averaging (IMCRA) [Cohen, Israel. “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging.” Speech and Audio Processing, IEEE Transactions on 11.5 (2003): 466-475.] estimates the noise power spectrum in adverse environments.
2) The Rangachari noise estimation algorithm [Rangachari, Sundarrajan, and Philipos C. Loizou. “A noise-estimation algorithm for highly non-stationary environments.” Speech Communication 48.2 (2006): 220-231.].
In this paper, pHEQ (parametric HEQ) is applied before the Mel filter bank, i.e., to the logarithm of the power spectrum.
Furthermore, our preliminary experiments show that any scaling of the log spectrum caused by differing variances, which corresponds to a strongly non-linear transformation in the power domain, degrades the results substantially. Thus the observed variances are replaced by the target variances directly estimated from the specific test utterance.
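The parametric form of pHEQ is not spelled out in these notes, so the sketch below shows only the general idea behind histogram equalization: each test value is pushed through the empirical CDF of the test data and then through the inverse CDF of a reference (training) distribution. It is plain, non-parametric HEQ on a one-dimensional feature stream, not the paper's method; the function name `heq` and the synthetic data are assumptions.

```python
import numpy as np

def heq(test, reference):
    """Generic histogram equalization of a 1-D feature stream.

    Each test value is replaced by the reference quantile that sits at
    the same position in the cumulative distribution.
    """
    # Rank of each test value, converted to an empirical CDF value in (0, 1).
    ranks = np.argsort(np.argsort(test))
    cdf = (ranks + 0.5) / len(test)
    # Inverse empirical CDF of the reference data.
    return np.quantile(reference, cdf)

rng = np.random.default_rng(0)
reference = rng.normal(size=1000)                # stands in for training data
test = np.exp(0.5 * rng.normal(size=1000) + 1)   # non-linearly distorted data
equalized = heq(test, reference)
print(equalized.mean(), equalized.std())
```

Because the mapping is monotonic, the ordering of the test values is preserved while their distribution is reshaped to match the reference, which is something an offset-and-scale normalization cannot do.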
Two types of noise masking:
1) A noise masking value substitutes the output of the filter bank if the output falls below the masking value. [Klatt, D. “A digital filter bank for spectral matching.” Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’76. Vol. 1. IEEE, 1976.]
2) Noise masking is implemented by adding extra artificial noise to the speech signal in order to attain the desired SNR. [Van Compernolle, Dirk. “Noise adaptation in a hidden Markov model speech recognition system.” Computer Speech & Language 3.2 (1989): 151-167.]
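The two variants above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input to `add_noise_to_snr` is treated as noise-free speech, and white Gaussian noise is used because the notes do not say which artificial noise the cited work adds.

```python
import numpy as np

def mask_floor(filterbank_energies, mask_value):
    """Type 1: replace every filter-bank output below the masking
    value by the masking value itself."""
    return np.maximum(filterbank_energies, mask_value)

def add_noise_to_snr(signal, target_snr_db, rng=None):
    """Type 2: add artificial noise so that the result has the target SNR.

    Assumes `signal` is clean; the noise is white Gaussian (an assumption),
    rescaled so that its power matches the target exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    noise = rng.normal(size=signal.shape)
    noise *= np.sqrt(noise_power / np.mean(noise ** 2))
    return signal + noise, noise

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220 * t)   # toy stand-in for a speech signal
noisy, noise = add_noise_to_snr(speech, 10.0, np.random.default_rng(0))
print(mask_floor(np.array([0.1, 2.0, 0.5]), 1.0))
print(snr_db(speech, noise))
```

If the input already contains noise, the existing noise power would have to be estimated first (for example with one of the noise estimators listed earlier) and only the missing amount added.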
Comparing Instan2 with Instan shows the effect of a commonly used alternative when modifying the features: modifying only the static features and keeping the original dynamic features. This improves the clean-speech result at the cost of the noisy-speech results.
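The difference between the two feature layouts can be illustrated with the standard regression formula for delta features. This is a sketch only: the helper names are hypothetical, and the notes do not say how Instan obtains its dynamic features, so recomputing them from the modified statics is an assumption. Only the "keep the original dynamic features" variant corresponds to the description of Instan2.

```python
import numpy as np

def deltas(static, window=2):
    """Delta (dynamic) features via the usual regression formula
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2),
    with edge frames repeated at the utterance boundaries."""
    num_frames = static.shape[0]
    padded = np.pad(static, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros(static.shape, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n : window + n + num_frames]
        behind = padded[window - n : window - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

def recompute_dynamic(modified_static):
    """Assumed variant: dynamics recomputed from the modified statics."""
    return np.hstack([modified_static, deltas(modified_static)])

def keep_original_dynamic(original_static, modified_static):
    """Instan2-style variant: modified statics, original dynamics."""
    return np.hstack([modified_static, deltas(original_static)])

original = np.arange(10, dtype=float).reshape(-1, 1)   # toy ramp, 1 coefficient
modified = np.maximum(original, 4.0)                   # e.g. after masking
print(recompute_dynamic(modified)[:, 1])
print(keep_original_dynamic(original, modified)[:, 1])
```

In the toy example the masked frames have zero deltas in the first variant but keep their original slope in the second, which shows how the two choices can diverge once the static features have been altered.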