An improved model of masking effects for robust speech recognition system

2013-An improved model of masking effects for robust speech recognition system

It integrates simultaneous masking, temporal masking, and cepstral mean and variance normalization (CMVN) into the ordinary mel-frequency cepstral coefficient (MFCC) feature extraction pipeline for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency and time domains.

One of the biggest obstacles to the widespread use of automatic speech recognition technology is robustness to mismatches between training and testing data, including environment noise, channel distortion, and speaker variability. The human auditory system, by contrast, is extremely adept at handling all of these situations. [Akbari Azirani, A., R. le Bouquin Jeannes, and G. Faucon. “Optimizing speech enhancement by exploiting masking properties of the human ear.” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.][Hermansky, Hynek. “Should recognizers have ears?.” Speech Communication 25.1 (1998): 3-27.]

The simultaneous masking in this paper is applied to the power spectral features obtained from the FFT.

In temporal masking, forward masking is much more effective than backward masking, which is why only forward masking is integrated into the front-end below.

The proposed front-end feature extraction is modified from the MFCC model provided by Voicebox by integrating lateral inhibition masking, temporal spectral averaging, and forward masking, as shown below:

[Figure: block diagram of the proposed front-end, an MFCC pipeline extended with lateral inhibition masking, temporal spectral averaging, and forward masking]

Lateral inhibition is applied on the mel-frequency scale. A combination of the original signals and the lateral-inhibition outputs is used as the input to the recognizer.

Lateral inhibition is very sensitive to spectral changes, and it enhances the peaks in both the speech and the noise; hence, temporal spectral averaging in the spectral domain is applied to counteract this.
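As a rough illustration of these two steps, here is a minimal sketch that applies a center-surround lateral-inhibition filter across mel bands and then smooths each band over time; the kernel weights, the combination factor, and the rectification are illustrative assumptions, not the paper's exact design:

```python
import numpy as np
from scipy.signal import convolve2d

def lateral_inhibition(mel_spec, alpha=0.5):
    """Sharpen a mel power spectrogram (bands x frames) with a
    center-surround filter across frequency. Kernel weights and the
    combination factor alpha are illustrative, not the paper's values."""
    kernel = np.array([[-0.25], [-0.25], [1.0], [-0.25], [-0.25]])
    li = convolve2d(mel_spec, kernel, mode="same", boundary="symm")
    # Combine the original signal with the (rectified) inhibition output.
    return mel_spec + alpha * np.maximum(li, 0.0)

def temporal_spectral_averaging(mel_spec, win=3):
    """Short moving average over time in each band, to suppress the
    noise peaks that lateral inhibition also enhances."""
    kernel = np.ones((1, win)) / win
    return convolve2d(mel_spec, kernel, mode="same", boundary="symm")
```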

CMVN is very effective in practice, as it compensates for long-term spectral effects such as communication-channel distortion.
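CMVN itself is straightforward; a per-utterance sketch of the standard formulation:

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalization over one utterance.
    cepstra: (frames, coefficients) array of MFCCs."""
    mu = cepstra.mean(axis=0)
    sigma = cepstra.std(axis=0)
    return (cepstra - mu) / np.maximum(sigma, 1e-8)  # guard against zero variance
```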

The paper also uses a 24-channel mel filter bank.

 


On noise masking for automatic missing data speech recognition: a survey and discussion

2007-On noise masking for automatic missing data speech recognition a survey and discussion

To restrict the range of methods considered, only techniques using a single microphone are covered.

A similar principle is used in a few other robust methods, such as uncertainty decoding or detection-based recognition.

These so-called missing-data masks play a fundamental role in missing-data recognition, and their quality has a strong impact on the final recognition accuracy.

Although in theory a mask can be defined for any parameterization domain, in practice such a domain should map distinct frequency bands onto different feature-space dimensions, so that a frequency-limited noise only affects a few dimensions of the feature space. This is typically true for frequency-like and wavelet-based parameterizations and for most auditory-inspired analysis front-ends, but not for the cepstrum, where each coefficient depends on all frequency bands.

Several missing data studies have shown that soft masks give better results than hard masks.

When the mask is computed from the local SNR given both the clean and noisy speech signals, the resulting mask is called an oracle mask.

There are two common techniques for handling missing features during recognition: data imputation, which replaces unreliable features with clean-speech estimates before decoding, and data marginalization, which integrates the unreliable dimensions out of the observation likelihood. Both are sketched below.
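For a diagonal-covariance Gaussian state model, the two techniques can be sketched in a few lines; the helper names below are mine, for illustration only:

```python
import numpy as np

def loglik_marginalized(x, mask, mean, var):
    """Marginalization: score a frame using only the dimensions the
    mask marks as reliable (for diagonal covariances, the unreliable
    dimensions simply drop out of the product of likelihoods)."""
    r = mask.astype(bool)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var[r]) + d * d / var[r])

def impute(x, mask, clean_estimate):
    """Imputation: replace unreliable dimensions with clean-speech
    estimates, then decode with an unmodified recognizer."""
    return np.where(mask.astype(bool), x, clean_estimate)
```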

Training stochastic models of masks seems a promising approach to inferring efficient masks.

 

Sparse imputation for noise robust speech recognition using soft masks

2009-Sparse imputation for noise robust speech recognition using soft masks

Using soft masks in the sparse imputation approach yields a substantial increase in recognition accuracy, most notably at low SNRs.

The general idea behind missing data techniques (MDT) is that it is possible to estimate, prior to decoding, which spectro-temporal elements of the acoustic representation are reliable (i.e., dominated by speech) and which are unreliable (i.e., dominated by background noise). These reliability estimates, referred to as a spectrographic mask, can then be used to treat reliable and unreliable features differently, for instance by replacing the unreliable features with clean-speech estimates (i.e., missing data imputation).

The speech representation is the power spectrogram. In experiments with artificially added noise, the so-called oracle masks can be computed directly using

M(k,t) = \begin{cases} 1 & \text{if } |S(k,t)|/|N(k,t)| > \theta \\ 0 & \text{otherwise} \end{cases}

where S(k,t) is the speech spectrogram for frequency band k at time frame t, and N(k,t) is the noise spectrogram. If log-spectral features are used, reliable noisy-speech coefficients can be used directly as estimates of the clean speech features, since for reliable cells N(k,t)/S(k,t) ≪ 1 and hence log(|S(k,t)+N(k,t)|) = log(|S(k,t)(1+N(k,t)/S(k,t))|) ≈ log(|S(k,t)|). Other mask estimation techniques can be found in [Cerisara, Christophe, Sébastien Demange, and Jean-Paul Haton. “On noise masking for automatic missing data speech recognition: A survey and discussion.” Computer Speech & Language 21.3 (2007): 443-457.]
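Given the clean and noise spectrograms, the hard oracle mask is essentially a one-liner; a sketch under the formulation above, using the -3 dB threshold mentioned later in the paper:

```python
import numpy as np

def oracle_mask(S, N, thr_db=-3.0):
    """Hard oracle mask: a cell (k, t) is reliable (1) when the local
    amplitude ratio |S|/|N| exceeds theta, with 20*log10(theta) = thr_db."""
    theta = 10.0 ** (thr_db / 20.0)
    return (np.abs(S) / np.maximum(np.abs(N), 1e-12) > theta).astype(float)
```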

The authors represent both the speech and the mask using sparse vectors. This is similar to network models that reconstruct their inputs from hidden factors that are themselves sparse; the two techniques are closely related, with the network models providing a more integrated way of doing the same thing.

Instead of hard masking decisions, soft masks can be used. The simplest soft mask applies a sigmoid function (with slope \alpha) to the above hard formulation:

M(k,t) = \frac{1}{1 + \exp\left(-\alpha\,(|S(k,t)|/|N(k,t)| - \theta)\right)}

The authors converted the acoustic feature representations to a time-normalized representation (a fixed number of acoustic feature frames) using spline interpolation. They chose 20\log_{10}(\theta) = -3 dB for the oracle mask.
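A sketch of the soft mask under the sigmoid form reconstructed above; the slope alpha is an illustrative assumption (the paper derives its sigmoid parameters empirically):

```python
import numpy as np

def soft_oracle_mask(S, N, thr_db=-3.0, alpha=3.0):
    """Soft oracle mask: a sigmoid of the distance between the local
    amplitude ratio and the threshold theta, instead of a hard 0/1 cut."""
    theta = 10.0 ** (thr_db / 20.0)
    ratio = np.abs(S) / np.maximum(np.abs(N), 1e-12)
    return 1.0 / (1.0 + np.exp(-alpha * (ratio - theta)))
```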

The features used are mel-frequency log power spectra (23 bands, with center frequencies starting at 100 Hz). After imputation of the missing static acoustic features, delta and delta-delta coefficients were calculated for the individual digits.

 

Incorporating mask modeling for noise robust automatic speech recognition

2009-Incorporating mask modeling for noise robust automatic speech recognition

Overview of robustness techniques:

There have been several different approaches to improving noise robustness. A feature representation of the speech signal that is more robust to the effects of noise can be sought. The speech signal can be enhanced prior to recognition by techniques such as spectral subtraction, Wiener filtering, or MAP-based enhancement. Assuming some knowledge about the noise is available, noise-compensation techniques can be applied in the feature or model domain to reduce the mismatch between the training and testing data. Finally, when only the locations of noise-corrupted spectro-temporal elements are available (referred to as a mask), missing feature theory (MFT) can be employed to improve noise robustness by marginalizing these elements in the observation probability calculation [Lippmann, Richard P., and Beth A. Carlson. “Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise.” Proc. Eurospeech. Vol. 97. 1997.][Cooke, Martin, et al. “Robust automatic speech recognition with missing and unreliable acoustic data.” Speech Communication 34.3 (2001): 267-285.].

Spectral peaks are more robust to broad-band noise than spectral valleys or harmonicity information.

Experimental results show that the models incorporating mask modeling achieve significant improvements in recognition performance under severe noise conditions.

This paper investigates the incorporation of mask modeling into an HMM-based automatic speech recognition system in noisy conditions. The problem is mathematically well formulated. In the evaluation, the oracle mask is shown to provide significant recognition-accuracy improvements in all noisy conditions. As the focus of the paper is mask integration, a simple mask estimation based on a noise estimate and sub-band voicing information [Jancovic, Peter, and M. Kokuer. “Estimation of voicing-character of speech spectra based on spectral shape.” Signal Processing Letters, IEEE 14.1 (2007): 66-69.] is used, and only moderate improvements are obtained.

The features used are log filter-bank features. An open question is how the oracle mask was derived.

This work is carried out on Aurora 2; the authors previously tested the approach on the VCV (Consonant) Challenge. [Jancovic, P., and K. Münevver. “On the mask modeling and feature representation in the missing-feature ASR: evaluation on the consonant challenge.” Proceedings of Interspeech. 2008.]

Histogram Equalization and Noise Masking for Robust Speech Recognition

2010-Histogram equalization and noise masking for robust speech recognition

Simply put, this paper combines pHEQ with noise masking for robust speech recognition. First, noise masking is used to bring the SNR to a fixed value; then pHEQ is used to reduce the differences between the cumulative distribution functions of the training and testing observations.

Mismatch between training and test conditions deteriorates performance of speech recognizers.

Many methods have been proposed to improve the robustness of speech recognition systems by trying to invert the signal transformations caused by channel differences between training and test conditions. Cepstral mean normalization (CMN) and mean and variance normalization (MVN) are two examples. CMN removes convolutional distortions by subtracting the cepstral mean from the cepstral feature vectors [Huang, Xuedong, Alejandro Acero, and Hsiao-Wuen Hon. Spoken language processing. Vol. 15. New Jersey: Prentice Hall PTR, 2001.]. MVN extends CMN by also normalizing the variance of the acoustic feature vectors [Viikki, Olli, and Kari Laurila. “Cepstral domain segmental feature vector normalization for noise robust speech recognition.” Speech Communication 25.1 (1998): 133-147.]. Although CMN and MVN can compensate for linear transformations, they are less effective against the non-linear transformations resulting from, for example, additive noise in the channel.

Noise masking increases the accuracy of speech recognition systems in the presence of noise by masking out low-energy events. In [Van Compernolle, Dirk. “Noise adaptation in a hidden Markov model speech recognition system.” Computer Speech & Language 3.2 (1989): 151-167.] for example, this is achieved by adding small amounts of artificial noise to the clean speech signal in order to increase the noise immunity of the system.

Two noise power spectrum estimation algorithms:

1). Improved Minima Controlled Recursive Averaging (IMCRA) [Cohen, Israel. “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging.” Speech and Audio Processing, IEEE Transactions on 11.5 (2003): 466-475.] is one method to estimate the noise power spectrum in adverse environments.

2). The Rangachari noise-estimation algorithm [Rangachari, Sundarrajan, and Philipos C. Loizou. “A noise-estimation algorithm for highly non-stationary environments.” Speech Communication 48.2 (2006): 220-231.] is another, designed for highly non-stationary environments.

In this paper, pHEQ (parametric HEQ) is applied before the mel filter bank, i.e., on the logarithm of the power spectrum:

\hat{x} = \mu_{ref} + \frac{\sigma_{ref}}{\sigma_{obs}}\,(x - \mu_{obs})

where \mu and \sigma denote the per-bin means and standard deviations of the reference (training) and observed (test) distributions.

Furthermore, the authors' preliminary experiments show that any scaling of the logarithmic spectrum due to differing variances, which produces strong non-linear transformations in the power domain, substantially deteriorates the results. The observed variances are therefore replaced by the target variances estimated directly from the specific test utterance.
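A sketch of this variant of pHEQ on the log power spectrum, under the Gaussian formulation above; per the remark about variance scaling, the default here equalizes only the means, which is my reading of the paper and flagged as an assumption:

```python
import numpy as np

def pheq(log_spec, ref_mean, ref_std, scale_variance=False):
    """Parametric (Gaussian-based) HEQ on log power spectra
    (frames x bins). ref_mean/ref_std are per-bin reference statistics
    from training data. scale_variance=False reflects the observation
    above that variance scaling hurts; this reading is an assumption."""
    obs_mean = log_spec.mean(axis=0)
    if scale_variance:
        obs_std = np.maximum(log_spec.std(axis=0), 1e-8)
        return ref_mean + ref_std / obs_std * (log_spec - obs_mean)
    return ref_mean + (log_spec - obs_mean)
```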

Two types of noise masking (both sketched after the list):

1) A noise-masking value substitutes for the filter-bank output whenever that output falls below the masking value. [Klatt, D. “A digital filter bank for spectral matching.” Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’76.. Vol. 1. IEEE, 1976.]

2) Noise masking is implemented by adding extra artificial noise to the speech signal in order to attain the desired SNR. [Van Compernolle, Dirk. “Noise adaptation in a hidden Markov model speech recognition system.” Computer Speech & Language 3.2 (1989): 151-167.]
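A minimal sketch of both types; the white-Gaussian choice of artificial noise in the second is an illustrative assumption:

```python
import numpy as np

def mask_by_flooring(fbank, mask_value):
    """Type 1 (Klatt-style): filter-bank outputs below the masking
    value are replaced by the masking value itself."""
    return np.maximum(fbank, mask_value)

def mask_by_added_noise(signal, target_snr_db, rng=None):
    """Type 2 (Van Compernolle-style): add artificial noise so the
    utterance attains the desired SNR (white Gaussian noise assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (target_snr_db / 10.0))
    return signal + rng.normal(scale=np.sqrt(p_noise), size=signal.shape)
```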

Comparing Instan2 with Instan shows the effect of a commonly used alternative when modifying the features: modifying only the static features while keeping the original dynamic features. This improves the clean-speech results at the cost of the noisy-speech results.

 

Robust ASR based on Clean Speech Models: An evaluation of missing data techniques for connected digit recognition in noise

2001 – Robust ASR Based On Clean Speech Models An Evaluation of Missing Data Techniques For Connected Digit Recognition in Noise

Spectral features are particularly sensitive to gender differences. This is illustrated in Figure 2, which was constructed by taking a clean speech utterance and force-aligning it to single-component Gaussian models trained on either male or female speakers. By outputting the means of each model state in the alignment, it is possible to effectively reconstruct the spectral representation as if the speech had been spoken by the prototypical male or prototypical female speaker. This technique makes clearly visible what the models have learnt about male versus female speech.

 

Three masking techniques are investigated:

1) Discrete SNR masks: specifically, the filter-bank features are converted into the spectral amplitude domain, the first ten frames are averaged to form a stationary noise estimate, and this estimate is subtracted from the noisy signal to form a clean-signal estimate. The ratio of these two estimates forms the local SNR. Features are labelled “reliable” if their local SNR exceeds a threshold of 7 dB, and “unreliable” otherwise. The 7 dB threshold has been empirically shown to be near-optimal in previous work using different data and noise types. (A sketch of this computation follows the list.)

 

2) Soft SNR masks: the values of the soft masks are generated by compressing the local SNR with a sigmoid function whose parameters are empirically derived.

 

3) Combined harmonicity and SNR masks: for details, refer to the paper.
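As promised above, a sketch of the discrete SNR mask computation; the amplitude-domain spectrogram layout (bands x frames) is an assumption:

```python
import numpy as np

def discrete_snr_mask(noisy_amp, n_noise_frames=10, thr_db=7.0):
    """Discrete SNR mask as described in 1): average the first frames
    as a stationary noise estimate, spectrally subtract it to estimate
    the clean signal, and mark cells reliable above 7 dB local SNR."""
    noise = noisy_amp[:, :n_noise_frames].mean(axis=1, keepdims=True)
    clean_est = np.maximum(noisy_amp - noise, 1e-12)  # spectral subtraction
    local_snr_db = 20.0 * np.log10(clean_est / np.maximum(noise, 1e-12))
    return (local_snr_db > thr_db).astype(float)
```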