
MMSE Based Missing Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

2013-MMSE based missing feature reconstruction with temporal modeling for robust speech recognition

This paper proposes temporal modeling for MMSE-based missing feature reconstruction. It falls into the feature imputation category of missing feature theory. The paper focuses only on the second stage of missing feature theory, i.e. given known masks (either oracle or estimated), how the masked features can be used for recognition. Techniques of this kind try to reconstruct the masked, unreliable features using the noisy features and the masked speech features. Mask estimation itself is not explored: only oracle masks and masks derived from a simple noise estimate based on the beginning and end of each utterance are tested.

The missing data approach to noise robust speech recognition assumes that the log-spectral features (FBanks) can be either almost unaffected by noise or completely masked by it.

The performance of automatic speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. One source of mismatch that remains a major issue for the ASR research community is additive noise.

Achieving noise robustness is key to making these systems deployable in real-world conditions.

Although marginalization performs optimal decoding with missing features, it suffers from two main drawbacks. First, the standard decoding algorithm must be modified to account for missing features. Second, recognition has to be carried out with spectral features, although it is well known that cepstral features outperform spectral ones for speech recognition. With spectral features, the acoustic model needs to employ Gaussian mixtures with full covariance matrices or an increased number of Gaussians with diagonal covariance.

For GMM-based systems, feature imputation is compulsory, whereas DNNs are themselves capable of handling missing features, so no imputation stage is required. The only remaining focus would then be how to estimate the masks as accurately as possible.

A GMM is used to represent clean speech and a minimum mean square error criterion is adopted to obtain suitable estimates of the unreliable features.
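To make the idea concrete, here is a minimal sketch of GMM-based MMSE imputation for a single frame. It is a simplified variant, not the paper's method: it assumes full-covariance components, uses the plain conditional mean of the unreliable channels given the reliable ones, and ignores both the temporal modeling and the constraint that clean speech energy cannot exceed the observed noisy energy. The function names and arguments are hypothetical, and at least one reliable channel is assumed.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian with full covariance."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def mmse_impute(x, reliable, weights, means, covs):
    """Replace the unreliable entries of x by their GMM conditional mean.

    x        : (D,) float feature vector (e.g. log mel filterbank frame)
    reliable : (D,) boolean mask, True where the feature is reliable
    weights  : K mixture weights; means: K vectors (D,); covs: K matrices (D, D)
    """
    r = np.asarray(reliable, dtype=bool)
    u = ~r
    if not u.any():
        return x.copy()
    log_post = np.empty(len(weights))
    cond_means = np.empty((len(weights), int(u.sum())))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        cov_rr = cov[np.ix_(r, r)]
        cov_ur = cov[np.ix_(u, r)]
        # Component posterior is computed from the reliable part only.
        log_post[k] = np.log(w) + gaussian_logpdf(x[r], mu[r], cov_rr)
        # Per-component conditional mean of the unreliable part.
        cond_means[k] = mu[u] + cov_ur @ np.linalg.solve(cov_rr, x[r] - mu[r])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    x_hat = x.copy()
    x_hat[u] = post @ cond_means  # MMSE estimate: posterior-weighted average
    return x_hat
```

With a single component and a correlation of 0.5 between two channels, a reliable value of 2.0 in the first channel yields an estimate of 1.0 for the second, whatever noisy value was observed there.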

Missing data assumption derived mathematically:

Screenshot from 2013-04-23 12:22:00

Both oracle and estimated masks are tested. The oracle mask is obtained by direct comparison between the clean and noisy utterances using a threshold of 7 dB SNR. In this paper, mask estimation and spectral reconstruction are performed in the log mel filterbank domain, with 23 filterbank channels. Acoustic models trained on clean speech are employed in each task. For Aurora 4, a bigram LM is used for decoding.
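A rough sketch of how such an oracle mask could be computed, assuming parallel clean and noisy utterances are available as linear (not log) mel filterbank energies of shape (frames, channels), and assuming the noise energy is approximated by the difference between the two. Only the 7 dB threshold comes from the paper; the function name and the difference-based noise approximation are assumptions of this sketch.

```python
import numpy as np

def oracle_mask(clean_fbank, noisy_fbank, threshold_db=7.0, eps=1e-10):
    """Boolean mask, True where the local SNR exceeds the threshold.

    clean_fbank, noisy_fbank : (frames, channels) linear filterbank energies
    """
    # Assumption: noise energy is roughly noisy minus clean energy.
    noise = np.maximum(noisy_fbank - clean_fbank, eps)
    snr_db = 10.0 * np.log10(np.maximum(clean_fbank, eps) / noise)
    return snr_db > threshold_db
```

A unit with clean energy 10 and noisy energy 11 has a local SNR of 10 dB and is marked reliable; a unit with clean energy 1 and noisy energy 11 is at -10 dB and is marked unreliable.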

Missing data masks are computed from noise estimates obtained through linear interpolation of initial noise statistics extracted from the beginning and final frames of every utterance.
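The noise estimate described here can be sketched as follows. The number of edge frames (`n_edge`), averaging in the linear energy domain, and reusing the 7 dB threshold for the estimated mask are all assumptions of this sketch; the notes only state that the initial statistics come from the first and last frames and are linearly interpolated.

```python
import numpy as np

def interpolated_noise(noisy_fbank, n_edge=10):
    """Per-frame noise estimate, linearly interpolated between the
    average of the first and the last n_edge frames of the utterance.

    noisy_fbank : (frames, channels) linear filterbank energies
    """
    n_frames = noisy_fbank.shape[0]
    start = noisy_fbank[:n_edge].mean(axis=0)
    end = noisy_fbank[-n_edge:].mean(axis=0)
    alpha = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - alpha) * start + alpha * end

def estimated_mask(noisy_fbank, noise_fbank, threshold_db=7.0, eps=1e-10):
    """Mask from the estimated SNR, with speech energy ~ noisy - noise."""
    speech = np.maximum(noisy_fbank - noise_fbank, eps)
    snr_db = 10.0 * np.log10(speech / np.maximum(noise_fbank, eps))
    return snr_db > threshold_db
```

Because the estimate is a straight line between the two utterance edges, it can only follow a slow drift in the noise level.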

The improvement using oracle masks is especially noticeable at medium and low SNRs. Thereby, the mismatch due to noise can be effectively reduced with only knowledge of the masking pattern.

Under proper knowledge of the masking pattern, the mismatch introduced by the noise can be significantly palliated just by a suitable exploitation of source correlations.

The proposed reconstruction techniques suffer little degradation with respect to the clean condition.

The poorer performance with estimated masks is largely due to the simple noise estimation technique employed, which cannot suitably account for non-stationary noise. However, this simple noise estimator has been useful to demonstrate the utility of the proposed temporal modeling for MD approaches, which consistently improves ASR performance over other MD systems with estimated masks as well as with oracle masks.



Binary masking and speech intelligibility

2011-Binary masking and speech intelligibility

This thesis mainly discusses various aspects of using binary masking to improve speech intelligibility. Several sections in the Introduction give detailed, systematic reviews of the binary masking technique:

4. Binary Masking

5. Sparsity of Speech

6. Oracle Masks

7. Application of the Binary Mask

8. Time-Frequency Masking

The noise robustness is formulated as a source separation task, “the cocktail party problem”, that separates the target speech from the interfering noise sounds.

Speech separation enhances the speech and reduces the background noise before transmission.

The human auditory system efficiently identifies and separates the sources prior to recognition at a higher level.

The decreased intelligibility can be compensated either by separating the target speech from the interfering sounds, by enhancement of the target speech, or by reducing the interfering sound.

Speech is robust and redundant, which means that part of the speech sound can be lost or modified without negative impact on intelligibility. [Miller, George A., and J. C. R. Licklider. “The intelligibility of interrupted speech.” The Journal of the Acoustical Society of America 22 (1950): 167.] [Warren, Richard M. “Perceptual restoration of missing speech sounds.” Science 167.3917 (1970): 392-393.] [Howard‐Jones, Paul A., and Stuart Rosen. “Uncomodulated glimpsing in ‘checkerboard’ noise.” The Journal of the Acoustical Society of America 93 (1993): 2915.]

In binary masking, sound sources are assigned as either target or interferer in the time-frequency domain. The target sound (speech) is kept by using the value one in the binary mask, whereas the regions with the interferer are removed by using the value zero.

In short, binary masking is a method of applying a binary, frequency-dependent, and time-varying gain in a number of frequency channels, and the binary mask defines what to do when.
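As a small illustration of this "binary gain" view, the sketch below applies a 0/1 mask to a time-frequency representation of the mixture. The array layout (frames by channels) and the function name are assumptions; resynthesis to a waveform is left out.

```python
import numpy as np

def apply_binary_mask(tf_mixture, mask):
    """Keep time-frequency units where mask is 1 and zero out the rest.

    tf_mixture : (frames, channels) time-frequency representation of the mixture
    mask       : (frames, channels) array of zeros and ones
    """
    return tf_mixture * np.asarray(mask, dtype=tf_mixture.dtype)
```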

Binary masking thus involves two steps: estimating the binary mask, and applying it to carry out the source separation.

The term oracle mask is used for binary masks calculated using a priori knowledge that is not available in most real-life applications. A major objection to the concept of oracle masks is that they are of no use in real-life applications because of the required a priori knowledge. However, the oracle masks establish an upper limit of performance, which makes them useful as references and goals for binary masking algorithms developed for real-life applications such as hearing aids.

The local SNR criterion is the threshold for classifying a time-frequency unit as dominated by the target or by the interferer sound, and this threshold controls the number of ones in the ideal binary mask.

The target binary mask can be calculated by comparing the target speech directly with the long-term average spectrum of the target speech.

The ideal binary mask requires the interferer to be available and will change depending on the type of interferer, whereas the target binary mask is calculated from the target sound only and therefore is independent of the interferer sound.
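The difference between the two oracle masks can be sketched as below, assuming power values in a (frames, channels) array. The criterion `lc_db` and the function names are illustrative, and the long-term average spectrum is taken here simply as the per-channel mean of the target energy over time, which is an assumption of this sketch.

```python
import numpy as np

def ideal_binary_mask(target_pow, interferer_pow, lc_db=0.0, eps=1e-12):
    """1 where the local target-to-interferer SNR exceeds lc_db, else 0."""
    snr_db = 10.0 * np.log10((target_pow + eps) / (interferer_pow + eps))
    return (snr_db > lc_db).astype(np.uint8)

def target_binary_mask(target_pow, lc_db=0.0, eps=1e-12):
    """1 where the target exceeds its own long-term average spectrum by lc_db."""
    # Long-term average spectrum: mean energy per channel over all frames.
    ltas = target_pow.mean(axis=0, keepdims=True)
    ratio_db = 10.0 * np.log10((target_pow + eps) / (ltas + eps))
    return (ratio_db > lc_db).astype(np.uint8)
```

Note that `target_binary_mask` never sees the interferer, which is why it stays the same when the noise type changes, while `ideal_binary_mask` has to be recomputed for every interferer.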

The masking is applied to features from a 64-channel Gammatone filterbank on the ERB frequency scale. 64 frequency channels are enough to achieve high intelligibility [Li, Ning, and Philipos C. Loizou. “Effect of spectral resolution on the intelligibility of ideal binary masked speech.” The Journal of the Acoustical Society of America 123.4 (2008): EL59-EL64.]. Open questions: is the power or the magnitude spectrum used, and how does this differ from the conventional mel triangular filterbank features?

Another set of features is the magnitude at the equally spaced STFT frequencies.

Because the Gammatone filterbank resembles the processing in the human auditory system, it is often used for speech processing and perceptual studies. The STFT can also be used but has the drawback of requiring more frequency channels to obtain the same spectral resolution at low frequencies than the Gammatone filterbank.

The ideal binary mask has been shown to enable increased intelligibility in the ideal situation, whereas the Wiener filter, when tested under realistic conditions, shows an increase in quality while in most situations only preserving intelligibility.


An improved model of masking effects for robust speech recognition system

2013-An improved model of masking effects for robust speech recognition system

It integrates simultaneous masking, temporal masking, and cepstral mean and variance normalization (CMVN) into the ordinary mel-frequency cepstral coefficient (MFCC) feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain.

One of the biggest obstacles to widespread use of automatic speech recognition technology is robustness to mismatches between training and testing data, which include environmental noise, channel distortion and speaker variability. The human auditory system, on the other hand, is extremely adept in all of these situations. [Akbari Azirani, A., R. le Bouquin Jeannes, and G. Faucon. “Optimizing speech enhancement by exploiting masking properties of the human ear.” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.] [Hermansky, Hynek. “Should recognizers have ears?.” Speech Communication 25.1 (1998): 3-27.]

The simultaneous masking in this paper is applied on the power spectral features from FFT.

In temporal masking, forward masking is much more effective than backward masking.

The proposed front-end is a modification of the MFCC implementation provided by Voicebox that integrates lateral inhibition masking, temporal spectral averaging, and forward masking, as shown below:

Screenshot from 2013-04-16 14:25:15


The lateral inhibition is applied on mel-frequency scale. A combination of the original signals and the lateral inhibition outputs is used as the input signal to a recognizer.

Lateral inhibition is very sensitive to spectral changes and it will enhance the peaks in both speech signals and noise signals. Hence, temporal spectral averaging in spectral domain is applied.
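The two operations can be illustrated with a generic sketch. The kernel values, the mixing weight, and the averaging width below are hypothetical choices for illustration, not the filters used in the paper; the input is assumed to be a non-negative mel spectrum of shape (frames, channels), and `width` is assumed to be odd.

```python
import numpy as np

def lateral_inhibition(mel_spec, kernel=(-0.5, 1.0, -0.5), mix=0.5):
    """Sharpen spectral peaks along the mel channel axis and mix the
    result with the original spectrum (hypothetical kernel and weight)."""
    k = np.asarray(kernel, dtype=float)
    pad = len(k) // 2
    padded = np.pad(mel_spec, ((0, 0), (pad, pad)), mode="edge")
    inhibited = np.stack([np.convolve(row, k, mode="valid") for row in padded])
    inhibited = np.maximum(inhibited, 0.0)  # keep the output non-negative
    return mix * mel_spec + (1.0 - mix) * inhibited

def temporal_average(spec, width=3):
    """Moving average over frames, to smooth peaks that do not persist."""
    pad = width // 2
    padded = np.pad(spec, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(width) / width
    cols = [np.convolve(col, kernel, mode="valid") for col in padded.T]
    return np.stack(cols, axis=1)
```

On a frame such as [1, 1, 3, 1, 1], this kernel raises the peak-to-neighbour ratio from 3 to 5, which shows the sharpening effect and also why isolated noise peaks get enhanced too.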

The CMVN is very effective in practice where it compensates for long term spectral effects such as communication channel distortion.
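For reference, a minimal per-utterance CMVN sketch, assuming the features are a (frames, coefficients) array and that the statistics are computed over the whole utterance (other variants use sliding windows or per-speaker statistics):

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalize each coefficient to zero mean and unit variance
    over the frames of one utterance."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```

A stationary channel shows up as a constant offset in the cepstral domain, so subtracting the per-utterance mean removes it: `cmvn(x + c)` equals `cmvn(x)` for any constant vector `c`.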

In this paper, they also used 24 mel filter coefficients.


Sparse imputation for noise robust speech recognition using soft masks

2009-Sparse imputation for noise robust speech recognition using soft masks

Using soft masks in the sparse imputation approach yields a substantial increase in recognition accuracy, most notably at low SNRs.

The general idea behind missing data techniques (MDT) is that it is possible to estimate – prior to decoding – which spectro-temporal elements of the acoustic representations are reliable (i.e. dominated by speech) and which are unreliable (i.e. dominated by background noise). These reliability estimates, referred to as a spectrographic mask, can then be used to treat reliable and unreliable features differently, for instance by replacing the unreliable features with clean speech estimates (i.e. missing data imputation).

The speech representation is a power spectrogram. In experiments with artificially added noise, the so-called oracle masks can be computed directly using

Screenshot from 2013-04-15 19:39:29


where S(k,t) is the speech spectrogram for frequency band k at time frame t, and N(k,t) is the noise spectrogram. If log-spectral features are used, reliable noisy speech coefficients can be used directly as estimates of the clean speech features, since for |S(k,t)| >> |N(k,t)|, log(|S(k,t)+N(k,t)|) = log(|S(k,t)|) + log(|1+N(k,t)/S(k,t)|) ≈ log(|S(k,t)|). Other mask estimation techniques can be found in [Cerisara, Christophe, Sébastien Demange, and Jean-Paul Haton. “On noise masking for automatic missing data speech recognition: A survey and discussion.” Computer Speech & Language 21.3 (2007): 443-457.]

The authors represent both the speech and the mask using sparse vectors. This is similar to network models that reconstruct their inputs from hidden factors, when those factors are also sparse. The two techniques are closely related, with the network models being a more integrated way of doing the same thing.

Instead of hard masking decisions, soft masks can be used. The simplest soft mask applies the sigmoid function to the hard formulation above:

Screenshot from 2013-04-15 19:50:07


The authors converted the acoustic feature representations to a time normalized representation (a fixed number of acoustic feature frames) using spline interpolation. They have chosen 20log_{10}(\theta) = -3dB for the oracle mask.

The features used are mel frequency log power spectra (23 bands with center frequencies starting at 100Hz). After imputation of the missing static acoustic features, delta and delta-delta coefficients were calculated on these individual digits.
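The dynamic features can be sketched with the usual regression formula for delta coefficients. The window size of 2 and the edge padding are assumptions here, since the notes do not give the exact configuration.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta coefficients over +/- window frames.

    features : (frames, coefficients) static (or delta) features
    """
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n : window + n + n_frames]
        behind = padded[window - n : window - n + n_frames]
        out += n * (ahead - behind)
    return out / denom
```

Delta-delta coefficients follow by applying the same function twice, `delta(delta(static))`, on the imputed static features.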


Incorporating mask modeling for noise robust automatic speech recognition

2009-Incorporating mask modeling for noise robust automatic speech recognition

Overview of robustness techniques:

There have been several different ways to improve noise robustness. A feature representation of the speech signal that is more robust to the effect of noise can be sought. The speech signal can be enhanced before it is passed to the recognizer by techniques such as spectral subtraction, Wiener filtering, or MAP-based enhancement. Assuming some knowledge about the noise is available, noise-compensation techniques can be applied in the feature or model domain to reduce the mismatch between the training and testing data. When only information on the location of noise-corrupted spectro-temporal elements is available (referred to as the mask), missing feature theory (MFT) can be employed to improve noise robustness by marginalizing these elements in the observation probability calculation [Lippmann, Richard P., and Beth A. Carlson. “Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise.” Proc. Eurospeech. Vol. 97. 1997.][Cooke, Martin, et al. “Robust automatic speech recognition with missing and unreliable acoustic data.” Speech communication 34.3 (2001): 267-285.].
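The marginalization step is easy to illustrate for a diagonal-covariance GMM, where integrating out the unreliable dimensions simply drops them from the product. This sketch assumes full marginalization (no bounds on the unreliable values) and a single state's mixture; the names are illustrative and it is not the mask-modeling scheme proposed in the paper.

```python
import numpy as np

def marginal_loglik(x, reliable, weights, means, variances):
    """Log-likelihood of the reliable components of x under a
    diagonal-covariance GMM, with unreliable components integrated out.

    x         : (D,) feature vector
    reliable  : (D,) boolean mask, True where the feature is reliable
    weights   : (K,) mixture weights
    means     : (K, D) component means
    variances : (K, D) component variances
    """
    r = np.asarray(reliable, dtype=bool)
    diff = x[r] - means[:, r]
    var = variances[:, r]
    log_comp = -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=1)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over the mixture components.
    peak = log_comp.max()
    return peak + np.log(np.sum(np.exp(log_comp - peak)))
```

An unreliable dimension has no influence on the score, however far its observed value lies from the model means.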

The spectral peaks are more robust to a broad-band noise than the spectral valleys or harmonicity information.

Experimental results show significant improvements in recognition performance in strong noisy conditions achieved by the models incorporating the mask modeling.

This paper investigates an incorporation of the mask modeling into an HMM-based automatic speech recognition system in noisy conditions. The problem is mathematically well formulated. In the evaluation, the oracle mask has been shown to provide significant recognition accuracy improvements in all noisy conditions. As the focus of the paper is mask integration, a simple mask estimation based on a noise-estimate and sub-band voicing information [Jancovic, Peter, and M. Kokuer. “Estimation of voicing-character of speech spectra based on spectral shape.” Signal Processing Letters, IEEE 14.1 (2007): 66-69.] is used and only moderate improvements are obtained.

The features they use are log filterbank features. The open question is how they derived the oracle mask.

This work is carried out on Aurora 2 and the authors have previously tested it on VCV Challenge. [Jancovic, P., and K. Münevver. “On the mask modeling and feature representation in the missing-feature ASR: evaluation on the consonant challenge.” Proceedings of Interspeech. 2008.]