Monthly Archives: April 2013

A new Bayesian method incorporating with local correlation for IBM estimation

2013-A new bayesian method incorporating with local correlation for IBM estimation

Much effort has gone into Ideal Binary Mask (IBM) estimation via statistical learning methods, and the Bayesian method is a common choice. One drawback, however, is that the mask is estimated for each time-frequency unit independently, so the correlation between units is not fully taken into account. This paper attempts to model the local correlation between the mask labels of adjacent units directly. The approach is derived from a demonstrated assumption that units belonging to one segment are mainly dominated by one source. In addition, a local noise level tracking stage is incorporated. The local level is obtained by averaging over several adjacent units and can be considered an approximation of the true noise energy. It is used as an intermediate auxiliary variable to indicate the correlation. With some secondary factors omitted, the high-dimensional posterior distribution is simulated by a Markov Chain Monte Carlo method.

The main computational goal of CASA is to obtain the ideal binary mask.

This paper uses a T-F representation based on a bank of auditory filters, in the form of a cochleagram. Under the T-F representation, the concept of the IBM is directly motivated by the auditory masking phenomenon. Roughly speaking, the louder sound renders the weaker sound inaudible within a critical band.

The threshold LC is the local signal-to-noise ratio (SNR) criterion in dB. Varying LC leads to different IBMs, and many researchers focus on the selection of this threshold. In [21], the authors suggested that the IBM defined by a -6 dB criterion produces dramatic intelligibility improvements. The studies in [24], [27] showed that the IBM gives the optimal SNR gain with a 0 dB threshold. Generally, one can start with 0 dB and vary it only if necessary.
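The thresholding described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `ideal_binary_mask`, the nested-list layout (frames by channels), and the toy energies are hypothetical, and the premixed target and noise energies are assumed to be available, which is what makes the mask "ideal".

```python
import math

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Label a T-F unit 1 when its local SNR in dB exceeds lc_db, else 0.

    target_energy and noise_energy are lists of frames, each a list of
    per-channel energies (hypothetical layout chosen for this sketch).
    """
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for s, n in zip(t_row, n_row):
            # A tiny floor avoids taking the log of zero.
            snr_db = 10.0 * math.log10(max(s, 1e-12) / max(n, 1e-12))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy energies: 2 frames, 3 channels.
target = [[4.0, 1.0, 0.1], [2.0, 0.5, 3.0]]
noise = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

mask_0db = ideal_binary_mask(target, noise, lc_db=0.0)
mask_m6db = ideal_binary_mask(target, noise, lc_db=-6.0)
```

Lowering LC from 0 dB to -6 dB can only turn zeros into ones, so the -6 dB mask retains more units than the 0 dB mask.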

The input signal is decomposed into the frequency domain with a 64-channel gammatone filterbank, which is a standard model of cochlear filtering. The center frequencies are equally distributed on the equivalent rectangular bandwidth (ERB) scale from 50 Hz to 8000 Hz.
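As a rough sketch of how such center frequencies can be placed, the snippet below spaces 64 channels uniformly on the ERB-rate scale between 50 Hz and 8000 Hz. It assumes the Glasberg and Moore ERB-rate formula, which the notes do not confirm as the one used in the paper, and the function names are hypothetical.

```python
import math

def hz_to_erb_rate(f_hz):
    # Glasberg and Moore ERB-rate scale (assumed here, not confirmed by the paper).
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    # Inverse of hz_to_erb_rate.
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_center_frequencies(n_channels=64, f_low=50.0, f_high=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]

cfs = erb_center_frequencies()
```

The resulting frequencies are densely packed at the low end and increasingly far apart toward 8000 Hz, which mirrors the frequency resolution of the cochlea.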

IBM estimation, which is the main goal of CASA, can be viewed as a binary classification problem.

Extracting accurate pitch contours from mixtures will improve the IBM estimation greatly.

This paper focuses on IBM estimation when the pitch is given.

In this paper, T-F segmentation and the noise level tracking are used to depict the correlation between adjacent units from different perspectives.



MMSE Based Missing Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

2013-MMSE based missing feature reconstruction with temporal modeling for robust speech recognition

This paper proposes temporal modeling for missing feature reconstruction using MMSE. It falls into the feature imputation category of missing feature theory. The paper focuses only on the second stage of missing feature theory, i.e. once the masks are known (either oracle or estimated), how the masked features can be used for recognition. This kind of technique tries to reconstruct the masked, unreliable features using the noisy features and the masked speech features. The estimation of the noise masks is not explored; only oracle masks and masks derived from a simple noise estimate based on the beginning and ending frames are tested.

The missing data approach to noise robust speech recognition assumes that the log-spectral features (FBanks) can be either almost unaffected by noise or completely masked by it.

The performance of automatic speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. One source of the mismatch that still remains as a major issue among the ASR research community is additive noise.

Achieving noise robustness is key to making these systems deployable in real-world conditions.

Although marginalization performs optimal decoding with missing features, it suffers from two main drawbacks. First, the standard decoding algorithm must be modified to account for missing features. Second, recognition has to be carried out with spectral features. However, it is well known that cepstral features outperform spectral ones for speech recognition. With spectral features, the acoustic model needs to employ Gaussian mixtures with full covariance matrices or an increased number of Gaussians with diagonal covariance.

For GMM-based systems, feature imputation is compulsory, whereas DNNs are themselves capable of handling missing features, so no imputation stage is required. The only focus would then be how to estimate the masks accurately.

A GMM is used to represent clean speech and a minimum mean square error criterion is adopted to obtain suitable estimates of the unreliable features.
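A minimal sketch of GMM-based MMSE imputation is given below. It is a simplification, not the paper's method: it assumes diagonal covariances, omits both the temporal modeling and the upper bound that the noisy observation places on the clean value, and uses hypothetical names (`mmse_impute`, `log_gauss_diag`) and toy parameters. Under these assumptions the MMSE estimate of an unreliable feature is the posterior-weighted sum of component means, with the posteriors computed from the reliable features only.

```python
import math

def log_gauss_diag(x, mean, var, dims):
    """Log density of a diagonal Gaussian restricted to the given dimensions."""
    total = 0.0
    for d in dims:
        total += -0.5 * (math.log(2.0 * math.pi * var[d])
                         + (x[d] - mean[d]) ** 2 / var[d])
    return total

def mmse_impute(y, mask, weights, means, variances):
    """Replace unreliable features (mask == 0) by their MMSE estimate."""
    reliable = [d for d, m in enumerate(mask) if m == 1]
    # Component posteriors given the reliable features only.
    log_post = [math.log(w) + log_gauss_diag(y, mu, var, reliable)
                for w, mu, var in zip(weights, means, variances)]
    peak = max(log_post)  # subtract the peak for numerical stability
    post = [math.exp(lp - peak) for lp in log_post]
    norm = sum(post)
    post = [p / norm for p in post]
    x_hat = list(y)
    for d, m in enumerate(mask):
        if m == 0:
            # With diagonal covariances, E[x_d | reliable] is a weighted mean.
            x_hat[d] = sum(p * mu[d] for p, mu in zip(post, means))
    return x_hat, post

# Toy two-component clean-speech model over two features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# First feature is reliable, second is masked by noise.
x_hat, post = mmse_impute([10.0, 20.0], [1, 0], weights, means, variances)
```

Here the reliable feature matches the second component almost exactly, so the masked feature is reconstructed close to that component's mean of 5.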

Missing data assumption derived mathematically:

Screenshot from 2013-04-23 12:22:00

Both oracle and estimated masks are tested. The oracle mask is obtained by direct comparison between the clean and noisy utterances using a threshold of 7 dB SNR. In this paper, mask estimation and spectral reconstruction are performed in the log mel filterbank domain, and 23 filterbank channels are used. Acoustic models trained on clean speech are employed in each task. For Aurora 4, a bigram LM is used for decoding.

Missing data masks are computed from noise estimates obtained through linear interpolation of initial noise statistics extracted from the beginning and final frames of every utterance.
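The noise estimation described above can be sketched roughly as follows. Several details are assumptions made for illustration: the number of edge frames (`n_edge`), natural-log filterbank energies, the spectral-subtraction style SNR estimate, and reusing the 7 dB oracle threshold for the estimated mask. The function names are hypothetical.

```python
import math

def interpolate_noise(log_spec, n_edge=2):
    """Per-channel noise estimate, linearly interpolated across the utterance.

    log_spec is a list of frames, each a list of log filterbank energies.
    The first and last n_edge frames are assumed to contain noise only.
    """
    n_frames = len(log_spec)
    n_chan = len(log_spec[0])
    head = [sum(f[c] for f in log_spec[:n_edge]) / n_edge for c in range(n_chan)]
    tail = [sum(f[c] for f in log_spec[-n_edge:]) / n_edge for c in range(n_chan)]
    noise = []
    for t in range(n_frames):
        a = t / (n_frames - 1) if n_frames > 1 else 0.0
        noise.append([(1.0 - a) * h + a * tl for h, tl in zip(head, tail)])
    return noise

def estimate_mask(log_spec, noise, threshold_db=7.0):
    """Mark a unit reliable (1) when its estimated SNR exceeds threshold_db."""
    mask = []
    for y_row, n_row in zip(log_spec, noise):
        row = []
        for y, n in zip(y_row, n_row):
            # Estimated speech power by subtracting the noise power, floored.
            speech_power = max(math.exp(y) - math.exp(n), 1e-12)
            snr_db = 10.0 * math.log10(speech_power / math.exp(n))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask

# Toy utterance: 5 frames, 1 channel, speech present only in the middle frame.
log_spec = [[0.0], [0.0], [5.0], [1.0], [1.0]]
noise = interpolate_noise(log_spec)
mask = estimate_mask(log_spec, noise)
```

Because the estimate can only follow a straight line between the two ends of the utterance, it cannot track noise that changes in between, which is consistent with the weakness for non-stationary noise noted in these notes.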

The improvement using oracle masks is especially noticeable at medium and low SNRs. Thereby, the mismatch due to noise can be effectively reduced with only knowledge of the masking pattern.

Under proper knowledge of the masking pattern, the mismatch introduced by the noise can be significantly palliated just by a suitable exploitation of source correlations.

The proposed reconstruction techniques suffer little degradation with respect to the clean condition.

This poor performance is largely due to the simple noise estimation technique employed, which cannot suitably account for non-stationary noise. However, this simple noise estimator has been useful for demonstrating the utility of the proposed temporal modeling for MD approaches, which consistently improves ASR performance over other MD systems with estimated masks as well as with oracle masks.



Accurate marginalization range for missing data recognition

2007-Accurate marginalization range for missing data recognition

The authors proposed a new missing data recognition approach, in which reduced marginalization intervals are computed for each possible mask. The set of all possible masks and intervals is obtained by clustering on a clean and noisy stereo training corpus. The main principle of the proposed approach consists in training accurate marginalization intervals that are as small as possible, in order to improve the precision of marginalization.
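To illustrate why smaller marginalization intervals are more precise, the sketch below scores a feature vector against a diagonal Gaussian, using the density for reliable dimensions and the probability mass inside an interval for unreliable ones. It is not the authors' clustering-based method: the intervals are supplied by hand, the model is a single Gaussian per class, and all names and values are hypothetical.

```python
import math

def gauss_cdf(x, mean, var):
    # Cumulative distribution function of a univariate Gaussian.
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def log_gauss(x, mean, var):
    # Log density of a univariate Gaussian.
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(y, mask, intervals, mean, var):
    """Density for reliable dims, interval probability mass for unreliable dims."""
    total = 0.0
    for d in range(len(y)):
        if mask[d] == 1:
            total += log_gauss(y[d], mean[d], var[d])
        else:
            lo, hi = intervals[d]
            mass = gauss_cdf(hi, mean[d], var[d]) - gauss_cdf(lo, mean[d], var[d])
            total += math.log(max(mass, 1e-300))
    return total

# Two competing models that differ only in the masked dimension.
y, mask = [1.0, 10.0], [1, 0]
mean_a, mean_b, var = [1.0, 2.0], [1.0, 6.0], [1.0, 1.0]

wide = [None, (0.0, 10.0)]   # classic bound: between zero and the noisy value
narrow = [None, (5.0, 7.0)]  # a reduced interval around the clean value

wide_a = marginal_log_likelihood(y, mask, wide, mean_a, var)
wide_b = marginal_log_likelihood(y, mask, wide, mean_b, var)
narrow_a = marginal_log_likelihood(y, mask, narrow, mean_a, var)
narrow_b = marginal_log_likelihood(y, mask, narrow, mean_b, var)
```

With the wide interval both models receive nearly the same score, so the masked dimension contributes almost nothing to the decision. With the reduced interval, the model whose mean lies inside it is clearly preferred.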

The spectral ratio X/Y between clean and noisy speech is computed on a stereo training corpus. This results in a time-frequency representation that provides for every noisy spectral feature the relative contribution of the clean speech energy. This ratio is related to the local SNR as follows:

Screenshot from 2013-04-17 13:57:47


The feature domain, which is also the marginalization domain of missing data, is the 12-band Mel spectral domain with cube-root compression of the speech power. Temporal derivatives are then added, leading to a 24-dimensional feature vector.


Binary masking and speech intelligibility

2011-Binary masking and speech intelligibility

This thesis mainly discusses various aspects of using binary masking to improve speech intelligibility. Several sections of the Introduction give detailed, systematic reviews of the binary masking technique:

4. Binary Masking

5. Sparsity of Speech

6. Oracle Masks

7. Application of the Binary Mask

8. Time-Frequency Masking

Noise robustness is formulated as a source separation task, “the cocktail party problem”, in which the target speech is separated from the interfering noise sounds.

Speech separation enhances the speech and reduces the background noise before transmission.

The human auditory system efficiently identifies and separates the sources prior to recognition at a higher level.

The decreased intelligibility can be compensated either by separating the target speech from the interfering sounds, by enhancement of the target speech, or by reducing the interfering sound.

Speech is robust and redundant, which means that part of the speech sound can be lost or modified without negative impact on intelligibility. [Miller, George A., and J. C. R. Licklider. “The intelligibility of interrupted speech.” The Journal of the Acoustical Society of America 22 (1950): 167.] [Warren, Richard M. “Perceptual restoration of missing speech sounds.” Science 167.3917 (1970): 392-393.] [Howard‐Jones, Paul A., and Stuart Rosen. “Uncomodulated glimpsing in ‘‘checkerboard’’ noise.” The Journal of the Acoustical Society of America 93 (1993): 2915.]

In binary masking, sound sources are assigned as either target or interferer in the time-frequency domain. The target sound (speech) is kept by using the value one in the binary mask, whereas the regions with the interferer are removed by using the value zero.

In short, binary masking is a method of applying a binary, frequency-dependent, and time-varying gain in a number of frequency channels, and the binary mask defines what to do when.
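Applying the mask as a binary gain can be sketched in a few lines. The nested-list layout (frames by channels) and the name `apply_binary_mask` are hypothetical, and resynthesis of a waveform from the masked units is left out.

```python
def apply_binary_mask(tf_units, mask):
    """Keep units where the mask is 1 and zero out units where it is 0."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(tf_units, mask)]

# Toy mixture: 2 frames, 3 frequency channels.
mixture = [[0.8, 0.3, 0.5], [0.2, 0.9, 0.4]]
mask = [[1, 0, 1], [0, 1, 0]]

separated = apply_binary_mask(mixture, mask)
```

The gain is time-varying because it changes from frame to frame, and frequency-dependent because it changes from channel to channel within a frame.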

Binary masking involves two steps: estimating the binary mask, and applying the binary mask to carry out the source separation.

The term oracle mask is used for binary masks calculated using a priori knowledge, which is not available in most real-life applications. A major objection to the concept of oracle masks is that they are of no use in real-life applications because of the required a priori knowledge. However, oracle masks establish an upper limit of performance, which makes them useful as references and goals for binary masking algorithms developed for real-life applications such as hearing aids.

The local SNR criterion is the threshold for classifying a time-frequency unit as dominated by the target or by the interferer sound, and this threshold controls the number of ones in the ideal binary mask.

The target binary mask can be calculated by comparing the target speech directly with the long-term average spectrum of the target speech.

The ideal binary mask requires the interferer to be available and will change depending on the type of interferer, whereas the target binary mask is calculated from the target sound only and therefore is independent of the interferer sound.
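The target binary mask described above can be sketched as below. As assumptions for illustration, the long-term average spectrum is computed from the same toy utterance, the comparison uses a threshold in dB named `lc_db`, and the function name is hypothetical.

```python
import math

def target_binary_mask(target_energy, lc_db=0.0):
    """Compare each unit with the long-term average energy of its channel."""
    n_frames = len(target_energy)
    n_chan = len(target_energy[0])
    # Long-term average spectrum of the target: mean energy per channel.
    ltas = [sum(frame[f] for frame in target_energy) / n_frames
            for f in range(n_chan)]
    mask = []
    for frame in target_energy:
        row = []
        for f in range(n_chan):
            ratio_db = 10.0 * math.log10(max(frame[f], 1e-12) / max(ltas[f], 1e-12))
            row.append(1 if ratio_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy target energies: 3 frames, 2 channels.
target = [[4.0, 1.0], [1.0, 1.0], [1.0, 4.0]]
tbm = target_binary_mask(target)
```

The function takes no interferer argument at all, which is the practical difference from the ideal binary mask: the same target always yields the same mask, whatever the noise is.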

The masking is applied to features from a 64-channel Gammatone filterbank on the ERB frequency scale. 64 frequency channels are enough to achieve high intelligibility [Li, Ning, and Philipos C. Loizou. “Effect of spectral resolution on the intelligibility of ideal binary masked speech.” The Journal of the Acoustical Society of America 123.4 (2008): EL59-EL64.]. (Power or magnitude spectrum?) How does it differ from the conventional mel triangular filterbank features?

Another set of features is the magnitude at equally spaced STFT frequencies.

Because the Gammatone filterbank resembles the processing in the human auditory system, it is often used for speech processing and perceptual studies. The STFT can also be used but has the drawback of requiring more frequency channels to obtain the same spectral resolution at low frequencies than the Gammatone filterbank.

The ideal binary mask has been shown to enable increased intelligibility in the ideal situation, whereas the Wiener filter, when tested under realistic conditions, shows an increase in quality while in most situations only preserving intelligibility.


An improved model of masking effects for robust speech recognition system

2013-An improved model of masking effects for robust speech recognition system

The paper integrates simultaneous masking, temporal masking, and cepstral mean and variance normalization into the ordinary mel-frequency cepstral coefficient feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain.

One of the biggest obstacles to widespread use of automatic speech recognition technology is robustness to mismatches between training data and testing data, which include environmental noise, channel distortion and speaker variability. The human auditory system, on the other hand, is extremely adept in all of these situations. [Akbari Azirani, A., R. le Bouquin Jeannes, and G. Faucon. “Optimizing speech enhancement by exploiting masking properties of the human ear.” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.] [Hermansky, Hynek. “Should recognizers have ears?.” Speech Communication 25.1 (1998): 3-27.]

The simultaneous masking in this paper is applied on the power spectral features from FFT.

In temporal masking, forward masking is much more effective than backward masking.

The proposed front-end feature extraction is modified from the MFCC model provided by Voicebox by integrating lateral inhibition masking, temporal spectral averaging and forward masking, as shown below:

Screenshot from 2013-04-16 14:25:15


The lateral inhibition is applied on the mel-frequency scale. A combination of the original signals and the lateral inhibition outputs is used as the input signal to the recognizer.

Lateral inhibition is very sensitive to spectral changes, and it enhances the peaks in both speech signals and noise signals. Hence, temporal spectral averaging in the spectral domain is applied.

CMVN is very effective in practice, where it compensates for long-term spectral effects such as communication channel distortion.
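A minimal sketch of cepstral mean and variance normalization follows. It assumes per-utterance statistics, which is a common choice but is not stated in these notes, and the function name `cmvn` and the toy values are hypothetical.

```python
import math

def cmvn(features, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    features is a list of frames, each a list of cepstral coefficients.
    """
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
            for d in range(dim)]
    # eps guards against division by zero for constant dimensions.
    return [[(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
            for f in features]

clean = [[1.0, 10.0], [3.0, 30.0]]
# A stationary channel adds a constant offset to every cepstral frame.
distorted = [[6.0, 8.0], [8.0, 28.0]]

normalized_clean = cmvn(clean)
normalized_distorted = cmvn(distorted)
```

A constant offset in the cepstral domain corresponds to a fixed channel filter, and mean subtraction removes it, so both inputs normalize to the same values.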

In this paper, they also used 24 mel filter coefficients.


On noise masking for automatic missing data speech recognition: a survey and discussion

2007-On noise masking for automatic missing data speech recognition a survey and discussion

In order to restrict the range of methods, only the techniques using a single microphone are considered.

A similar principle is used in a few other robust methods, such as uncertainty decoding or detection-based recognition.

These so-called missing data masks play a fundamental role in missing data recognition, and their quality has a strong impact on the final recognition accuracy.

Although in theory a mask can be defined for any parameterization domain, in practice, such a domain should map distinct frequency bands onto different feature space dimensions, so that a frequency-limited noise only affects a few dimensions of the feature space. This is typically true for frequency-like and wavelet-based parameterizations and for most auditory-inspired analysis front-ends. This is not the case in the cepstrum.

Several missing data studies have shown that soft masks give better results than hard masks.

When the mask is computed based on the local SNR given both the clean and noisy speech signals, the resulting mask is called an oracle mask.

There are two common techniques for handling missing features during recognition: data imputation and data marginalization.

Training stochastic models of masks seems a promising approach to inferring efficient masks.


Sparse imputation for noise robust speech recognition using soft masks

2009-Sparse imputation for noise robust speech recognition using soft masks

Using soft masks in the sparse imputation approach yields a substantial increase in recognition accuracy, most notably at low SNRs.

The general idea behind missing data techniques (MDT) is that it is possible to estimate, prior to decoding, which spectro-temporal elements of the acoustic representations are reliable (i.e. dominated by speech) and which are unreliable (i.e. dominated by background noise). These reliability estimates, referred to as a spectrographic mask, can then be used to treat reliable and unreliable features differently, for instance by replacing the unreliable features with clean speech estimates (i.e. missing data imputation).

The speech representation is a power spectrogram. In experiments with artificially added noise, the so-called oracle masks can be computed directly using

Screenshot from 2013-04-15 19:39:29


where S(k,t) is the speech spectrogram for frequency band k at time frame t, and N(k,t) is the noise spectrogram. If log-spectral features are used, reliable noisy speech coefficients can be used directly as estimates of the clean speech features, since log(|S(k,t)+N(k,t)|) = log(|S(k,t)|) + log(|1+N(k,t)/S(k,t)|) ≈ log(|S(k,t)|) when the speech dominates the noise. Other mask estimation techniques can be found in [Cerisara, Christophe, Sébastien Demange, and Jean-Paul Haton. “On noise masking for automatic missing data speech recognition: A survey and discussion.” Computer Speech & Language 21.3 (2007): 443-457.]

The authors represent both the speech and the mask using sparse vectors, which is similar to network models that reconstruct their inputs from hidden factors when those factors are also sparse. The two techniques are closely related, and the network models are a more integrated way of doing this.

Instead of hard masking decisions, soft masks can be used. The simplest soft mask applies the sigmoid function to the hard formulation above:

Screenshot from 2013-04-15 19:50:07


The authors converted the acoustic feature representations to a time normalized representation (a fixed number of acoustic feature frames) using spline interpolation. They have chosen 20log_{10}(\theta) = -3dB for the oracle mask.
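The relation between the hard oracle mask and a sigmoid soft mask can be sketched as follows. The exact formulas of the paper are not reproduced in these notes, so this is an assumed form: the SNR is computed in dB from speech and noise power, the threshold is the -3 dB value mentioned above, and the `slope` parameter and the function names are hypothetical.

```python
import math

def snr_db(speech_power, noise_power):
    # Local SNR in dB, with a tiny floor to avoid log of zero.
    return 10.0 * math.log10(max(speech_power, 1e-12) / max(noise_power, 1e-12))

def hard_mask(speech_power, noise_power, threshold_db=-3.0):
    """1 if the local SNR exceeds the threshold, else 0."""
    return 1 if snr_db(speech_power, noise_power) > threshold_db else 0

def soft_mask(speech_power, noise_power, threshold_db=-3.0, slope=1.0):
    """Sigmoid of the SNR margin: a reliability score between 0 and 1."""
    margin = snr_db(speech_power, noise_power) - threshold_db
    return 1.0 / (1.0 + math.exp(-slope * margin))

# A unit exactly at the threshold, one well above it and one well below it.
at_threshold = soft_mask(10.0 ** -0.3, 1.0)
well_above = soft_mask(10.0, 1.0)
well_below = soft_mask(0.01, 1.0)
```

A unit exactly at the threshold gets a score of 0.5, and the score approaches 1 or 0 as the SNR moves away from the threshold, so the hard mask is the limit of the soft mask as the slope grows.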

The features used are mel frequency log power spectra (23 bands with center frequencies starting at 100Hz). After imputation of the missing static acoustic features, delta and delta-delta coefficients were calculated on these individual digits.