Category Archives: Paper

GRBM and automatic feature extraction for noise robust missing data mask estimation


In this paper, a Gaussian-Bernoulli restricted Boltzmann machine (GRBM) is used as an unsupervised feature extractor to automatically discover features for predicting the ideal binary mask (IBM). The GRBM-extracted features are then fed into SVM classifiers to predict whether a specific time-frequency unit of the input speech signal should be masked out or not.

Some previous work in the field has treated mask estimation as a binary classification problem, training machine-learning classifiers such as GMMs or SVMs on several acoustic features in conjunction. These multi-feature approaches counteract adverse environmental factors with their comprehensive sets of features: cues discriminating between speech and non-speech are effective in non-speech noisy environments [4], whereas directional cues provide information on competing speakers [5,7].

As an alternative to basing the multi-feature approach on a set of “design” features, a GRBM can be trained to learn the acoustical patterns for an arguably better-performing set of features.

Ultimately, the confrontation between design and automatically learned features reduces to quantity versus quality; the discrimination power of a single automatically learned feature may be small but the number of them can be made arbitrarily large, whereas a single design feature such as interaural time difference or interaural level difference may be effective alone but the overall number of them is usually much smaller.

In some previous studies, the common approach to time-frequency unit classification has been to develop descriptive heuristic measures, or design features, some of which are processed through a rather complex model [4,5]. However, relevant information may be lost when data is described by just a few features, especially for speech signals.

Similarly, a recent study by Wang et al. [23] suggested combining a number of standard ASR features that were less processed than design features for missing data mask estimation.

The paper compares the GRBM-learnt features against 14 design features on the dual-channel, multisource, reverberant CHiME corpus.

The cross-correlation vectors from bandpass-filtered speech signals were used as input to the GRBM to generate a set of features. A single GRBM with 50 hidden units was trained on 20,000 coefficient-normalized sample vectors for 2,000 epochs with a mini-batch size of 64. Noisy rectified linear (NReLU) hidden units and contrastive divergence (CD) with the enhanced gradient and an adaptive learning rate were used. A single standard deviation was shared and learned for all visible units.

Frequency-dependent SVMs with RBF kernels are trained to predict IBMs, which are computed from the parallel data in CHiME. For IBM computation, units with SNRs above 0 dB are marked as reliable.

In evaluation, TF regions of the estimated masks that contained fewer than 20 connected reliable elements were removed from the masks.
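The clean-up step above can be sketched as a connected-component filter. This is only an illustration: the function name is hypothetical, and 4-connectivity (up, down, left, right neighbours) is an assumption, since the notes do not say how "connected" is defined in the paper.

```python
from collections import deque

def remove_small_regions(mask, min_size=20):
    """Zero out connected groups of reliable (1) units smaller than min_size.

    mask is a list of rows of 0/1 values. 4-connectivity is assumed here.
    """
    n_rows = len(mask)
    n_cols = len(mask[0]) if n_rows else 0
    cleaned = [row[:] for row in mask]
    seen = [[False] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected region.
            region = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < n_rows and 0 <= nj < n_cols
                    if inside and mask[ni][nj] == 1 and not seen[ni][nj]:
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            # Regions below the size limit are treated as unreliable.
            if len(region) < min_size:
                for i, j in region:
                    cleaned[i][j] = 0
    return cleaned
```

The input mask is left untouched and a cleaned copy is returned, so the estimated and post-processed masks can be compared side by side.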

Cluster-based imputation (CBI) is used to reconstruct the missing data. In CBI, a GMM is created to represent the distribution of feature vectors of clean speech. The model is used to fill the missing values of the observed feature vector with the most probable values. CBI assumes that the reliable components of the observation vector are the real values of a clean speech feature vector and that the unreliable components represent an upper bound on the clean speech estimate; this follows from the additive noise assumption, which states that the energy of a signal with additive noise is always higher than the energy of the clean signal.

The recognition system is a GMM-HMM LVCSR system.

The 14 design features used in the baseline conventional mask estimation system for comparison are: modulation-filtered spectrogram, mean-to-peak ratio and gradient of the temporal envelope, harmonic and inharmonic energies, noise estimates from long-term inharmonic energy and channel difference, noise gain, spectral flatness, subband energy to subband noise floor ratio, ITD, ILD, peak ITD and interaural coherence.
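A much simplified sketch of the CBI idea follows. It assumes a GMM with diagonal covariances, picks the single most probable cluster from the reliable components only, and fills each unreliable component with the cluster mean capped at the observed value (the upper bound). The function name and these simplifications are assumptions for illustration, not the paper's exact procedure.

```python
import math

def cluster_based_imputation(obs, reliable, weights, means, variances):
    """Fill unreliable components of obs using the best-matching cluster.

    obs: observed (noisy) feature vector
    reliable: list of booleans, True where the component is reliable
    weights, means, variances: diagonal-covariance GMM parameters
    """
    best_k, best_score = 0, -math.inf
    for k, (w, mu, var) in enumerate(zip(weights, means, variances)):
        # Score each cluster on the reliable components only.
        score = math.log(w)
        for y, r, m, v in zip(obs, reliable, mu, var):
            if r:
                score += -0.5 * (math.log(2.0 * math.pi * v) + (y - m) ** 2 / v)
        if score > best_score:
            best_k, best_score = k, score
    mu = means[best_k]
    # Reliable values are kept; unreliable ones are bounded above by the observation.
    return [y if r else min(m, y) for y, r, m in zip(obs, reliable, mu)]
```

The `min(m, y)` step is where the additive noise assumption enters: the clean value can never exceed what was observed.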

A new Bayesian method incorporating with local correlation for IBM estimation

2013-A new bayesian method incorporating with local correlation for IBM estimation

Much effort has gone into estimating the Ideal Binary Mask (IBM) with statistical learning methods, and Bayesian methods are common among them. One drawback, however, is that the mask is estimated for each time-frequency unit independently, so the correlation between units is not fully taken into account. This paper attempts to use the local correlation between the mask labels of adjacent units directly. The idea is derived from a demonstrated assumption that units belonging to one segment are mainly dominated by one source. In addition, a local noise level tracking stage is incorporated. The local level is obtained by averaging among several adjacent units and can be considered an approximation to the true noise energy. It is used as an intermediary auxiliary variable to indicate the correlation. With some secondary factors omitted, the high-dimensional posterior distribution is simulated by a Markov chain Monte Carlo method.

The main computational goal of CASA has been set to obtaining the ideal binary mask.

This paper uses a T-F representation from a bank of auditory filters in the form of a cochleagram. Under the T-F representation, the concept of the IBM is directly motivated by the auditory masking phenomenon. Roughly speaking, the louder sound renders the weaker sound inaudible within a critical band.

The threshold LC (local criterion) is a local signal-to-noise ratio in dB. Varying LC leads to different IBMs, and many researchers focus on the selection of this threshold. In [21], the authors suggested that the IBM defined by a -6 dB criterion produces a dramatic intelligibility improvement. The studies in [24], [27] showed that the IBM gives the optimal SNR gain under a 0 dB threshold. Generally, we could start with 0 dB and vary it only if necessary.

The input signal is decomposed in the frequency domain with a 64-channel gammatone filterbank, a standard model of cochlear filtering. The center frequencies are equally distributed on the equivalent rectangular bandwidth (ERB) scale from 50 Hz to 8000 Hz.

IBM estimation, which is the main goal of CASA, can be viewed as a binary classification problem.

Extracting accurate pitch contours from mixtures will improve the IBM estimation greatly.

This paper focuses on IBM estimation when the pitch is given.
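To make the role of the threshold concrete, here is a minimal sketch of the IBM definition, assuming the target and interference energies of each T-F unit are available separately (which is exactly what makes the mask "ideal"). The function name and the small `eps` guard are illustrative choices.

```python
import math

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Return a mask with 1 where the local SNR in dB exceeds lc_db, else 0.

    Both inputs are 2-D lists of non-negative energies with the same shape.
    """
    eps = 1e-12  # avoids log of zero for silent units
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            snr_db = 10.0 * math.log10((t + eps) / (n + eps))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask
```

Lowering `lc_db` from 0 to -6 keeps more units: a unit where the target is 3 dB below the noise is discarded at 0 dB but retained at -6 dB.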
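As a rough illustration of the channel layout, the sketch below spaces 64 center frequencies uniformly on an ERB-rate scale between 50 Hz and 8000 Hz. It assumes the Glasberg and Moore ERB-rate formula; the notes do not say which variant of the scale the paper uses, and the function names are hypothetical.

```python
import math

def hz_to_erb_rate(f_hz):
    """Glasberg and Moore ERB-rate scale (assumed here)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_spaced_center_frequencies(n_channels=64, f_low=50.0, f_high=8000.0):
    """Center frequencies in Hz, equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]
```

The resulting channels are dense at low frequencies and sparse at high frequencies, which is the property that distinguishes this layout from equally spaced STFT bins.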

In this paper, T-F segmentation and the noise level tracking are used to depict the correlation between adjacent units from different perspectives.

 

MMSE Based Missing Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

2013-MMSE based missing feature reconstruction with temporal modeling for robust speech recognition

This paper proposes temporal modeling for missing feature reconstruction using MMSE estimation. It falls into the feature imputation category of missing feature theory. The paper focuses only on the second stage of missing feature theory: once the masks are known (either oracle or estimated), how can the masked features be used for recognition? Techniques of this kind try to reconstruct the masked, unreliable features using the noisy features and the masked speech features. The estimation of the noise masks is not explored; only oracle masks and masks from a simple noise estimate based on the beginning and ending frames are tested.

The missing data approach to noise robust speech recognition assumes that the log-spectral features (FBanks) can be either almost unaffected by noise or completely masked by it.

The performance of automatic speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. One source of mismatch that still remains a major issue in the ASR research community is additive noise.

Accomplishing noise robustness is a key issue in making these systems deployable in real-world conditions.

Although marginalization performs optimal decoding with missing features, it suffers from two main drawbacks. First, the standard decoding algorithm must be modified to account for missing features. Second, recognition has to be carried out with spectral features. However, it is well known that cepstral features outperform spectral ones for speech recognition. With spectral features, the acoustic model needs to employ Gaussian mixtures with full covariance matrices or an increased number of Gaussians with diagonal covariance.

For GMM-based systems, feature imputation is compulsory, whereas DNNs are themselves capable of handling missing features, so no imputation stage is required. The only focus would then be how to estimate perfect masks.

A GMM is used to represent clean speech and a minimum mean square error criterion is adopted to obtain suitable estimates of the unreliable features.

Missing data assumption derived mathematically:

[Screenshot: mathematical derivation of the missing data assumption]
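A common way to state this assumption, which may differ in detail from the paper's own derivation, is the log-max approximation: if speech and noise add in the power domain, the noisy log-spectral feature is y = log(exp(x) + exp(n)), which is close to max(x, n) whenever one source dominates. The sketch below (function names are illustrative) measures how far the approximation is off.

```python
import math

def log_add(x, n):
    """Exact noisy log-energy log(exp(x) + exp(n)), computed stably."""
    m = max(x, n)
    return m + math.log1p(math.exp(-abs(x - n)))

def log_max_error(x, n):
    """Difference between the exact value and the max(x, n) approximation."""
    return log_add(x, n) - max(x, n)
```

The error is largest (log 2) when speech and noise have equal energy and shrinks quickly as one of them dominates, which is why a feature can be treated as either almost unaffected or completely masked.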

Both oracle and estimated masks are tested. The oracle mask is obtained by direct comparison between the clean and noisy utterances using a threshold of 7 dB SNR. In this paper, mask estimation and spectral reconstruction are performed in the log mel filterbank domain, and 23 filterbank channels are used. Acoustic models trained on clean speech are employed in each task. For Aurora 4, a bigram LM is used for decoding.

Missing data masks are computed from noise estimates obtained through linear interpolation of initial noise statistics extracted from the beginning and final frames of every utterance.
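The noise estimator described above can be sketched as follows, assuming log mel features stored as a list of frames. The number of edge frames (`n_edge`), the decision threshold and the function names are hypothetical; the notes only give the 7 dB threshold for the oracle mask, not the settings for the estimated one.

```python
import math

def interpolate_noise(features, n_edge=10):
    """Per-frame noise estimate by linear interpolation between the mean of
    the first n_edge frames and the mean of the last n_edge frames."""
    n_frames = len(features)
    n_bins = len(features[0])

    def mean_frames(frames):
        return [sum(f[b] for f in frames) / len(frames) for b in range(n_bins)]

    start = mean_frames(features[:n_edge])
    end = mean_frames(features[-n_edge:])
    noise = []
    for t in range(n_frames):
        a = t / (n_frames - 1) if n_frames > 1 else 0.0
        noise.append([(1.0 - a) * s + a * e for s, e in zip(start, end)])
    return noise

def mask_from_noise(features, noise, threshold_db=0.0):
    """Mark a unit reliable (1) when its estimated SNR exceeds threshold_db.

    Inputs are natural-log energies; the speech power is estimated by
    subtracting the noise power in the linear domain.
    """
    mask = []
    for y_row, n_row in zip(features, noise):
        row = []
        for y, n in zip(y_row, n_row):
            ratio = math.exp(y - n) - 1.0  # estimated speech-to-noise power ratio
            if ratio <= 0.0:
                row.append(0)
            else:
                row.append(1 if 10.0 * math.log10(ratio) > threshold_db else 0)
        mask.append(row)
    return mask
```

Because the estimate only moves along a straight line between the two ends of the utterance, it cannot follow noise that changes in the middle, which matches the limitation with non-stationary noise noted further down.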

The improvement using oracle masks is especially noticeable at medium and low SNRs. Thereby, the mismatch due to noise can be effectively reduced with only knowledge of the masking pattern.

Under proper knowledge of the masking pattern, the mismatch introduced by the noise can be significantly palliated just by a suitable exploitation of source correlations.

The proposed reconstruction techniques suffer little degradation with respect to the clean condition.

The poor performance with estimated masks is largely due to the simple noise estimation technique employed, which cannot suitably account for non-stationary noise. However, this simple noise estimator has been useful to demonstrate the utility of the proposed temporal modeling for MD approaches, which consistently improves ASR performance over other MD systems with the estimated masks as well as the oracle masks.

 

Accurate marginalization range for missing data recognition

2007-Accurate marginalization range for missing data recognition

The authors proposed a new missing data recognition approach, in which reduced marginalization intervals are computed for each possible mask. The set of all possible masks and intervals is obtained by clustering on a clean and noisy stereo training corpus. The main principle of the proposed approach consists in training accurate marginalization intervals that are as small as possible, in order to improve the precision of marginalization.

The spectral ratio X/Y between clean and noisy speech is computed on a stereo training corpus. This results in a time-frequency representation that provides for every noisy spectral feature the relative contribution of the clean speech energy. This ratio is related to the local SNR as follows:

[Screenshot: equation relating the spectral ratio X/Y to the local SNR]
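One way to work out such a relation, assuming X and Y are powers and the noise N is additive so that Y = X + N: the local SNR is X/N = (X/Y) / (1 - X/Y). This is a reconstruction under that stated assumption and not necessarily the exact form shown in the paper. The function names below are illustrative.

```python
import math

def spectral_ratio_to_snr_db(ratio):
    """Local SNR in dB from the clean-to-noisy power ratio X/Y, with Y = X + N."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    return 10.0 * math.log10(ratio / (1.0 - ratio))

def snr_db_to_spectral_ratio(snr_db):
    """Inverse mapping: X/Y = SNR / (1 + SNR) with SNR in linear units."""
    snr = 10.0 ** (snr_db / 10.0)
    return snr / (1.0 + snr)
```

For example, a ratio of 0.5 means clean speech contributes half of the noisy energy, which corresponds to a local SNR of 0 dB.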

 

The feature domain, which is also the marginalization domain for missing data, is the 12-band Mel spectral domain with cube-root compression of the speech power. Temporal derivatives are further added, leading to a 24-dimensional feature vector.
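The feature layout can be sketched as below, starting from 12 Mel band powers per frame. The delta computation is a plain central difference with edge frames repeated, which is an assumption; the notes do not give the derivative window used in the paper.

```python
def cube_root_compress(power_frames):
    """Apply cube-root compression to each Mel band power."""
    return [[p ** (1.0 / 3.0) for p in frame] for frame in power_frames]

def add_deltas(frames):
    """Append first-order temporal derivatives to every frame.

    A central difference over the neighbouring frames is assumed here.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + delta)
    return out
```

With 12 bands in, `add_deltas(cube_root_compress(frames))` yields 24 values per frame: 12 static and 12 derivative components.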

 

Binary masking and speech intelligibility

2011-Binary masking and speech intelligibility

This thesis mainly discusses various aspects of using binary masking to improve speech intelligibility. Several sections in the Introduction give detailed, systematic reviews of the binary masking technique:

4. Binary Masking

5. Sparsity of Speech

6. Oracle Masks

7. Application of the Binary Mask

8. Time-Frequency Masking

The noise robustness is formulated as a source separation task, “the cocktail party problem”, that separates the target speech from the interfering noise sounds.

Speech separation enhances the speech and reduces the background noise before transmission.

The human auditory system efficiently identifies and separates the sources prior to recognition at a higher level.

The decreased intelligibility can be compensated either by separating the target speech from the interfering sounds, by enhancement of the target speech, or by reducing the interfering sound.

Speech is robust and redundant, which means that part of the speech sound can be lost or modified without negative impact on intelligibility. [Miller, George A., and J. C. R. Licklider. “The intelligibility of interrupted speech.” The Journal of the Acoustical Society of America 22 (1950): 167.] [Warren, Richard M. “Perceptual restoration of missing speech sounds.” Science 167.3917 (1970): 392-393.] [Howard‐Jones, Paul A., and Stuart Rosen. “Uncomodulated glimpsing in ‘checkerboard’ noise.” The Journal of the Acoustical Society of America 93 (1993): 2915.]

In binary masking, sound sources are assigned as either target or interferer in the time-frequency domain. The target sound (speech) is kept by using the value one in the binary mask, whereas the regions with the interferer are removed by using the value zero.

In short, binary masking is a method of applying a binary, frequency-dependent, and time-varying gain in a number of frequency channels, and the binary mask defines what to do when.
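Seen as code, applying the mask is just a 0/1 gain per time-frequency unit. The sketch below assumes the T-F representation and the mask are 2-D lists of the same shape; the function name is illustrative.

```python
def apply_binary_mask(tf_units, mask):
    """Keep T-F units where the mask is 1 and zero out the rest."""
    same_rows = len(tf_units) == len(mask)
    if not same_rows or any(len(a) != len(b) for a, b in zip(tf_units, mask)):
        raise ValueError("mask and T-F representation must have the same shape")
    return [
        [x if m == 1 else 0.0 for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(tf_units, mask)
    ]
```

All of the difficulty sits in deciding the mask values; once they are known, the separation itself is this single element-wise operation followed by resynthesis.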

Binary masking therefore has two parts: estimating the binary mask, and applying the binary mask to carry out the source separation.

The term oracle mask is used for binary masks calculated using a priori knowledge that is not available in most real-life applications. A major objection to the concept of oracle masks is that it is of no use in real-life applications because of the required a priori knowledge. However, oracle masks establish an upper limit on performance, which makes them useful as references and goals for binary masking algorithms developed for real-life applications such as hearing aids.

The local SNR criterion is the threshold for classifying a time-frequency unit as dominated by the target or the interferer sound, and this threshold controls the number of ones in the ideal binary mask.

The target binary mask can be calculated by comparing the target speech directly with the long-term average spectrum of the target speech.

The ideal binary mask requires the interferer to be available and will change depending on the type of interferer, whereas the target binary mask is calculated from the target sound only and therefore is independent of the interferer sound.
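The difference from the ideal binary mask shows up clearly in code: the target binary mask needs only the target. In this minimal sketch the long-term average spectrum is taken as the mean energy of each frequency channel over all frames, and the criterion in dB is a free parameter; both the function name and these choices are assumptions for illustration.

```python
import math

def target_binary_mask(target_energy, lc_db=0.0):
    """Mask from the target alone; target_energy[channel][frame].

    A unit is kept when its energy exceeds the channel's long-term
    average energy by more than lc_db.
    """
    eps = 1e-12  # avoids log of zero for silent units
    mask = []
    for channel in target_energy:
        long_term_avg = sum(channel) / len(channel)
        row = []
        for e in channel:
            rel_db = 10.0 * math.log10((e + eps) / (long_term_avg + eps))
            row.append(1 if rel_db > lc_db else 0)
        mask.append(row)
    return mask
```

No interferer signal appears anywhere in the function, so the same mask results whatever noise is later mixed in.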

The masking is applied to 64-dimensional Gammatone filterbank features on the ERB frequency scale. 64 frequency channels are enough to achieve high intelligibility [Li, Ning, and Philipos C. Loizou. “Effect of spectral resolution on the intelligibility of ideal binary masked speech.” The Journal of the Acoustical Society of America 123.4 (2008): EL59-EL64.]. (Power or magnitude spectrum?) How does it differ from the conventional mel triangular filterbank features?

Another set of features is the magnitude of the equally spaced STFT frequencies.

Because the Gammatone filterbank resembles the processing in the human auditory system, it is often used for speech processing and perceptual studies. The STFT can also be used but has the drawback of requiring more frequency channels to obtain the same spectral resolution at low frequencies than the Gammatone filterbank.

The ideal binary mask has been shown to enable increased intelligibility in the ideal situation, whereas the Wiener filter, when tested under realistic conditions, shows an increase in quality while in most situations only preserving intelligibility.

 

An improved model of masking effects for robust speech recognition system

2013-An improved model of masking effects for robust speech recognition system

This paper integrates simultaneous masking, temporal masking, and cepstral mean and variance normalization (CMVN) into the ordinary mel-frequency cepstral coefficient (MFCC) feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain.

One of the biggest obstacles to widespread use of automatic speech recognition technology is robustness to mismatches between training data and testing data, which include environmental noise, channel distortion and speaker variability. The human auditory system, on the other hand, is extremely adept in all of these situations. [Akbari Azirani, A., R. le Bouquin Jeannes, and G. Faucon. “Optimizing speech enhancement by exploiting masking properties of the human ear.” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.] [Hermansky, Hynek. “Should recognizers have ears?” Speech Communication 25.1 (1998): 3-27.]

The simultaneous masking in this paper is applied on the power spectral features from FFT.

In temporal masking, forward masking is much more effective than backward masking.

The proposed front-end feature extraction is modified from the MFCC model provided by Voicebox by integrating lateral inhibition masking, temporal spectral averaging and forward masking, as shown below:

[Screenshot: block diagram of the proposed front-end]

 

The lateral inhibition is applied on the mel-frequency scale. A combination of the original signals and the lateral inhibition outputs is used as the input signal to a recognizer.

Lateral inhibition is very sensitive to spectral changes, and it will enhance the peaks in both speech signals and noise signals. Hence, temporal spectral averaging in the spectral domain is applied.

CMVN is very effective in practice, where it compensates for long-term spectral effects such as communication channel distortion.
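A minimal per-utterance CMVN sketch is given below, assuming the features are a list of cepstral frames. Normalizing over the whole utterance is an assumption; the notes do not say whether the paper uses per-utterance or some other normalization window.

```python
import math

def cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # A constant dimension has zero variance; leave its scale unchanged.
        stds.append(math.sqrt(var) if var > 0.0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```

A stationary channel shows up as a constant offset in the cepstral domain, so subtracting the mean removes it: adding the same value to every frame of a dimension leaves the output of `cmvn` unchanged.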

In this paper, they also used 24 mel filter coefficients.

 

On noise masking for automatic missing data speech recognition: a survey and discussion

2007-On noise masking for automatic missing data speech recognition a survey and discussion

In order to restrict the range of methods, only the techniques using a single microphone are considered.

A similar principle is used in a few other robust methods, such as uncertainty decoding or detection-based recognition.

These so called missing data masks play a fundamental role in missing data recognition, and their quality has a strong impact on the final recognition accuracy.

Although in theory a mask can be defined for any parameterization domain, in practice, such a domain should map distinct frequency bands onto different feature space dimensions, so that a frequency-limited noise only affects a few dimensions of the feature space. This is typically true for frequency-like and wavelet-based parameterizations and for most auditory-inspired analysis front-ends. This is not the case in the cepstrum.

Several missing data studies have shown that soft masks give better results than hard masks.
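The difference between the two mask types can be illustrated with a sigmoid of the local SNR, a common way to build soft masks. The center and slope values here are hypothetical defaults, not taken from the survey.

```python
import math

def hard_mask_value(snr_db, threshold_db=0.0):
    """Binary reliability decision for one T-F unit."""
    return 1 if snr_db > threshold_db else 0

def soft_mask_value(snr_db, center_db=0.0, slope=0.5):
    """Reliability in (0, 1) as a sigmoid of the local SNR (assumed form)."""
    z = slope * (snr_db - center_db)
    z = max(-50.0, min(50.0, z))  # keep exp() in a safe numeric range
    return 1.0 / (1.0 + math.exp(-z))
```

Near the threshold the soft mask returns values close to 0.5 instead of committing to 0 or 1, so an uncertain decision is passed on to the recognizer as uncertainty.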

When the mask is computed based on the local SNR given both the clean and noisy speech signals, the resulting mask is called an oracle mask.

There are two common techniques for handling missing features during recognition: data imputation and data marginalization.

Training stochastic models of masks seems a promising approach to inferring efficient masks.
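Marginalization can be sketched for a single diagonal-covariance Gaussian as follows. Reliable components contribute their usual density, while unreliable components are integrated from minus infinity up to the observed value, which acts as an upper bound on the clean feature. This bounded variant and the function names are assumptions for illustration; the survey covers several variants.

```python
import math

def gaussian_log_pdf(y, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def gaussian_cdf(y, mean, var):
    """Probability that a univariate Gaussian falls at or below y."""
    return 0.5 * (1.0 + math.erf((y - mean) / math.sqrt(2.0 * var)))

def marginalized_log_likelihood(obs, reliable, means, variances):
    """Log-likelihood of obs under a diagonal Gaussian with missing data."""
    total = 0.0
    for y, r, m, v in zip(obs, reliable, means, variances):
        if r:
            total += gaussian_log_pdf(y, m, v)
        else:
            # Integrate the density over (-inf, y]; floor avoids log(0).
            total += math.log(max(gaussian_cdf(y, m, v), 1e-300))
    return total
```

Imputation, by contrast, would first replace the unreliable components with estimates and then evaluate the ordinary density, which is why it works with an unmodified decoder while marginalization does not.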