GRBM and automatic feature extraction for noise robust missing data mask estimation

In this paper, a Gaussian restricted Boltzmann machine (GRBM) is used as an unsupervised feature extractor to automatically discover features for the prediction of the ideal binary mask (IBM). The GRBM-extracted features are then fed into SVM classifiers to predict whether a given time-frequency unit of the input speech signal should be masked out or not.

Some of the previous work in the field has treated mask estimation as a binary classification problem, training machine-learning classifiers such as GMMs or SVMs on several acoustic features in conjunction. These multi-feature approaches counteract adverse environmental factors with a comprehensive set of features: cues discriminating between speech and non-speech are effective in non-speech noisy environments [4], whereas directional cues provide information on competing speakers [5,7].

As an alternative to basing the multi-feature approach on a set of "design" features, a GRBM can be trained to learn the acoustic patterns directly, yielding an arguably better-performing set of features.

Ultimately, the trade-off between design features and automatically learned features reduces to quantity versus quality: the discriminative power of a single automatically learned feature may be small, but the number of such features can be made arbitrarily large, whereas a single design feature such as the interaural time difference (ITD) or interaural level difference (ILD) may be effective on its own, but the overall number of design features is usually much smaller.

In some previous studies, the common approach to time-frequency unit classification has been to develop descriptive heuristic measures, or design features, some of which are processed through a rather complex model [4,5]. However, relevant information may be lost when the data are summarized in just a few features, especially for speech signals.

Similarly, a recent study by Wang et al. [23] suggested combining a number of standard ASR features that were less processed than design features for missing data mask estimation.

The paper compares the GRBM-learned features against 14 design features on the dual-channel, multisource, reverberant CHiME corpus.

Cross-correlation vectors computed from bandpass-filtered speech signals were used as input to the GRBM to generate a set of features. A single GRBM with 50 hidden units was trained on 20,000 coefficient-normalized sample vectors for 2,000 epochs with a mini-batch size of 64. NReLU hidden units and contrastive divergence (CD) with the enhanced gradient and an adaptive learning rate were used. A single standard deviation, shared by all visible units, was learned.

Frequency-dependent SVMs with RBF kernels are trained to predict IBMs, which are computed from the parallel clean and noise data of the CHiME corpus. For IBM computation, time-frequency units with a local SNR above 0 dB are marked as reliable.
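The IBM computation is straightforward when parallel clean and noise signals are available. A minimal sketch, assuming power spectrograms as inputs (the function name and threshold parameter are illustrative; the paper's threshold is 0 dB):

```python
import numpy as np

def ideal_binary_mask(clean_power, noise_power, snr_threshold_db=0.0):
    """Mark a time-frequency unit reliable when its local SNR exceeds
    the threshold. Inputs are power spectrograms computed from the
    parallel clean/noise data."""
    eps = np.finfo(float).tiny  # avoid log(0) in silent units
    snr_db = 10.0 * np.log10((clean_power + eps) / (noise_power + eps))
    return snr_db > snr_threshold_db
```

The resulting boolean mask is then the classification target for the frequency-dependent SVMs.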

In the evaluation, time-frequency regions of the estimated masks that contained fewer than 20 connected reliable elements were removed from the masks.
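This post-processing step amounts to connected-component filtering on the binary mask. A sketch using SciPy's labeling (4-connectivity is assumed here; the paper does not state the connectivity used):

```python
import numpy as np
from scipy import ndimage

def prune_small_regions(mask, min_size=20):
    """Remove connected reliable regions with fewer than `min_size`
    time-frequency elements from a boolean mask."""
    labels, n_regions = ndimage.label(mask)  # default 4-connectivity
    if n_regions == 0:
        return mask
    # size of each labeled region (labels run from 1 to n_regions)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n_regions + 1))
    keep_labels = 1 + np.flatnonzero(sizes >= min_size)
    return mask & np.isin(labels, keep_labels)
```

Small isolated patches of "reliable" units are typically estimation noise, so pruning them tends to clean up the mask before imputation.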

Cluster-based imputation (CBI) is used to reconstruct the missing data. In CBI, a GMM is trained to represent the distribution of clean-speech feature vectors. The model is then used to fill in the missing values of an observed feature vector with the most probable values. CBI assumes that the reliable components of the observation vector are the true values of a clean-speech feature vector, while the unreliable components represent an upper bound on the clean-speech estimate; this follows from the additive-noise assumption, which states that the energy of a signal with additive noise is always at least the energy of the clean signal.
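A simplified sketch of the CBI idea, assuming a diagonal-covariance GMM: component posteriors are computed from the reliable dimensions only, the unreliable dimensions are imputed with the posterior-weighted component means, and the result is clipped to the observed values, which serve as upper bounds. (The paper's full CBI also conditions on the bounds themselves; this version is a deliberate simplification.)

```python
import numpy as np

def cbi_impute(x, reliable, means, variances, weights):
    """Impute unreliable components of feature vector `x`.
    means, variances: (K, D) diagonal-GMM parameters for clean speech;
    weights: (K,) mixture weights; reliable: (D,) boolean mask."""
    r = reliable
    # log-likelihood of the reliable dims under each component
    ll = -0.5 * (((x[r] - means[:, r]) ** 2) / variances[:, r]
                 + np.log(2.0 * np.pi * variances[:, r])).sum(axis=1)
    ll += np.log(weights)
    post = np.exp(ll - ll.max())   # stable softmax over components
    post /= post.sum()
    est = post @ means             # posterior-weighted clean-speech mean
    out = x.copy()
    # additive-noise assumption: the observation bounds the clean value
    out[~r] = np.minimum(est[~r], x[~r])
    return out
```

Reliable components pass through unchanged; only the masked-out units are replaced by bounded clean-speech estimates.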

The recognition system is a GMM-HMM based large-vocabulary continuous speech recognition (LVCSR) system.

The 14 design features used in the baseline mask estimation system for comparison are: the modulation-filtered spectrogram, the mean-to-peak ratio and gradient of the temporal envelope, harmonic and inharmonic energies, noise estimates from the long-term inharmonic energy and the channel difference, noise gain, spectral flatness, the subband-energy-to-subband-noise-floor ratio, ITD, ILD, peak ITD, and interaural coherence.