Convolution Neural Network for speech recognition

[2012 NIPS; ImageNet Classification with Deep Convolutional Neural Networks; Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton]

The model should also have lots of prior knowledge to compensate for all the data we don’t have. CNN’s capacity can be controlled by varying the depth and breadth, and it also make strong and mostly correct assumptions, namely, stationarity of statistics and locality of pixel dependencies. Highly-optimized GPU implementation of 2D convolution.

5 convolutional and 3 fully-connected layers.

The model is trained on the raw RGB pixel values. The filters are thus 3D. 1000-way softmax output, cross-entropy. The first convolutional layer filters the 224*224*3 input image with 96 kernels of size 11*11*3 with a stride of 4 pixels. The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5*5*48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The 3rd layer has 384 kernels of size 3*3*256. The 4th has 384 kernels of size 3*3*192 and the 5th has 256 kernels of size 3*3*192. The fully-connected layers have 4096 neurons each.

1. ReLU Nonlinearity rather than sigmoid

2. Two GPU parallel training

3. Local response normalization

4. Overlapping pooling

5. Dropout in fully-connected layers

[2012 ICASSP; Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition; Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Gerald Penn]

MFCC features are not suitable. Linear spectrum, Mel-scale spectrum, or filter-banks are all perfect for local filtering in CNN. Let’s assume speech input to CNN is v that is divided into B frequency bands as : v = [v1, v2, …, vB], where vb is the feature vector representing band b. This feature vector vb includes speech spectral features, delta and acceleration parameters from local band b of all feature frames within the current context window.

Activations of the convolution layer are divided into K bands where each band contains J filter activations.

Max-pooling layer generates a lower resolution version of the convolution layer by doing maximization operation every n bands, converting the previous K bands to M bands for each of the J filters (M<K).

Weight sharing: the local filter weights are tied and shared for all positions within the whole input space. Different from the local filter used in images, the weight sharing are only inside the pooling group. As a result, the convolution layer is divided into a number of convolution sections where all convolution bands in each section are pooled together into one pooling layer band and are computed by convolving section filters with a small number of the input layer bands. Each convolution section has J filters.

One disadvantage of this weight sharing is that other pooling layers cannot be added on top of it because the filters outputs in different pooling bands are not related. Therefore, this type of weight sharing is normally used only in the topmost pooling layer.

In training stage, CNN is estimated using the standard back-propagation algorithm to minimize cross entropy of targets and output layer activations. For a max-pooling layer, the error signal is back-propagated only to the convolution layer node that generates the maximum activation within the pooled nodes.


40D Mel FBanks + energy, and 1st and 2nd derivatives. Global CMVN. w15. The input of CNN is divided into 40 bands, each includes one of the 40 FBank along the 15 frames context window. Energy is duplicated to for all the bands. Input padding is used.

CNN is composed of a convolution layer with the limited weight sharing and filter size of 8 bands, a max pooling layer with a sub-sampling factor of 2, and one top fully-connected 1000D hidden layer and one softmax output layer. Compared with a 2 1000D hidden layers DNN.

Deep CNN: a convolution layer, a max-pooling layer and two fully-connected 1000D hidden layers. 84 filters, filter size of 8 bands, pooling size of 6 with limited weight sharing and a sub-sampling factor of 2. 20.07% WER vs. 20.50% obtained from DNN with 3 1000D hidden layers.

[2013 ICASSP; A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion; Li Deng, Ossama Abdel-Hamid, Dong Yu]

Motivation: A larger pooling size enforces a greater degree of invariance in frequency shift but this also carries greater risk of being unable to distinguish among different speech sounds with similar formant frequencies. When a fixed pooling size increases from 1 to 12, increasing confusions among the phones whose major formant frequencies are close to each other is observed.

A natural way is to apply different or heterogeneous pooling size to various subsets of the full feature maps. Following figure illustrate a Heterogeneous-Pooling (HP) CNN with two sets of pooling sizes P1 (of value 2) and P2 (of value 3):

The optimal choice of the pooling sizes is determined by the convolution filter design and, more importantly, by the nature of the phonetic space expressed in scaled frequency in accordance with the input FBank features.

Dropout is applied to both the convolution hidden layer and fully-connected hidden layer.

[2013 ICASSP; Deep convolutional neural networks for LVCSR; Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran]

This paper determins the appropriate architecture to make CNNs effective for LVCSR tasks. Specifically, how many convolutional layers, optimal hidden units, best pooling strategy, and best input features. Evaluated on a 400-hr Broadcast News and 300-hr Switchboard task.

The limited weight sharing prohibits adding more convolution layers. In this paper, the authors argue that they can apply weight sharing across all time and frequency components, by using a large number of hidden units in the convolutional layers to capture the differences between low and high frequency components.

[2013 IS; Exploring convolutional neural network structures and optimization techniques for speech recognition; Ossama Abdel-Hamid, Li Deng, Dong Yu]

1. comparison between full and limited weight sharing

2. convolution in time domain

3. weighted softmax pooling: replacing the max operator with a weighted softmax with learnable weights. The weights have similar effect to modifying pooling size.

4. Convolutional RBM based pre-training

Limited weight sharing provides more gain but full weight sharing is required to use multiple convolution layers. Time domain convolution does not improve the performance. The gain from weighted softmax is small but promising. Pre-training for CNN has relatively small gains compared for DNNs.

Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code; Ossama Abdel-Hamid, Hui Jiang; 2013 ICASSP

A joint training procedure to learn a generic adaptation NN from the whole training set as well as many small speaker codes, one of which is estimated for each speaker only using data from that particular speaker.

Training parameters: the original NN weights, the adaptation weights and the training speakers codes.

Training methods: standard back-propagation with the cross entropy objective function. Adaptation weights and speaker codes are randomly initialized.

Testing: supervised adaptation, only the speaker code is learnt using back-propagation.

Experiments: 40D Mel FBanks + energy and 1st and 2nd temporal derivatives. Global CMVN normalization. Bigram LM. 15 frames input window.

Testing is conducted for each speaker based on a cross validation method. In each run, n utterances for a specific speaker are used for supervised adaptation and the remaining 8-n are used for test. Totally 8 runs per speaker. The overall averaged performance is reported.

Adaptation NN has two 1000D sigmoid hidden layers and a linear output layer. 50D speaker code.

dummy: no speaker codes; 0: speaker codes are all 0s; oracle: same data for adaptation and testing.

GRBM and automatic feature extraction for noise robust missing data mask estimation

GRBM and automatic feature extraction for noise robust missing data mask estimation

In this paper, GRBM is used as an unsupervised feature extractor to automatically discover features for the prediction of IBM. The GRBM extracted features are then fed into SVM classifiers to predict whether a specific time-frequency unit of the input speech signal should be masked out or not.

Some of the previous work on the field have considered mask estimation as a binary classification problem by training machine learning based classifiers such as GMMs or SVMs with several acoustic features in conjunction. There multi-feature approaches counteract the adverse environmental factors with their comprehensive set of features – cues discriminating between speech and non-speech are effective in non-speech noisy environments [4], whereas directional cues provide information on competing speakers [5,7].

As an alternative to basing the multi-feature approach on a set of “design” features, a GRBM can be trained to learn the acoustical patterns for an arguable better performing set of features.

Ultimately, the confrontation between design and automatically learned features reduces to quantity versus quality; the discrimination power of a single automatically learned feature may be small but the number of them can be made arbitrarily large, whereas a single design feature such as interaural time difference or interaural level difference may be effective alone but the overall number of them is usually much smaller.

In some of the previous studies, the common approach for time-frequency unit classification has been to develop descriptive heuristic measures, or design features, some of which are processed through a rather complex model [4,5]. However, relevant information may be lost when data is described in just a few features, especially for the speech signals.

Similarly, a recent study by Wang et al. [23] suggested combining a number of standard ASR features that were less processed than design features for missing data mask estimation.

The paper compares the GRBM learnt features versus 14 design features on a dual channel multisource reverberate CHiME corpus.

The cross-correlation vectors from bandpass filtered speech signals were used as input to the GRBM to generate a set of features. A single GRBM with 50 hidden units was trained with 20,000 coefficient normalized sample vectors in 2000 epochs and a mini-batch size of 64. NReLU hidden units, CD with the enhanced gradient and adaptive learning rate were used. A single standard deviation was shared and learned for all visible units.

Frequency dependent SVMs with RBF kernels are trained to predict IBMs, which are computed from the parallel data from CHiME. For IBM computation, SNRs above 0 are marked as reliable.

In evaluation, TF regions of the estimated masks that contained less than 20 connected reliable elements were removed from the masks.

Cluster-based imputation (CBI) is used to reconstruct the missing data. In CBI, a GMM is created to represent the distribution of feature vectors of clean speech. The model is used to fill the missing values of the observed feature vector with the most probable values. CBI assumes that the reliable components of the observation vector are the real values of a clean speech feature vector and unrealiable components represent an upper bound to the clean speech estimate; this is derived from the additive noise assumption which states that the energy of a singal with additive noise is always higher than the energy of a clean signal.

The recognition system is GMM-HMM LVCSR system.

The 14 design features used in the baseline conventional mask estimation system for comparisons are: modulation-filtered spectrogram, mean-to-peak-ratio and gradient of the temporal envelop, harmonic and  inharmonic energies, noise estimates from long-term inharmonic energy and channel difference, noise gain, spectral flatness, subband energy to subband noise floor ratio, ITD, ILD, peak ITD and interaural coherence.






A new Bayesian method incorporating with local correlation for IBM estimation

2013-A new bayesian method incorporating with local correlation for IBM estimation

A lot of efforts have been made in the Ideal Binary Mask (IBM) estimation via statistical learning methods. The Bayesian method is a common one. However, one drawback is that the mask is estimated for each time-frequency unit independently. The correlation between units has not been fully taken into account. This paper attempts to consider the local correlation information between the mask labels of adjacent units directly. It is derived from a demonstrated assumption that units which belong to one segment are mainly dominated by one source. On the other hand, a local noise level tracking stage is incorporated. The local level is obtained by averaging among several adjacent units and can be considered by averaging among several adjacent units and can be considered as an approach to true noise energy. It is used as the intermediary auxiliary variable to indicate the correlation. While some secondary factors are omitted, the high dimensional posterior distribution is simulated by a Markov Chain Monte Carlo method.

The main computation goal of CASA has been set to obtain the ideal binary mask.

This paper uses a T-F representation of a bank of auditory filters in the form on a cochleagram. Under the T-F representation, the concept of IBM is directly motivated by the auditory masking phenomenon. Roughly speaking, the louder sound causes the weaker sound inaudible within a critical band. 

The threshold LC stands for local signal-to-noise ratio in dB. Varying LC leads to different IBMs and many researchers focus on the selection of this threshold. In [21], the authors suggested that the IBM defined by -6 dB criterion produces dramatically intelligibility improvement. The study in [24], [27] showed that IBM gives the optimal SNR gain under 0 dB threshold. Generally, we could start with 0 dB and vary it unless necessary.

The input signal is decomposed into frequency domain with 64-channel gammatone filters which are standard model of cochlear filtering. The center frequencies equally distributed on the rectangular bandwidth scale from 50 Hz to 8000 Hz.

IBM estimation which is the main goal of CASA can be viewed as a binary classification problem.

Extracting accurate pitch contours from mixtures will improve the IBM estimation greatly.

This paper focus on the IBM estimation while pitch is given.

In this paper, T-F segmentation and the noise level tracking are used to depict the correlation between adjacent units from different perspectives.


MMSE Based Missing Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

2013-MMSE based missing feature reconstruction with temporal modeling for robust speech recognition

This paper proposal a temporal modeling for missing feature reconstruction using MMSE. It falls into the feature imputation category of the missing feature theory. This paper only focuses on the second stage of the missing feature theory, i.e. when the masks are known (either oracle or estimated), how can the masked features be used for recognition. This kind of technique try to reconstruct the masked unreliable features using the noisy feature and the masked speech feature. The estimation of the noise masks are not explored, the oracle and a simple beginning and ending based noise estimates are tested.

The missing data approach to noise robust speech recognition assumes that the log-spectral features (FBanks) can be either almost unaffected by noise or completely masked by it.

The performance of automatic speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. One source of the mismatch that still remains as a major issue among the ASR research community is additive noise.

Accomplishing noise robustness is a key issue to make these systems deplorable in real world conditions.

Although marginalization performs optimal decoding with missing features, it suffers from two main drawbacks. First, the standard decoding algorithm must be modified to account for missing features. Second, recognition has to be carried out with spectral features. However, it is well known that cepstral features outperforms spectral ones for speech recognition. The acoustic model needs to employ Gaussian mixtures with full covariance matrices or an increased number of Gaussian with diagonal covariance.

For GMM based systems, the feature imputation is compulsory while for DNNs they themselves are capable of handling the missing features. Thus no imputation stage is required. The only focus would thus be how to estimate perfect masks.

A GMM is used to represent clean speech and a minimum mean square error criterion is adopted to obtain suitable estimates of the unreliable features.

Missing data assumption derived mathematically:

Screenshot from 2013-04-23 12:22:00

Both oracle and estimated masks are tested. The oracle mask is obtained by direct comparison between the clean and noisy utterances using a threshold of 7 dB SNR. In this paper, the mask estimation and spectral reconstruction is in the log mel filterbank domain and 23 filterbank channels are used. Acoustic models trained on clean speech are employed in each task. For Aurora 4, bigram LM is used for decoding.

Missing data masks are computed from noise estimates obtained through linear interpolation of initial noise statistics extracted from the beginning and final frames of every utterance.

The improvement using oracle masks is especially noticeable at medium and low SNRs. Thereby, the mismatch due to noise can be effectively reduced with only knowledge of the masking pattern.

Under proper knowledge of the masking pattern, the mismatch introduced by the noise can be significantly palliated just by a suitable exploitation of source correlations.

The proposed reconstruction techniques suffer little degradation with respect to the clean condition.

This poor performance is largely due to the simple noise estimation technique employed, which can not suitable account for non-stationary noise. However, this simple noise estimator has been useful to demonstrate the utility of the proposed temporal modeling for the MD approaches, which is able to improve the ASR performance over other MD systems consistently using the estimated masks as well as the oracle masks.



Accurate marginalization range for missing data recognition

2007-Accurate marginalization range for missing data recognition

The authors proposed a new missing data recognition approach, in which reduced marginalization intervals are computed for each possible mask. The set of all possible masks and intervals is obtained by clustering on a clean and noisy stereo training corpus. The main principle of the proposed approach consists in training accurate marginalization intervals that are as small as possible, in order to improve the precision of marginalization.

The spectral ratio X/Y between clean and noisy speech is computed on a stereo training corpus. This results in a time-frequency representation that provides for every noisy spectral feature the relative contribution of the clean speech energy. This ratio is related to the local SNR as follows:

Screenshot from 2013-04-17 13:57:47


The feature domain, which is also the marginalization domain of missing data, is the 12-bands Mel spectral domain with cube-root compression of the speech power. Temporal derivatives are further added, leading to a 24 dimensional feature vector.


Binary masking and speech intelligibility

2011-Binary masking and speech intelligibility

This thesis mainly discussed various aspects of using binary masking to improve the speech intelligibility. Several sections in the Introduction part give detailed systematic reviews for the binary masking technique:

4. Binary Masking

5. Sparsity of Speech

6. Oracle Masks

7. Application of the Binary Mask

8. Time-Frequency Masking

The noise robustness is formulated as a source separation task, “the cocktail party problem”, that separates the target speech from the interfering noise sounds.

Speech separation enhances the speech and reduces the background noise before transmission.

Human auditory system efficiently identifies and separates the sources prior to recognition at a higher level.

The decreased intelligibility can be compensated either by separating the target speech from the interfering sounds, by enhancement of the target speech, or by reducing the interfering sound.

Speech is robust and redundant which means that part of the speech sound can be lost or modified without negative impact on intelligibility. [Miller, George A., and J. C. R. Licklider. “The intelligibility of interrupted speech.” The Journal of the Acoustical Society of America 22 (1950): 167.][Warren, Richard M. “Perceptual restoration of missing speech sounds.”Science 167.3917 (1970): 392-393.][Howard‐Jones, Paul A., and Stuart Rosen. “Uncomodulated glimpsing in ‘‘checkerboard’’noise.” The Journal of the Acoustical Society of America 93 (1993): 2915.]

In binary masking, sound sources are assigned as either target or interferer in the time-frequency domain. The target sound (speech) is kept by using the value one in the binary mask, whereas the regions with the interferer are removed by using the value zero.

In short, binary masking is a method of applying a binary, frequency-dependent, and time-varying gain in a number of frequency channels, and the binary mask defines what to do when.

Estimation of the binary mask and application of the binary mask to carry out the source separation.

Oracle mask is used for binary masks calculated using a prior knowledge which is not available in most real-life applications. A major objection to the concept of oracle masks is that it is of no use in real-life applications because of the required a priori knowledge. However, the oracle masks establish an upper limit of performance, which makes them useful as references and goals for binary masking algorithms developed for real-life applications such as hearing aids.

The local SNR criterion is the threshold for classifying the time-frequency unit as dominated by the target or interferer sound and this threshold controls the amount of ones in the ideal binary mask.

The target binary mask can be calculated by comparing the target speech directly with the long-term average spectrum of the target speech.

The ideal binary mask requires the interferer to be available and will change depending on the type of interferer, whereas the target binary mask is calculated from the target sound only and therefore is independent of the interferer sound.

The masking is applied on the 64D Gammatone Filterbank on the ERB frequency scale features. 64 frequency channels are enough to achieve high intelligibility[Li, Ning, and Philipos C. Loizou. “Effect of spectral resolution on the intelligibility of ideal binary masked speech.” The Journal of the Acoustical Society of America 123.4 (2008): EL59-EL64.]. (Power or magnitude spectrum?) How it differs from the conventional mel triangular filterbank features?

Another set of features is the magnitude of the equal distance STFT frequencies.

Because the Gammatone filterbank resembles the processing in the human auditory system, it is often used for speech processing and perceptual studies. The STFT can also be used but has the drawback of requiring more frequency channels to obtain the same spectral resolution at low frequencies than the Gammatone filterbank.

The ideal binary mask has been shown to enable increased intelligibility in the ideal situation, whereas the Wiener filter, when tested under realistic conditions, shows an increase in quality while in most situations only preserving intelligibility.