Monthly Archives: September 2013

Convolutional Neural Networks for speech recognition

[2012 NIPS; ImageNet Classification with Deep Convolutional Neural Networks; Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton]

The model should also have lots of prior knowledge to compensate for all the data we don’t have. A CNN’s capacity can be controlled by varying its depth and breadth, and it also makes strong and mostly correct assumptions about images, namely stationarity of statistics and locality of pixel dependencies. The paper relies on a highly optimized GPU implementation of 2D convolution.

The network has 5 convolutional and 3 fully-connected layers.

The model is trained on the raw RGB pixel values, so the filters are 3D. The output is a 1000-way softmax trained with cross-entropy. The first convolutional layer filters the 224*224*3 input image with 96 kernels of size 11*11*3 with a stride of 4 pixels. The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5*5*48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The 3rd layer has 384 kernels of size 3*3*256, the 4th has 384 kernels of size 3*3*192, and the 5th has 256 kernels of size 3*3*192. The fully-connected layers have 4096 neurons each.
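As a quick sanity check on the kernel sizes listed above, the following sketch counts the weights in each convolutional layer (biases excluded). The kernel depths of 48 and 192 are half of the 96 and 384 feature maps of the preceding layers, which is consistent with the two-GPU split listed below; the layer names are labels chosen here for illustration.

```python
# (name, number of kernels, kernel height, kernel width, kernel depth),
# copied from the layer description in the notes. Names are illustrative.
CONV_LAYERS = [
    ("conv1", 96, 11, 11, 3),
    ("conv2", 256, 5, 5, 48),
    ("conv3", 384, 3, 3, 256),
    ("conv4", 384, 3, 3, 192),
    ("conv5", 256, 3, 3, 192),
]


def kernel_weights(num_kernels, height, width, depth):
    """Number of weights in one convolutional layer, biases excluded."""
    return num_kernels * height * width * depth


weights_per_layer = {
    name: kernel_weights(n, h, w, d) for name, n, h, w, d in CONV_LAYERS
}
total_conv_weights = sum(weights_per_layer.values())

for name, count in weights_per_layer.items():
    print(f"{name}: {count:,} weights")
print(f"total: {total_conv_weights:,} weights")
```

The convolutional layers together hold only a few million weights, so most of the model's parameters sit in the 4096-unit fully-connected layers.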

1. ReLU Nonlinearity rather than sigmoid

2. Parallel training on two GPUs

3. Local response normalization

4. Overlapping pooling

5. Dropout in fully-connected layers
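Two of the items above, ReLU and overlapping pooling, are easy to illustrate in one dimension. This is a minimal sketch with made-up activations: the window of 3 with a stride of 2 is an assumed setting used here for illustration, since these notes do not state the values.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def max_pool_1d(values, window, stride):
    """Max pooling along one axis; windows overlap whenever stride < window."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]


# Made-up pre-activations for illustration.
activations = [relu(x) for x in [-1.0, 2.0, 0.5, -3.0, 4.0, 1.0, -0.5]]

# Traditional pooling: stride equals window, so windows do not overlap
# (the trailing element that does not fill a window is dropped).
non_overlapping = max_pool_1d(activations, window=2, stride=2)

# Overlapping pooling: stride smaller than window (assumed values).
overlapping = max_pool_1d(activations, window=3, stride=2)

print(non_overlapping)
print(overlapping)
```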

[2012 ICASSP; Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition; Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Gerald Penn]

MFCC features are not suitable for a CNN, because the DCT mixes all frequency bands and destroys locality along the frequency axis. The linear spectrum, the Mel-scale spectrum, and filter-bank outputs all preserve this locality and are well suited to local filtering in a CNN. Assume the speech input to the CNN is v, divided into B frequency bands as v = [v1, v2, …, vB], where vb is the feature vector representing band b. The vector vb includes the speech spectral features and the delta and acceleration parameters from local band b of all feature frames within the current context window.
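The band-wise layout v = [v1, …, vB] can be sketched as follows. The nested-list layout `window[t][k][b]` (frame, stream, band) and the synthetic values are assumptions made for illustration; the sizes of 15 frames and 40 bands are the ones used in the experiments noted further down.

```python
def split_into_bands(window):
    """Regroup a context window into one feature vector per frequency band.

    window[t][k][b] is the value for frame t, stream k (0 = static,
    1 = delta, 2 = acceleration) and frequency band b.
    Returns a list of B vectors, each of length frames * streams.
    """
    num_bands = len(window[0][0])
    bands = []
    for b in range(num_bands):
        v_b = [
            window[t][k][b]
            for t in range(len(window))
            for k in range(len(window[t]))
        ]
        bands.append(v_b)
    return bands


# Synthetic window: 15 frames, 3 streams, 40 bands. Each value encodes its
# own position so the regrouping is easy to check by eye.
frames, streams, num_bands = 15, 3, 40
window = [
    [[t * 1000 + k * 100 + b for b in range(num_bands)] for k in range(streams)]
    for t in range(frames)
]

v = split_into_bands(window)
print(len(v), len(v[0]))  # number of bands, dimension of each band vector
```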

The activations of the convolution layer are divided into K bands, where each band contains J filter activations.

The max-pooling layer generates a lower-resolution version of the convolution layer by taking the maximum over every n bands, converting the previous K bands into M bands for each of the J filters (M<K).
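A minimal sketch of this pooling step, assuming the convolution activations are stored as K bands of J values each. The function name, the `shift` parameter (the sub-sampling factor) and the example numbers are chosen here for illustration.

```python
def max_pool_bands(conv, pool_size, shift):
    """Max-pool K convolution bands into M pooled bands, per filter.

    conv[k][j] is the activation of filter j in convolution band k.
    Each pooled band takes, for every filter, the maximum over pool_size
    adjacent convolution bands; consecutive groups start shift bands apart.
    """
    num_bands = len(conv)
    num_filters = len(conv[0])
    pooled = []
    for start in range(0, num_bands - pool_size + 1, shift):
        group = conv[start:start + pool_size]
        pooled.append([max(band[j] for band in group) for j in range(num_filters)])
    return pooled


# K = 6 bands, J = 2 filters, pooling every n = 3 bands.
conv = [[1, 9], [4, 2], [3, 3], [8, 1], [0, 5], [2, 7]]
pooled = max_pool_bands(conv, pool_size=3, shift=3)
print(pooled)  # M = 2 bands of J = 2 values each
```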

Weight sharing: in the full weight sharing used for images, the local filter weights are tied and shared across all positions of the input space. Here, unlike the local filters used for images, the weights are shared only inside each pooling group (limited weight sharing). As a result, the convolution layer is divided into a number of convolution sections, where all convolution bands in a section are pooled together into one pooling-layer band and are computed by convolving the section's filters with a small number of input-layer bands. Each convolution section has J filters.

One disadvantage of this weight sharing is that further pooling layers cannot be added on top of it, because the filter outputs in different pooling bands are not related. Therefore, this type of weight sharing is normally used only in the topmost pooling layer.
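The limited weight sharing scheme can be sketched as below. This is an illustrative reading of the description, not the paper's implementation: biases are omitted, a sigmoid activation is assumed, and the sizes and random weights are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lws_conv_pool(bands, section_weights, filter_size, pool_size, shift):
    """Limited-weight-sharing convolution followed by max pooling.

    bands: list of B input band vectors (lists of floats).
    section_weights: one weight matrix per section; row j of a matrix holds
        filter j's weights over filter_size concatenated input bands.
    Returns one pooled band (a list of J values) per section.
    """
    pooled = []
    for m, weights in enumerate(section_weights):
        first = m * shift  # first convolution band of this section
        section_acts = []
        for k in range(first, first + pool_size):
            # Convolution band k sees input bands k .. k + filter_size - 1.
            patch = [x for band in bands[k:k + filter_size] for x in band]
            section_acts.append(
                [sigmoid(sum(w * x for w, x in zip(row, patch))) for row in weights]
            )
        # The section's filters are shared only across its own bands,
        # which are then pooled into a single pooling-layer band.
        pooled.append([max(col) for col in zip(*section_acts)])
    return pooled


rng = random.Random(0)
B, D, J = 10, 3, 2  # hypothetical: input bands, band dimension, filters
filter_size, pool_size, shift = 4, 3, 2
bands = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(B)]
num_conv_bands = B - filter_size + 1
num_sections = (num_conv_bands - pool_size) // shift + 1
section_weights = [
    [[rng.uniform(-0.5, 0.5) for _ in range(filter_size * D)] for _ in range(J)]
    for _ in range(num_sections)
]
pooled = lws_conv_pool(bands, section_weights, filter_size, pool_size, shift)
print(len(pooled), len(pooled[0]))  # sections, filters per section
```

Because each section has its own filters, the j-th value of one pooled band and the j-th value of the next come from different filters, which is why a second convolution over the pooled bands is not meaningful.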

During training, the CNN is estimated with the standard back-propagation algorithm to minimize the cross entropy between the targets and the output layer activations. For a max-pooling layer, the error signal is back-propagated only to the convolution layer node that generates the maximum activation within the pooled nodes.
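The routing of the error signal through a max-pooling layer can be sketched for a single filter as follows. The function name and the example values are illustrative.

```python
def max_pool_backward(conv_acts, pooled_errors, pool_size, shift):
    """Route each pooled error back to the node that produced the maximum.

    conv_acts: K convolution activations for one filter.
    pooled_errors: M error signals, one per pooled band.
    Returns K error signals; nodes that were not the maximum of any
    pooling group receive zero.
    """
    conv_errors = [0.0] * len(conv_acts)
    for m, err in enumerate(pooled_errors):
        start = m * shift
        group = conv_acts[start:start + pool_size]
        winner = start + group.index(max(group))
        conv_errors[winner] += err
    return conv_errors


conv_acts = [0.2, 0.9, 0.4, 0.1, 0.3, 0.8]
pooled_errors = [0.5, -1.0]
conv_errors = max_pool_backward(conv_acts, pooled_errors, pool_size=3, shift=3)
print(conv_errors)
```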


Features: 40D Mel FBanks + energy, with 1st and 2nd derivatives, normalized with global CMVN, and a 15-frame context window (w15). The input of the CNN is divided into 40 bands, each of which holds one of the 40 FBank channels across the 15-frame context window. Energy is duplicated for all the bands. Input padding is used.

The CNN is composed of a convolution layer with limited weight sharing and a filter size of 8 bands, a max-pooling layer with a sub-sampling factor of 2, one top fully-connected 1000D hidden layer, and one softmax output layer. It is compared with a DNN with two 1000D hidden layers.

Deep CNN: a convolution layer, a max-pooling layer and two fully-connected 1000D hidden layers, with 84 filters, a filter size of 8 bands, and a pooling size of 6 with limited weight sharing and a sub-sampling factor of 2. It reaches 20.07% WER vs. 20.50% obtained from a DNN with three 1000D hidden layers.
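Global CMVN (mean and variance normalization with statistics computed over the whole data set, rather than per speaker or per utterance) can be sketched as below. This is a minimal illustration on toy 2-D frames; it assumes every feature dimension has non-zero variance.

```python
def global_cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of feature vectors pooled over the whole data set.
    Returns (normalized frames, per-dimension means, per-dimension stds).
    Assumes no dimension is constant (non-zero standard deviation).
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        (sum((f[d] - means[d]) ** 2 for f in frames) / n) ** 0.5
        for d in range(dim)
    ]
    normalized = [
        [(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames
    ]
    return normalized, means, stds


# Toy data: two frames with two feature dimensions.
normalized, means, stds = global_cmvn([[1.0, 10.0], [3.0, 30.0]])
print(means, stds)
print(normalized)
```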

[2013 ICASSP; A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion; Li Deng, Ossama Abdel-Hamid, Dong Yu]

Motivation: a larger pooling size enforces a greater degree of invariance to frequency shift, but it also carries a greater risk of being unable to distinguish among different speech sounds with similar formant frequencies. When a fixed pooling size increases from 1 to 12, increasing confusion is observed among the phones whose major formant frequencies are close to each other.

A natural remedy is to apply different, or heterogeneous, pooling sizes to various subsets of the full feature maps. The paper illustrates this with a Heterogeneous-Pooling (HP) CNN that uses two sets of pooling sizes, P1 (of value 2) and P2 (of value 3).

The optimal choice of the pooling sizes is determined by the convolution filter design and, more importantly, by the nature of the phonetic space expressed in scaled frequency in accordance with the input FBank features.
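Heterogeneous pooling with P1 = 2 and P2 = 3 can be sketched as follows. Non-overlapping pooling (shift equal to the pooling size) and the example activations are assumptions made here for illustration; the notes do not specify the shift.

```python
def max_pool(values, pool_size):
    """Non-overlapping max pooling along frequency for one feature map."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


def heterogeneous_pool(feature_maps, pool_sizes):
    """Pool each feature map with its own pooling size."""
    return [max_pool(fm, p) for fm, p in zip(feature_maps, pool_sizes)]


# Two feature maps over 6 frequency bands; the first subset uses P1 = 2,
# the second uses P2 = 3.
feature_maps = [
    [1, 5, 2, 2, 7, 3],
    [4, 0, 6, 1, 9, 2],
]
pooled = heterogeneous_pool(feature_maps, pool_sizes=[2, 3])
print(pooled)
```

The map pooled with the larger size keeps fewer, more shift-tolerant outputs, while the map pooled with the smaller size keeps finer frequency resolution.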

Dropout is applied to both the convolution hidden layer and fully-connected hidden layer.

[2013 ICASSP; Deep convolutional neural networks for LVCSR; Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran]

This paper determines the appropriate architecture to make CNNs effective for LVCSR tasks: specifically, how many convolutional layers to use, the optimal number of hidden units, the best pooling strategy, and the best input features. It is evaluated on a 400-hr Broadcast News task and a 300-hr Switchboard task.

The limited weight sharing prohibits adding more convolution layers. In this paper, the authors argue that they can apply weight sharing across all time and frequency components, by using a large number of hidden units in the convolutional layers to capture the differences between low and high frequency components.

[2013 IS; Exploring convolutional neural network structures and optimization techniques for speech recognition; Ossama Abdel-Hamid, Li Deng, Dong Yu]

1. comparison between full and limited weight sharing

2. convolution in time domain

3. weighted softmax pooling: replacing the max operator with a weighted softmax with learnable weights. The weights have similar effect to modifying pooling size.

4. Convolutional RBM based pre-training
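These notes do not give the formula for weighted softmax pooling, so the following is only one possible form, not necessarily the paper's: each activation in the pooling group is averaged with a coefficient proportional to its weight times the exponential of the activation. The function name and the example values are hypothetical.

```python
import math


def weighted_softmax_pool(acts, weights):
    """Softmax-weighted average of a pooling group (assumed formulation).

    Coefficient i is proportional to weights[i] * exp(acts[i]), with
    non-negative weights that would be learned in a real model.
    """
    m = max(acts)  # subtract the maximum for numerical stability
    scores = [w * math.exp(a - m) for a, w in zip(acts, weights)]
    total = sum(scores)
    return sum(s / total * a for s, a in zip(scores, acts))


acts = [1.0, 3.0, 2.0]
uniform = weighted_softmax_pool(acts, [1.0, 1.0, 1.0])
# A zero weight removes a position from the group, which acts like a
# smaller pooling size.
dropped = weighted_softmax_pool(acts, [1.0, 0.0, 1.0])
single = weighted_softmax_pool(acts, [1.0, 0.0, 0.0])
print(uniform, dropped, single)
```

With uniform weights the result lies between the mean and the maximum of the group; driving some weights toward zero shrinks the effective pooling group, which matches the remark that the weights have an effect similar to modifying the pooling size.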

Limited weight sharing provides more gain, but full weight sharing is required to use multiple convolution layers. Time-domain convolution does not improve the performance. The gain from weighted softmax pooling is small but promising. Pre-training gives relatively small gains for CNNs compared with DNNs.


[2013 ICASSP; Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code; Ossama Abdel-Hamid, Hui Jiang]

A joint training procedure learns a generic adaptation NN from the whole training set together with many small speaker codes, each of which is estimated for one speaker using only data from that speaker.

Training parameters: the original NN weights, the adaptation weights and the training speakers' codes.

Training method: standard back-propagation with the cross-entropy objective function. Adaptation weights and speaker codes are randomly initialized.

Testing: supervised adaptation, in which only the speaker code is learnt using back-propagation.
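The test-time step, where only the speaker code is updated and all weights stay frozen, can be sketched with a toy network. This is a heavily simplified stand-in, not the paper's architecture: one sigmoid layer takes both the features and the speaker code, followed by a softmax output. All sizes, the random weights, the zero initial code and the learning rate are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, code, W_x, W_s, V):
    """Hidden layer sees features and speaker code; output is a softmax."""
    pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_s, code))]
    hidden = [sigmoid(p) for p in pre]
    return hidden, softmax(matvec(V, hidden))


def mean_cross_entropy(data, code, W_x, W_s, V):
    return sum(-math.log(forward(x, code, W_x, W_s, V)[1][t]) for x, t in data) / len(data)


def adapt_code(data, code, W_x, W_s, V, lr=0.1, steps=100):
    """Batch gradient descent on the speaker code only; weights are frozen."""
    code = list(code)
    for _ in range(steps):
        grad = [0.0] * len(code)
        for x, target in data:
            hidden, probs = forward(x, code, W_x, W_s, V)
            # Cross-entropy gradient w.r.t. the logits: probs - one_hot(target).
            d_logits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
            d_hidden = [sum(V[i][h] * d_logits[i] for i in range(len(V)))
                        for h in range(len(hidden))]
            d_pre = [dh * h * (1.0 - h) for dh, h in zip(d_hidden, hidden)]
            for c in range(len(code)):
                grad[c] += sum(W_s[h][c] * d_pre[h] for h in range(len(hidden))) / len(data)
        code = [c - lr * g for c, g in zip(code, grad)]
    return code


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4-D features, 3-D speaker code, 5 hidden units, 2 classes.
W_x, W_s, V = rand_matrix(5, 4), rand_matrix(5, 3), rand_matrix(2, 5)
data = [([rng.uniform(-1, 1) for _ in range(4)], t) for t in (0, 0, 0, 1)]

initial_code = [0.0, 0.0, 0.0]
loss_before = mean_cross_entropy(data, initial_code, W_x, W_s, V)
code = adapt_code(data, initial_code, W_x, W_s, V)
loss_after = mean_cross_entropy(data, code, W_x, W_s, V)
print(loss_before, loss_after)
```

Because only a small code vector is estimated per speaker, very little adaptation data is needed compared with re-training the network weights.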

Experiments: 40D Mel FBanks + energy and 1st and 2nd temporal derivatives. Global CMVN normalization. Bigram LM. 15 frames input window.

Testing is conducted for each speaker using cross validation. In each run, n utterances of a specific speaker are used for supervised adaptation and the remaining 8-n are used for testing. There are 8 runs per speaker in total, and the overall averaged performance is reported.

The adaptation NN has two 1000D sigmoid hidden layers and a linear output layer. The speaker code is 50D.

dummy: no speaker codes; 0: speaker codes are all 0s; oracle: same data for adaptation and testing.