Convolutional Neural Network for speech recognition

[2012 NIPS; ImageNet Classification with Deep Convolutional Neural Networks; Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton]

The model should have lots of prior knowledge to compensate for all the data we don’t have. A CNN’s capacity can be controlled by varying its depth and breadth, and CNNs also make strong and mostly correct assumptions about the nature of images, namely stationarity of statistics and locality of pixel dependencies. The paper also relies on a highly optimized GPU implementation of 2D convolution.

5 convolutional and 3 fully-connected layers.

The model is trained on the raw RGB pixel values, so the filters are 3D. The output is a 1000-way softmax trained with a cross-entropy objective. The first convolutional layer filters the 224*224*3 input image with 96 kernels of size 11*11*3 with a stride of 4 pixels. The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5*5*48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The 3rd layer has 384 kernels of size 3*3*256, the 4th has 384 kernels of size 3*3*192, and the 5th has 256 kernels of size 3*3*192. The kernel depths of 48 and 192 are half of the preceding layer’s kernel count because the kernels are split across two GPUs, and the 2nd, 4th, and 5th layers connect only to the kernel maps on the same GPU; the 3rd layer connects to all kernel maps of the 2nd layer. The fully-connected layers have 4096 neurons each.
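As a rough sanity check on these sizes, the sketch below traces the spatial size of each convolutional layer’s output and counts its weights (biases excluded). The kernel sizes, strides, kernel counts, and the 3*3 max-pooling with stride 2 come from the paper; the padding values are an assumption borrowed from common reimplementations, since the notes above do not state them. Without padding, (224 - 11) / 4 + 1 is not an integer, which is why some padding (or a 227*227 input) is usually assumed.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, num_kernels, kernel_depth)
# The pad column is an assumption; the other values are from the notes above.
CONV_LAYERS = [
    ("conv1", 11, 4, 2, 96, 3),
    ("conv2", 5, 1, 2, 256, 48),
    ("conv3", 3, 1, 1, 384, 256),
    ("conv4", 3, 1, 1, 384, 192),
    ("conv5", 3, 1, 1, 256, 192),
]
# Overlapping 3*3 max-pooling with stride 2 follows these layers.
POOLED_AFTER = {"conv1", "conv2", "conv5"}

def trace(input_size=224):
    """Return (name, conv output size, size after pooling, weight count) per layer."""
    rows = []
    size = input_size
    for name, k, s, p, n, depth in CONV_LAYERS:
        size = conv_out(size, k, s, p)
        conv_size = size
        if name in POOLED_AFTER:
            size = conv_out(size, 3, 2)
        rows.append((name, conv_size, size, n * k * k * depth))
    return rows

for row in trace():
    print(row)
```

Under these assumptions the last pooled map is 6*6 with 256 kernels, so the first fully-connected layer would see 6 * 6 * 256 = 9216 inputs.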

1. ReLU Nonlinearity rather than sigmoid

2. Parallel training across two GPUs

3. Local response normalization

4. Overlapping pooling

5. Dropout in fully-connected layers
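Two of the items above are easy to state in code. The sketch below applies ReLU and then local response normalization to the activations of all kernel maps at a single spatial position. The constants k = 2, n = 5, alpha = 1e-4, and beta = 0.75 are the values reported in the paper; the function names and the one-position-at-a-time layout are assumptions made to keep the example short.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def local_response_norm(acts, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each kernel map's activation by its n neighbouring maps.

    acts: ReLU outputs at one (x, y) position, one value per kernel map.
    """
    num_maps = len(acts)
    half = n // 2
    out = []
    for i, a in enumerate(acts):
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)
        # Sum of squared activations over the adjacent kernel maps.
        energy = sum(acts[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * energy) ** beta)
    return out

pre_activations = [-1.0, 0.5, 3.0, 20.0, 7.0, -2.0]
acts = [relu(x) for x in pre_activations]
normed = local_response_norm(acts)
print(normed)
```

Large activations in neighbouring maps increase the denominator, so a unit is damped when its neighbours fire strongly.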

[2012 ICASSP; Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition; Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Gerald Penn]

MFCC features are not suitable, because the DCT-based decorrelation does not preserve locality along the frequency axis. Linear spectrum, Mel-scale spectrum, and filter-bank features all preserve that locality and are therefore well suited to local filtering in a CNN. Assume the speech input to the CNN is v, divided into B frequency bands as v = [v1, v2, …, vB], where vb is the feature vector representing band b. The feature vector vb includes the speech spectral features and their delta and acceleration parameters from local band b, for all feature frames within the current context window.

Activations of the convolution layer are divided into K bands where each band contains J filter activations.

The max-pooling layer generates a lower-resolution version of the convolution layer by taking the maximum over every n bands, converting the previous K bands to M bands for each of the J filters (M<K).
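A minimal sketch of this pooling step, assuming the convolution layer is stored as K bands of J activations each. The names `pool_size` (number of bands pooled together) and `shift` (the sub-sampling factor) are illustrative and not taken from the paper.

```python
def max_pool_bands(conv_bands, pool_size, shift):
    """Pool K convolution bands into M bands, independently for each filter.

    conv_bands: list of K bands, each a list of J filter activations.
    Returns a list of M bands, each a list of J pooled activations.
    """
    num_bands = len(conv_bands)
    num_filters = len(conv_bands[0])
    pooled = []
    for start in range(0, num_bands - pool_size + 1, shift):
        group = conv_bands[start:start + pool_size]
        # For each filter j, keep the largest activation within the group.
        pooled.append([max(band[j] for band in group) for j in range(num_filters)])
    return pooled

# K = 6 bands, J = 2 filters, pooled two bands at a time.
conv = [[1, 0], [3, 2], [0, 5], [4, 4], [2, 9], [7, 1]]
print(max_pool_bands(conv, pool_size=2, shift=2))
```

With `shift` smaller than `pool_size` the pooling groups overlap, which gives more pooled bands for the same K.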

Weight sharing: in standard (full) weight sharing, the local filter weights are tied and shared across all positions in the whole input space. Unlike the local filters used for images, here the weights are shared only within a pooling group (limited weight sharing). As a result, the convolution layer is divided into a number of convolution sections, where all convolution bands in each section are pooled together into one pooling-layer band and are computed by convolving the section’s filters with a small number of input-layer bands. Each convolution section has J filters.

One disadvantage of this weight sharing is that further pooling layers cannot be added on top of it, because the filter outputs in different pooling bands are not related. Therefore, this type of weight sharing is normally used only in the topmost pooling layer.
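The section layout can be sketched as follows. Each pooling band gets its own convolution section with its own J filters; the convolution bands inside a section reuse those filters on input windows shifted by one band, and are then max-pooled into a single band. The function names, the flat weight layout, and the omission of the nonlinearity are assumptions for illustration, and the section count ignores any input padding.

```python
def num_sections(num_bands, filter_size, pool_size, shift):
    """Sections that fit when each one spans filter_size + pool_size - 1 input bands."""
    span = filter_size + pool_size - 1
    return (num_bands - span) // shift + 1

def lws_forward(bands, sections, filter_size, pool_size, shift):
    """Limited-weight-sharing convolution followed by max-pooling.

    bands: list of B input band vectors.
    sections: one (weights, biases) pair per section; weights[j] is a flat
              list of filter_size * band_dim weights for filter j.
    Returns one pooled band (J values) per section.
    """
    pooled = []
    for m, (weights, biases) in enumerate(sections):
        first = m * shift
        best = [float("-inf")] * len(weights)
        for offset in range(pool_size):
            lo = first + offset
            # Concatenate the input bands seen by this convolution band.
            window = [x for band in bands[lo:lo + filter_size] for x in band]
            for j, (w, b) in enumerate(zip(weights, biases)):
                act = sum(wi * xi for wi, xi in zip(w, window)) + b
                best[j] = max(best[j], act)  # max-pool within the section only
        pooled.append(best)
    return pooled

# 40 input bands, filter size 8, pooling size 6, shift 2, no padding.
print(num_sections(40, 8, 6, 2))

toy_bands = [[1], [2], [3], [4], [5], [6]]
toy_sections = [([[1, 1]], [0]), ([[1, -1]], [0])]  # different filters per section
print(lws_forward(toy_bands, toy_sections, filter_size=2, pool_size=2, shift=2))
```

Because each section has its own filters, the j-th output of one pooled band and the j-th output of the next come from unrelated filters, which is why another convolution layer on top is not meaningful.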

In the training stage, the CNN is estimated using the standard back-propagation algorithm to minimize the cross entropy between the targets and the output layer activations. For a max-pooling layer, the error signal is back-propagated only to the convolution layer node that generates the maximum activation within the pooled nodes.
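The routing of the error signal through max-pooling can be sketched as below, for one filter across K convolution bands. The function name and argument layout are assumptions; ties are broken in favour of the lowest band index, which is a choice of this sketch.

```python
def max_pool_backward(conv_acts, pooled_grads, pool_size, shift):
    """Send each pooled unit's error to the convolution node that won the max.

    conv_acts: K activations of one filter across convolution bands.
    pooled_grads: M error signals arriving at the pooled bands.
    Returns K error signals for the convolution bands.
    """
    grads = [0.0] * len(conv_acts)
    for m, g in enumerate(pooled_grads):
        start = m * shift
        group = conv_acts[start:start + pool_size]
        winner = start + group.index(max(group))
        grads[winner] += g  # all other nodes in the group receive nothing
    return grads

acts = [0.1, 0.9, 0.4, 0.3, 0.8, 0.2]
print(max_pool_backward(acts, [1.0, -0.5, 2.0], pool_size=2, shift=2))
```

The `+=` matters when pooling groups overlap, since one convolution node can then be the maximum of more than one group.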


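Experimental setup: 40-dimensional Mel filter-bank (FBank) features plus energy, with their 1st and 2nd derivatives; global mean and variance normalization (CMVN); a context window of 15 frames (w15). The CNN input is divided into 40 bands, each containing one of the 40 FBank coefficients across the 15-frame context window. Energy is duplicated for all the bands. Input padding is used.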
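One way to picture this input layout is the sketch below, which turns a context window of frames into one vector per frequency band. The per-frame layout (three streams for static, delta, and acceleration, each holding 40 FBank values followed by one energy value) and the ordering inside each band vector are assumptions, so the resulting vector length of 90 follows from this layout rather than from the paper.

```python
def split_into_bands(window, num_bands=40):
    """Build one feature vector per frequency band from a context window.

    window: list of frames; each frame is a list of three streams
            (static, delta, acceleration), and each stream is a list of
            num_bands FBank values followed by one energy value.
    """
    bands = []
    for b in range(num_bands):
        vec = []
        for frame in window:
            for stream in frame:
                vec.append(stream[b])   # this band's FBank value
                vec.append(stream[-1])  # energy, duplicated into every band
        bands.append(vec)
    return bands

# Dummy 15-frame window: every value in frame t is set to t.
window = [[[float(t)] * 41 for _ in range(3)] for t in range(15)]
bands = split_into_bands(window)
print(len(bands), len(bands[0]))
```

Each of the 40 band vectors then plays the role of vb in the notation above, and the convolution filters slide along the band index only.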

The CNN is composed of a convolution layer with limited weight sharing and a filter size of 8 bands, a max-pooling layer with a sub-sampling factor of 2, one top fully-connected 1000D hidden layer, and one softmax output layer. It is compared with a DNN with two 1000D hidden layers.

Deep CNN: a convolution layer, a max-pooling layer, and two fully-connected 1000D hidden layers, with 84 filters, a filter size of 8 bands, a pooling size of 6 with limited weight sharing, and a sub-sampling factor of 2. It reaches 20.07% PER (phone error rate; the paper evaluates on TIMIT) vs. 20.50% for a DNN with three 1000D hidden layers.

[2013 ICASSP; A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion; Li Deng, Ossama Abdel-Hamid, Dong Yu]

Motivation: a larger pooling size enforces a greater degree of invariance to frequency shift, but it also carries a greater risk of being unable to distinguish among different speech sounds with similar formant frequencies. When a fixed pooling size increases from 1 to 12, increasing confusion is observed among phones whose major formant frequencies are close to each other.

A natural solution is to apply different (heterogeneous) pooling sizes to different subsets of the full set of feature maps. For example, a Heterogeneous-Pooling (HP) CNN can use two pooling sizes, P1 (of value 2) and P2 (of value 3), each applied to its own subset of the feature maps.

The optimal choice of the pooling sizes is determined by the convolution filter design and, more importantly, by the nature of the phonetic space expressed in scaled frequency in accordance with the input FBank features.
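The idea of mixing pooling sizes across feature maps can be sketched as below. Splitting the maps into equal contiguous groups and using non-overlapping pooling windows are assumptions of this sketch, not details taken from the paper; the pooling sizes 2 and 3 match the P1 and P2 example above.

```python
def pool_band_axis(acts, pool_size):
    """Non-overlapping max-pooling of one feature map along the band axis."""
    last_start = len(acts) - pool_size
    return [max(acts[i:i + pool_size]) for i in range(0, last_start + 1, pool_size)]

def heterogeneous_pool(feature_maps, pool_sizes):
    """Apply a different pooling size to each contiguous group of feature maps.

    feature_maps: list of maps, each a list of activations over bands.
    pool_sizes: one pooling size per group, e.g. (2, 3).
    """
    group_len = max(1, len(feature_maps) // len(pool_sizes))
    pooled = []
    for i, fmap in enumerate(feature_maps):
        group = min(i // group_len, len(pool_sizes) - 1)
        pooled.append(pool_band_axis(fmap, pool_sizes[group]))
    return pooled

maps = [[1, 5, 2, 4, 3, 6] for _ in range(4)]
for row in heterogeneous_pool(maps, pool_sizes=(2, 3)):
    print(row)
```

The first group keeps finer frequency resolution (three pooled bands per map here), while the second group is more shift-invariant (two pooled bands), and the layers above see both.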

Dropout is applied to both the convolution hidden layer and fully-connected hidden layer.

[2013 ICASSP; Deep convolutional neural networks for LVCSR; Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran]

This paper determines the appropriate architecture to make CNNs effective for LVCSR tasks: specifically, how many convolutional layers are needed, the optimal number of hidden units, the best pooling strategy, and the best input features. The models are evaluated on a 400-hr Broadcast News task and a 300-hr Switchboard task.

The limited weight sharing prohibits adding more convolution layers. In this paper, the authors argue that they can apply weight sharing across all time and frequency components, by using a large number of hidden units in the convolutional layers to capture the differences between low and high frequency components.

[2013 IS; Exploring convolutional neural network structures and optimization techniques for speech recognition; Ossama Abdel-Hamid, Li Deng, Dong Yu]

1. comparison between full and limited weight sharing

2. convolution in time domain

3. weighted softmax pooling: replacing the max operator with a weighted softmax with learnable weights. The weights have similar effect to modifying pooling size.

4. Convolutional RBM based pre-training

Limited weight sharing provides more gain, but full weight sharing is required to use multiple convolution layers. Time-domain convolution does not improve performance. The gain from weighted softmax pooling is small but promising. Pre-training yields relatively small gains for CNNs compared with DNNs.


One thought on “Convolutional Neural Network for speech recognition”

  1. Kid April 9, 2017 at 9:30 AM

    Thanks for the post.

    I am currently in a situation in which I have to try these things out.
    I am not quite sure I understand the concept of limited weight sharing, and what kind of network structure they are seeking.

    Regarding – limited sharing:
    I understand the concept that an interesting pattern may not be equally likely to appear anywhere in an image, and therefore a limited weight sharing scheme is necessary so that the search area is limited to that range.

    I am using 78D mel FBanks + delta + delta-delta with 72 frames as the context window, which is not quite the same as what the paper suggests, but I don’t see any harm in using more frequency bands.

    What I don’t understand is what the input to the network is. Does the network have (a) 78 input channels, one per row, or (b) 72 channels, one per column?

    if a:
    does each input then have its own convolutional and pooling?
    and how many filters?

    if b:
    how is the limited weight sharing applied?

    Regarding the network structure:

    I am not sure I understand how this network can perform as well as it does with only one pair (convolution + pooling), or does it have many pairs? If so, how many? It doesn’t make sense to me that only one pair should work that well, especially as the section may contain multiple phonemes it has to detect.
