This paper proposal a temporal modeling for missing feature reconstruction using MMSE. It falls into the feature imputation category of the missing feature theory. This paper only focuses on the second stage of the missing feature theory, i.e. when the masks are known (either oracle or estimated), how can the masked features be used for recognition. This kind of technique try to reconstruct the masked unreliable features using the noisy feature and the masked speech feature. The estimation of the noise masks are not explored, the oracle and a simple beginning and ending based noise estimates are tested.
The missing data approach to noise robust speech recognition assumes that the log-spectral features (FBanks) can be either almost unaffected by noise or completely masked by it.
The performance of automatic speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. One source of the mismatch that still remains as a major issue among the ASR research community is additive noise.
Accomplishing noise robustness is a key issue to make these systems deplorable in real world conditions.
Although marginalization performs optimal decoding with missing features, it suffers from two main drawbacks. First, the standard decoding algorithm must be modified to account for missing features. Second, recognition has to be carried out with spectral features. However, it is well known that cepstral features outperforms spectral ones for speech recognition. The acoustic model needs to employ Gaussian mixtures with full covariance matrices or an increased number of Gaussian with diagonal covariance.
For GMM based systems, the feature imputation is compulsory while for DNNs they themselves are capable of handling the missing features. Thus no imputation stage is required. The only focus would thus be how to estimate perfect masks.
A GMM is used to represent clean speech and a minimum mean square error criterion is adopted to obtain suitable estimates of the unreliable features.
Missing data assumption derived mathematically:
Both oracle and estimated masks are tested. The oracle mask is obtained by direct comparison between the clean and noisy utterances using a threshold of 7 dB SNR. In this paper, the mask estimation and spectral reconstruction is in the log mel filterbank domain and 23 filterbank channels are used. Acoustic models trained on clean speech are employed in each task. For Aurora 4, bigram LM is used for decoding.
Missing data masks are computed from noise estimates obtained through linear interpolation of initial noise statistics extracted from the beginning and final frames of every utterance.
The improvement using oracle masks is especially noticeable at medium and low SNRs. Thereby, the mismatch due to noise can be effectively reduced with only knowledge of the masking pattern.
Under proper knowledge of the masking pattern, the mismatch introduced by the noise can be significantly palliated just by a suitable exploitation of source correlations.
The proposed reconstruction techniques suffer little degradation with respect to the clean condition.
This poor performance is largely due to the simple noise estimation technique employed, which can not suitable account for non-stationary noise. However, this simple noise estimator has been useful to demonstrate the utility of the proposed temporal modeling for the MD approaches, which is able to improve the ASR performance over other MD systems consistently using the estimated masks as well as the oracle masks.