Spectral features are particularly sensitive to gender differences. This is illustrated in Figure 2, which was constructed by taking a clean speech utterance and force-aligning it to single-component Gaussian models trained on either male speakers or female speakers. By outputting the mean of each model state in the alignment, it is possible to reconstruct the spectral representation as if the speech had been spoken by either a prototypical male or a prototypical female speaker. This technique shows clearly what the models have learnt about the differences between male and female speech.
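The reconstruction step can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 2: it assumes the forced alignment has already been computed by an external recogniser and is available as one state index per frame, and the names `reconstruct_from_alignment`, `state_means` and `alignment` are hypothetical.

```python
import numpy as np

def reconstruct_from_alignment(state_means, alignment):
    """Return one spectral frame per aligned frame.

    state_means: (n_states, n_channels) mean vector of each
                 single-component Gaussian state (assumed layout).
    alignment:   (n_frames,) index of the state aligned to each frame.
    """
    state_means = np.asarray(state_means, dtype=float)
    alignment = np.asarray(alignment, dtype=int)
    # Each output frame is simply the mean of the state it was aligned to.
    return state_means[alignment]

# Toy example: two states, two frequency channels, three frames.
toy_means = [[1.0, 2.0], [3.0, 4.0]]
toy_alignment = [0, 0, 1]
toy_reconstruction = reconstruct_from_alignment(toy_means, toy_alignment)
```

Running the same function once with the male model means and once with the female model means, each with its own alignment, would give the two reconstructions that are compared.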
Three masking techniques are investigated:
1) Discrete SNR masks: Specifically, the filter bank features are converted into the spectral amplitude domain, the first ten frames are averaged to form a stationary noise estimate, and this estimate is subtracted from the noisy signal to form a clean signal estimate. The ratio of the clean signal estimate to the noise estimate forms the local SNR. Features are labelled “reliable” if they have a local SNR greater than a threshold of 7 dB; otherwise they are labelled “unreliable”. The 7 dB threshold has been empirically shown to be near optimal in previous work using different data and different noise types.
2) Soft SNR masks: The values for the soft masks are generated by compressing the local SNR with a sigmoid function whose parameters were derived empirically.
3) Combined harmonicity and SNR masks: See the paper for details.
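The first two mask types can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the input is assumed to be already in the spectral amplitude domain with shape (frames, channels), the dB conversion uses 20 log10 of the amplitude ratio, a small floor guards against non-positive values after subtraction, and the sigmoid parameters `slope` and `centre_db` are placeholders because the empirically derived values are not given here. All function names are hypothetical.

```python
import numpy as np

def local_snr_db(noisy_amp, n_noise_frames=10, floor=1e-10):
    """Local SNR in dB from (frames, channels) spectral amplitudes."""
    noisy_amp = np.asarray(noisy_amp, dtype=float)
    # Stationary noise estimate: average of the first frames.
    noise = noisy_amp[:n_noise_frames].mean(axis=0)
    # Clean estimate by subtraction, floored so the logarithm stays defined.
    clean = np.maximum(noisy_amp - noise, floor)
    # Assumption: amplitude ratio converted to dB with 20 * log10.
    return 20.0 * np.log10(clean / np.maximum(noise, floor))

def discrete_mask(snr_db, threshold_db=7.0):
    """True where a feature is labelled reliable (local SNR above threshold)."""
    return np.asarray(snr_db) > threshold_db

def soft_mask(snr_db, slope=0.5, centre_db=7.0):
    """Sigmoid compression of the local SNR into (0, 1).

    slope and centre_db are placeholder values, not the paper's parameters.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - centre_db)))

# Toy input: ten noise-only frames followed by one frame containing speech.
toy_amp = np.ones((11, 2))
toy_amp[10] = [11.0, 1.5]
toy_snr = local_snr_db(toy_amp)
toy_discrete = discrete_mask(toy_snr)
toy_soft = soft_mask(toy_snr)
```

In the toy input, the first channel of the last frame has a clean estimate ten times the noise estimate (20 dB) and is labelled reliable, while the second channel falls below the 7 dB threshold and is labelled unreliable. The soft mask assigns the same features values close to 1 and close to 0 respectively instead of a hard decision.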