This thesis mainly discusses various aspects of using binary masking to improve speech intelligibility. Several sections of its Introduction give detailed, systematic reviews of the binary masking technique:
4. Binary Masking
5. Sparsity of Speech
6. Oracle Masks
7. Application of the Binary Mask
8. Time-Frequency Masking
Noise robustness is formulated as a source separation task, the “cocktail party problem”, in which the target speech is separated from the interfering noise.
Speech separation enhances the speech and reduces the background noise before transmission.
The human auditory system efficiently identifies and separates the sources prior to recognition at a higher level.
The decreased intelligibility can be compensated either by separating the target speech from the interfering sounds, by enhancement of the target speech, or by reducing the interfering sound.
Speech is robust and redundant, which means that parts of the speech sound can be lost or modified without a negative impact on intelligibility. [Miller, George A., and J. C. R. Licklider. “The intelligibility of interrupted speech.” The Journal of the Acoustical Society of America 22 (1950): 167.] [Warren, Richard M. “Perceptual restoration of missing speech sounds.” Science 167.3917 (1970): 392-393.] [Howard‐Jones, Paul A., and Stuart Rosen. “Uncomodulated glimpsing in ‘checkerboard’ noise.” The Journal of the Acoustical Society of America 93 (1993): 2915.]
In binary masking, sound sources are assigned as either target or interferer in the time-frequency domain. The target sound (speech) is kept by using the value one in the binary mask, whereas the regions with the interferer are removed by using the value zero.
In short, binary masking is a method of applying a binary, frequency-dependent, and time-varying gain in a number of frequency channels, and the binary mask defines what to do when.
Binary masking therefore involves two steps: estimating the binary mask, and applying the binary mask to carry out the source separation.
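The application step can be illustrated with a minimal sketch, assuming the time-frequency representation is stored as a list of frames, each holding one value per frequency channel. The function name `apply_binary_mask` and the toy numbers are hypothetical and are not taken from the thesis.

```python
def apply_binary_mask(tf_units, mask):
    """Multiply every time-frequency unit by its binary gain (0 or 1).

    tf_units: list of frames, each a list of per-channel values.
    mask:     same shape as tf_units, with entries 0 or 1.
    """
    if len(tf_units) != len(mask):
        raise ValueError("signal and mask must have the same number of frames")
    masked = []
    for frame, gains in zip(tf_units, mask):
        if len(frame) != len(gains):
            raise ValueError("signal and mask must have the same number of channels")
        # A gain of 1 keeps the unit, a gain of 0 removes it.
        masked.append([value * gain for value, gain in zip(frame, gains)])
    return masked


# Toy example: 2 frames x 3 channels (made-up values).
mixture = [[0.9, 0.2, 0.5],
           [0.1, 0.8, 0.4]]
mask = [[1, 0, 1],
        [0, 1, 0]]
masked = apply_binary_mask(mixture, mask)
```

In a real system the masked representation would then be passed through the synthesis stage of the filterbank to obtain a waveform; that step is omitted here.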
The term oracle mask is used for binary masks calculated using a priori knowledge that is not available in most real-life applications. A major objection to the concept of oracle masks is that they are of no use in real-life applications because of the required a priori knowledge. However, oracle masks establish an upper limit of performance, which makes them useful as references and goals for binary masking algorithms developed for real-life applications such as hearing aids.
The local SNR criterion is the threshold for classifying a time-frequency unit as dominated by the target or by the interferer sound, and this threshold controls the number of ones in the ideal binary mask.
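As a minimal sketch of this rule, assuming the energies of the target and the interferer are available separately for every time-frequency unit (the oracle setting), the mask can be computed by comparing the local SNR in dB with the criterion. The name `ideal_binary_mask`, the parameter `lc_db`, and the toy energies are hypothetical.

```python
import math


def ideal_binary_mask(target_energy, interferer_energy, lc_db=0.0):
    """Return 1 where the local SNR (dB) exceeds lc_db, otherwise 0.

    Both inputs are lists of frames, each a list of per-channel energies.
    """
    mask = []
    for target_frame, interferer_frame in zip(target_energy, interferer_energy):
        row = []
        for t, n in zip(target_frame, interferer_frame):
            if t <= 0.0:
                row.append(0)  # no target energy: treat as interferer-dominated
            elif n <= 0.0:
                row.append(1)  # no interferer energy: treat as target-dominated
            else:
                local_snr_db = 10.0 * math.log10(t / n)
                row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask


# Toy energies: 2 frames x 3 channels (made-up values).
target = [[4.0, 1.0, 0.1],
          [0.5, 2.0, 1.0]]
interferer = [[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]]

mask_0db = ideal_binary_mask(target, interferer, lc_db=0.0)
mask_minus6db = ideal_binary_mask(target, interferer, lc_db=-6.0)
ones_0db = sum(map(sum, mask_0db))
ones_minus6db = sum(map(sum, mask_minus6db))
```

With these toy values, lowering the criterion from 0 dB to -6 dB raises the number of ones in the mask, which is the effect of the threshold described above.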
The target binary mask can be calculated by comparing the target speech directly with the long-term average spectrum of the target speech.
The ideal binary mask requires the interferer to be available and will change depending on the type of interferer, whereas the target binary mask is calculated from the target sound only and therefore is independent of the interferer sound.
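The independence from the interferer can be seen in a minimal sketch of the target binary mask, assuming the long-term average spectrum is estimated as the per-channel mean energy of the target over time and that the comparison uses a threshold in dB. Both choices, along with the names `long_term_average_spectrum`, `target_binary_mask`, and `lc_db`, are assumptions for illustration and not the procedure from the thesis.

```python
import math


def long_term_average_spectrum(target_energy):
    """Per-channel mean energy of the target over all frames."""
    n_frames = len(target_energy)
    n_channels = len(target_energy[0])
    return [sum(frame[c] for frame in target_energy) / n_frames
            for c in range(n_channels)]


def target_binary_mask(target_energy, lc_db=0.0):
    """Return 1 where the target exceeds its own long-term average by lc_db."""
    ltas = long_term_average_spectrum(target_energy)
    mask = []
    for frame in target_energy:
        row = []
        for t, reference in zip(frame, ltas):
            if t <= 0.0 or reference <= 0.0:
                row.append(0)  # silent unit or silent channel
            else:
                ratio_db = 10.0 * math.log10(t / reference)
                row.append(1 if ratio_db > lc_db else 0)
        mask.append(row)
    return mask


# Toy target energies: 3 frames x 2 channels (made-up values).
target = [[4.0, 1.0],
          [1.0, 1.0],
          [1.0, 4.0]]
ltas = long_term_average_spectrum(target)
tbm = target_binary_mask(target, lc_db=0.0)
```

No interferer signal appears anywhere in the function signature, so the same mask results regardless of the type of noise later mixed with the target.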
The masking is applied to features from a 64-channel Gammatone filterbank spaced on the ERB frequency scale. 64 frequency channels are enough to achieve high intelligibility [Li, Ning, and Philipos C. Loizou. “Effect of spectral resolution on the intelligibility of ideal binary masked speech.” The Journal of the Acoustical Society of America 123.4 (2008): EL59-EL64.]. (Power or magnitude spectrum?) How do these features differ from the conventional mel triangular filterbank features?
Another set of features is the magnitude at the equally spaced STFT frequency bins.
Because the Gammatone filterbank resembles the processing in the human auditory system, it is often used for speech processing and perceptual studies. The STFT can also be used but has the drawback of requiring more frequency channels to obtain the same spectral resolution at low frequencies than the Gammatone filterbank.
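To make the resolution argument concrete, the following minimal sketch places 64 channel center frequencies uniformly on the ERB-rate scale, using the Glasberg and Moore approximation, and estimates how many uniformly spaced channels would be needed to match the spacing at the low end. It only illustrates channel placement, not the Gammatone filters themselves. The 80 Hz to 8000 Hz range and all function names are assumptions, not values from the thesis.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_spaced_centers(f_low, f_high, n_channels):
    """Center frequencies spaced uniformly on the ERB-rate scale."""
    low, high = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (high - low) / (n_channels - 1)
    return [erb_rate_to_hz(low + i * step) for i in range(n_channels)]


# Assumed frequency range; 64 channels as in the text.
F_LOW, F_HIGH, N_CHANNELS = 80.0, 8000.0, 64
centers = erb_spaced_centers(F_LOW, F_HIGH, N_CHANNELS)

low_spacing = centers[1] - centers[0]      # spacing between the two lowest channels
high_spacing = centers[-1] - centers[-2]   # spacing between the two highest channels

# Number of uniformly spaced channels over the same range that would be
# needed to match the ERB spacing at the low-frequency end.
uniform_channels_needed = math.ceil((F_HIGH - F_LOW) / low_spacing) + 1
```

Because the ERB-spaced channels are packed closely at low frequencies and spread out at high frequencies, a uniform grid such as the STFT needs considerably more than 64 channels to reach the same low-frequency spacing.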
The ideal binary mask has been shown to increase intelligibility under ideal conditions, whereas the Wiener filter, when tested under realistic conditions, increases quality while in most situations only preserving intelligibility.