Much effort has been devoted to estimating the Ideal Binary Mask (IBM) via statistical learning methods, among which the Bayesian method is a common choice. However, one drawback is that the mask is estimated for each time-frequency (T-F) unit independently, so the correlation between units is not fully taken into account. This paper attempts to model the local correlation between the mask labels of adjacent units directly. The approach is derived from the demonstrated assumption that units belonging to one segment are mainly dominated by one source. In addition, a local noise level tracking stage is incorporated. The local level is obtained by averaging over several adjacent units and can be regarded as an approximation to the true noise energy. It is used as an intermediary auxiliary variable to indicate the correlation. With some secondary factors omitted, the high-dimensional posterior distribution is simulated by a Markov chain Monte Carlo method.
The main computational goal of computational auditory scene analysis (CASA) has been set as obtaining the ideal binary mask.
This paper uses a T-F representation produced by a bank of auditory filters, in the form of a cochleagram. Under this T-F representation, the concept of the IBM is directly motivated by the auditory masking phenomenon: roughly speaking, within a critical band the louder sound renders the weaker sound inaudible.
The threshold LC is a local signal-to-noise ratio (SNR) criterion in dB. Varying LC leads to different IBMs, and many researchers have focused on the selection of this threshold. One study suggested that the IBM defined by a -6 dB criterion produces dramatic intelligibility improvements. Other studies showed that the IBM gives the optimal SNR gain under a 0 dB threshold. Generally, one can start with 0 dB and vary it only when necessary.
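The role of the LC threshold can be illustrated with a minimal sketch. It assumes the premixed target and noise energies are available for every T-F unit (which is what makes the mask "ideal") and that a unit is labeled 1 when its local SNR strictly exceeds LC; the function name `ideal_binary_mask`, the `[channel][frame]` list layout, and the small `eps` guard are illustrative choices, not taken from the paper.

```python
import math

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Label each T-F unit 1 if its local SNR in dB exceeds lc_db, else 0.

    Both inputs are nested lists indexed as [channel][frame] and hold the
    non-negative energies of the premixed target and noise (assumed layout).
    """
    eps = 1e-12  # hypothetical guard against log(0) and division by zero
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            snr_db = 10.0 * math.log10((t + eps) / (n + eps))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy example: two channels, two frames.
target = [[4.0, 1.0], [0.5, 0.1]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mask_0db = ideal_binary_mask(target, noise, lc_db=0.0)
mask_m6db = ideal_binary_mask(target, noise, lc_db=-6.0)
```

Lowering LC from 0 dB to -6 dB retains more units in the toy example, which is the effect that the threshold-selection studies above examine.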
The input signal is decomposed in the frequency domain by a 64-channel gammatone filterbank, which is a standard model of cochlear filtering. The center frequencies are equally distributed on the equivalent rectangular bandwidth (ERB) scale from 50 Hz to 8000 Hz.
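As a rough sketch of how such center frequencies can be placed, the following spaces 64 channels uniformly on the ERB-rate scale between 50 Hz and 8000 Hz. It assumes the commonly used Glasberg and Moore ERB-rate formula, 21.4 log10(1 + 0.00437 f); the paper does not state which formula it uses, and the function names here are hypothetical.

```python
import math

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to ERB-rate (Glasberg and Moore formula, assumed)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_space(low_hz=50.0, high_hz=8000.0, n_channels=64):
    """Return n_channels center frequencies equally spaced on the ERB-rate scale."""
    lo = hz_to_erb_rate(low_hz)
    hi = hz_to_erb_rate(high_hz)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]

center_freqs = erb_space()
```

Because the spacing is uniform in ERB-rate rather than in Hz, the channels are dense at low frequencies and sparse at high frequencies.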
IBM estimation, which is the main goal of CASA, can be viewed as a binary classification problem.
Extracting accurate pitch contours from mixtures will improve the IBM estimation greatly.
This paper focuses on IBM estimation under the assumption that the pitch is given.
In this paper, T-F segmentation and noise level tracking are used to capture the correlation between adjacent units from different perspectives.
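To make the idea of a local level concrete, the sketch below averages a per-unit energy estimate over a small rectangular neighborhood of adjacent T-F units. The neighborhood shape, the half-widths, the truncation at the edges, and the name `local_average` are all assumptions made for illustration; the paper's actual tracking stage may weight or select units differently.

```python
def local_average(energy, half_width_f=1, half_width_t=1):
    """Average each unit with its neighbors in frequency and time.

    energy is a nested list indexed as [channel][frame] (assumed layout).
    Neighborhoods are truncated at the edges of the T-F plane.
    """
    n_ch = len(energy)
    n_fr = len(energy[0]) if n_ch else 0
    out = []
    for c in range(n_ch):
        row = []
        for m in range(n_fr):
            total = 0.0
            count = 0
            for cc in range(max(0, c - half_width_f), min(n_ch, c + half_width_f + 1)):
                for mm in range(max(0, m - half_width_t), min(n_fr, m + half_width_t + 1)):
                    total += energy[cc][mm]
                    count += 1
            row.append(total / count)
        out.append(row)
    return out

# Toy 3x3 grid of per-unit noise energy estimates.
noise_estimate = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
local_level = local_average(noise_estimate)
```

Since neighboring units share most of their averaging window, their local levels are strongly coupled, which is why such a quantity can act as an auxiliary variable linking adjacent mask labels.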