Abstract
We present a new model-based monaural speech separation technique for separating two speech signals from a single recording of their mixture. This work addresses a fundamental limitation of current model-based monaural speech separation techniques: the assumption that the data used in the training and test phases of the separation model have the same energy level. To overcome this limitation, a gain-adapted minimum mean square error estimator is derived which estimates the sources under differing signal-to-signal ratios. Specifically, the speakers’ gains are incorporated as unknown parameters into the separation model, and the estimator is then derived in terms of the source distributions and the signal-to-signal ratio. Experimental results show that the proposed system significantly improves separation performance compared with a similar model without gain adaptation, as well as with a maximum likelihood estimator with gain estimation.
References
Camastra, F., & Vinciarelli, A. (2007). Machine learning for audio, image and video analysis: Theory and applications (advanced information and knowledge processing). New York: Springer.
Wang, D., & Brown, G. J. (Eds.) (2006). Computational auditory scene analysis: Principles, algorithms, and applications. New York: IEEE/Wiley-Interscience.
Ellis, D. (2006). Model-based scene analysis. In D. Wang & G. Brown (Eds.), Computational auditory scene analysis: Principles, algorithms, and applications. New York: Wiley/IEEE Press.
Roweis, S. (2000). One microphone source separation. In Proc. Neural Inf. Process. Syst. (pp. 793–799).
Roweis, S. T. (2003). Factorial models and refiltering for speech separation and denoising. In EUROSPEECH–03 (Vol. 7, pp. 1009–1012), May.
Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using soft mask filtering. IEEE Transactions on Audio, Speech and Language Processing, 15(8), 2299–2310, Nov.
Radfar, M. H., & Dansereau, R. M. (2007). Long-term gain estimation in model-based single channel speech separation. In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA2007). New Paltz, New York, October.
Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using minimum mean square error estimation of sources’ log spectra. In Proc. IEEE international workshop on machine learning for signal processing (MLSP 2007). Thessaloniki, Greece, Aug.
Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2006). Performance evaluation of three features for model-based single channel speech separation problem. In Interspeech 2006, Intern. Conf. on Spoken Language Processing (ICSLP’2006 Pittsburgh, USA) (pp. 17–21), Sept.
Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using maximum a posteriori estimation. In Proc. international conference on spoken language processing (Interspeech– ICSLP 07). Antwerp, Belgium, Aug.
Schmidt, M. N., & Olsson, R. K. (2007). Linear regression on sparse features for single-channel speech separation. In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA2007) (pp. 26–29). New Paltz, New York, October.
Reddy, A. M., & Raj, B. (2007). Soft mask methods for single-channel speaker separation. IEEE Transactions on Audio, Speech and Language Processing, 15(6), 1766–1776, Aug.
Weiss, R., & Ellis, D. (2006). Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking. In Proc. workshop on statistical and perceptual audition SAPA-06 (pp. 31–36), Oct.
Beierholm, T., Pedersen, B. D., & Winther, O. (2004). Low complexity Bayesian single channel source separation. In Proc. ICASSP–04 (Vol. 5, pp. 529–532), May.
Kristjansson, T., Attias, H., & Hershey, J. (2004). Single microphone source separation using high resolution signal reconstruction. In Proc. ICASSP–04 (pp. 817–820), May.
Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2007). Maximum likelihood estimation of vocal-tract-related filter characteristics for single channel speech separation. EURASIP Journal on Audio, Speech, and Music Processing, 2007, Article ID 84186, 15 pages. doi:10.1155/2007/84186.
Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2007). Monaural speech segregation based on fusion of source-driven with model-driven techniques. Speech Communication, 49(6), 464–476, June.
Reyes-Gomez, M. J., Ellis, D., & Jojic, N. (2004). Multiband audio modeling for single channel acoustic source separation. In Proc. ICASSP–04 (Vol. 5, pp. 641–644), May.
Reddy, A. M., & Raj, B. (2004). A minimum mean squared error estimator for single channel speaker separation. In INTERSPEECH–2004 (pp. 2445–2448), Oct.
Brown, G. J., & Wang, D. L. (2005). Separation of speech by computational auditory scene analysis. In Speech enhancement (pp. 371–402). New York: Springer.
Brown, G. J., & Cooke, M. (1994). Auditory scene analysis. Computer Speech and Language, 8(4), 297–336.
Cooke, M., & Ellis, D. P. W. (2001). The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35(3), 141–177, October.
Ellis, D. P. W. (1999). Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures. Speech Communication, 27(3), 281–298, April.
Nakatani, T., & Okuno, H. G. (1999). Harmonic sound stream segregation using localization and its application to speech stream segregation. Speech Communication, 27(3), 209–222, April.
Brown, G. J., & Wang, D. L. (2005). Separation of speech by computational auditory scene analysis. In J. Benesty, S. Makino, & J. Chen (Eds.), Speech enhancement (pp. 371–402). New York: Springer.
Darwin, C. J., & Carlyon, R. P. (1995). Auditory grouping. In B. C. J. Moore (Ed.), Hearing (Handbook of perception and cognition, Vol. 6, pp. 387–424). London: Academic.
Wang, D. L., & Brown, G. J. (1999). Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks, 10, 684–697, May.
Hu, G., & Wang, D. L. (2004). Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Transactions on Neural Networks, 15(5), 1135–1150, Sept.
Bregman, A. S. (1994). Auditory scene analysis: The perceptual organization of sound. Cambridge: MIT.
Li, Y., Amari, S., Cichocki, A., Ho, D. W. C., & Xie, S. (2006). Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing, 54(2), 423–437, Feb.
Theis, F. J., Puntonet, C. G., & Lang, E. W. (2006). Median-based clustering for underdetermined blind signal processing. IEEE Signal Processing Letters, 13(2), 96–99, Feb.
Bofill, P., & Zibulevsky, M. (2001). Underdetermined blind source separation using sparse representations. Signal Processing, 81, 2353–2362.
Jutten, C., & Herault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Amari, S. I., & Cardoso, J. F. (1997). Blind source separation–semiparametric statistical approach. IEEE Transactions on Signal Processing, 45(11), 2692–2700.
Luo, Y., Wang, W., Chambers, J. A., Lambotharan, S., & Proudler, I. K. (2006). Exploitation of source non-stationarity in underdetermined blind source separation with advanced clustering techniques. IEEE Transactions on Signal Processing, 54(6), 2198–2212, June.
Lewicki, M. S., & Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems (Vol. 10). Cambridge: MIT.
Schmidt, M. N., & Olsson, R. K. (2006). Single-channel speech separation using sparse non-negative matrix factorization. In Proc. Interspeech 2006, Intern. Conf. on Spoken Language Processing (ICSLP’2006 Pittsburgh), Sept.
Virtanen, T. (2003). Sound source separation using sparse coding with temporal continuity objective. In Proc. Int. Comput. Music Conference (pp. 231–234).
Jang, G. J., & Lee, T. W. (2003). A probabilistic approach to single channel source separation. In Proc. Advances in Neural Inform. Process. Systems (pp. 1173–1180).
Radfar, M. H., Banihashemi, A. H., Dansereau, R. M., & Sayadiyan, A. (2006). A non-linear minimum mean square error estimator for the mixture-maximization approximation. Electronics Letters, 42(12), 75–76, June.
Ephraim, Y. (1992). Gain-adapted hidden Markov models for recognition of clean and noisy speech. IEEE Transactions on Signal Processing, 40(6), 1303–1316, Jun.
Zhao, D. Y., & Kleijn, W. B. (2007). HMM-based gain modeling for enhancement of speech in noise. IEEE Transactions on Audio, Speech and Language Processing, 15(3), 882–892, March.
Benaroya, L., Bimbot, F., & Gribonval, R. (2006). Audio source separation with a single sensor. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 191–199, Jan.
Parsons, T. W. (1976). Separation of speech from interfering speech by means of harmonic selection. Journal of the Acoustical Society of America, 60, 911–918, Aug.
Kameoka, H., Nishimoto, T., & Sagayama, S. (2004). Multi-pitch trajectory estimation of concurrent speech based on harmonic GMM and nonlinear Kalman filtering. In INTERSPEECH–2004 (Vol. 1, pp. 2433–2436), Oct.
de Cheveigné, A., & Kawahara, H. (1999). Multiple period estimation and pitch perception model. Speech Communication, 27, 175–185, April.
Kwon, Y. H., Park, D. J., & Ihm, B. C. (2000). Simplified pitch detection algorithm of mixed speech signals. In Proc. ISCAS–2000 (Vol. 3, pp. 722–725), May.
Wu, M., Wang, D. L., & Brown, G. J. (2003). A multipitch tracking algorithm for noisy speech. IEEE Transactions on Speech and Audio Processing, 11(3), 229–241, May.
Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6), 708–716, Nov.
Chazan, D., Stettiner, Y., & Malah, D. (1993). Optimal multi-pitch estimation using the EM algorithm for co-channel speech separation. In Proc. ICASSP–93 (pp. 728–731), April.
Weintraub, M. (1986). A computational model for separating two simultaneous talkers. In Proc. ICASSP–86 (Vol. 11, pp. 81–84), April.
Hanson, B. A., & Wong, D. Y. (1984). The harmonic magnitude suppression (HMS) technique for intelligibility enhancement in the presence of interfering speech. In Proc. ICASSP–84 (Vol. 9, pp. 65–68), Mar.
Morgan, D. P., George, E. B., Lee, L. T., & Kay, S. M. (1997). Cochannel speaker separation by harmonic enhancement and suppression. IEEE Transactions on Speech and Audio Processing, 5(5), 407–424, Sept.
Kanjilal, P. P., & Palit, S. (1994). Extraction of multiple periodic waveforms from noisy data. In Proc. ICASSP-94 ( Vol. 2, pp. 361–364), April.
Ephraim, Y., & Merhav, N. (1992). Lower and upper bounds on the minimum mean-square error in composite source signal estimation. IEEE Transactions on Information Theory, 38(6), 1709–1724, Nov.
Nadas, A., Nahamoo, D., & Picheny, M. A. (1989). Speech recognition using noise-adaptive prototypes. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(10), 1495–1503, Oct.
Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Bradie, B. (2006). A friendly introduction to numerical analysis. Englewood Cliffs: Pearson Prentice Hall.
Rabiner, L. R., & Schafer, R. W. (1978). Digital processing of speech signals. Englewood Cliffs: Prentice-Hall.
Cooke, M. P., Barker, J., Cunningham, S. P., & Shao, X. (2005). An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, Nov.
Spiegel, M. R. (1998). Schaum’s mathematical handbook of formulas and tables (2nd edn). New York: McGraw-Hill, June.
Acknowledgements
The authors wish to thank the Natural Sciences and Engineering Research Council (NSERC) of Canada for funding this project. Also, the authors would like to thank the reviewers for their valuable comments.
Additional information
A preliminary version of this paper was presented at the IEEE Workshop on Machine Learning for Signal Processing (MLSP) held in Thessaloniki, Greece in August 2007.
Appendices
Appendix A: Proof of (24)
In this appendix we show that \(p(x_1^r(d)|{\mathbf y},\theta,k_1^r,k_2^r)\) is given by Eq. 24. Using Bayes’ theorem and the probability chain rule, and noting that the sources are independent, we can express \(p(x_1^r(d)|{\mathbf y},\theta,k_1^r,k_2^r)\) in terms of the observation probability and the prior probabilities of the sources. Thus, we have
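In outline, and under the assumption that the observation enters only through \(y^r(d)\), this decomposition has the familiar Bayes form

\[
p\bigl(x_1^r(d)\,\big|\,{\mathbf y},\theta,k_1^r,k_2^r\bigr)=\frac{p\bigl(y^r(d)\,\big|\,x_1^r(d),\theta,k_2^r\bigr)\,p\bigl(x_1^r(d)\,\big|\,k_1^r\bigr)}{\displaystyle\int p\bigl(y^r(d)\,\big|\,x_1^r(d),\theta,k_2^r\bigr)\,p\bigl(x_1^r(d)\,\big|\,k_1^r\bigr)\,dx_1^r(d)},
\]

where the likelihood \(p(y^r(d)\,|\,x_1^r(d),\theta,k_2^r)\) is obtained by marginalizing the second source over its prior \(p(x_2^r(d)\,|\,k_2^r)\).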
Substituting Eqs. 14 and 15 into Eq. 27, we obtain
In order to evaluate the integrals in Eq. 28, we consider the following conditions.
Thus, under condition I, we have
The term in the \(\exp(\cdot)\) function in Eq. 30 can be rewritten as
Noting that the last term in Eq. 31 is independent of \(x_1(d)\) and is hence cancelled out, we get
The integration in the denominator is simply calculated as
Substituting Eq. 33 into Eq. 32, we arrive at a Gaussian distribution in the form
In a similar fashion, under condition II Eq. 28 reduces to
Hence, Eqs. 34 and 35 give Eq. 24 in Section 3.
Appendix B: Quadratic Algorithm for Finding \(\theta^*\)
In this appendix, the procedure for finding \(\theta^*\) in Eq. 22 is described. The goal is to find the value at which the global maximum of \(Q(\theta)\) occurs. Since \(Q(\theta)\) is well approximated by a concave quadratic near its peak, a quadratic optimization approach can be used to find \(\theta^*\) in a small number of iterations (see Table 1 for the numerical results). Figure 10 shows a typical form of \(Q(\theta)\), which eases understanding of the algorithm presented in Table 2. Briefly, the algorithm works as follows. First, the coordinates of three points of \(Q(\theta)\) are determined, say \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\). Then, a quadratic function of the form \(f(x)=ax^2+bx+c\) is fitted to these three points, and the maximizer \(x^*=-\frac{b}{2a}\) of \(f(x)\) is obtained in terms of \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\) using a function called Quadratic(\(\cdot\)), that is, \(x^*=\mathrm{Quadratic}(\theta_l,A,\theta_r,B,\theta_c,C)\). Next, using \(x^*\) and \(Q(x^*)\), the coordinates \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\) are updated until a value of \(\theta_c\) is reached such that \(Q(\theta_l)\le Q(\theta_c)\ge Q(\theta_r)\), as illustrated by the sketch below.
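For concreteness, the following Python sketch implements a successive parabolic interpolation of this kind; quadratic_vertex plays the role of Quadratic(\(\cdot\)), while the bracket-update rule and the function names are our own illustrative simplification of Table 2, not a verbatim transcription.

def quadratic_vertex(tl, A, tr, B, tc, C):
    """Vertex x* = -b/(2a) of the parabola f(x) = a*x^2 + b*x + c
    interpolating (tl, A), (tc, C), (tr, B)."""
    num = (tc - tl) ** 2 * (C - B) - (tc - tr) ** 2 * (C - A)
    den = (tc - tl) * (C - B) - (tc - tr) * (C - A)
    if den == 0.0:              # three collinear samples: no curvature to fit
        return tc
    return tc - 0.5 * num / den

def maximize_q(Q, tl, tc, tr, tol=1e-4, max_iter=50):
    """Successive parabolic interpolation for a unimodal Q(theta),
    starting from a bracket with Q(tl) <= Q(tc) >= Q(tr)."""
    A, C, B = Q(tl), Q(tc), Q(tr)
    for _ in range(max_iter):
        x = quadratic_vertex(tl, A, tr, B, tc, C)
        if abs(x - tc) < tol:   # vertex agrees with current centre: converged
            return tc
        q = Q(x)
        if x < tc:
            if q >= C:          # new point is best: it becomes the centre
                tr, B, tc, C = tc, C, x, q
            else:               # keep the centre, tighten the left edge
                tl, A = x, q
        else:
            if q >= C:
                tl, A, tc, C = tc, C, x, q
            else:
                tr, B = x, q
    return tc

For example, maximize_q(lambda t: -(t - 0.3) ** 2, -1.0, 0.0, 1.0) recovers the maximizer 0.3 in a single parabolic step, reflecting the fast convergence reported in Table 1.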
Appendix C: Derivation of (10)
In this appendix, we show how the computation of the right-hand side of Eq. 9 leads to Eq. 10, that is
The procedure is similar to that presented in [42] except that the source signal gains, \(g_1\) and \(g_2\), are incorporated. Let \(\dot{x}_i^r=\bigl|\mathcal{F}_{\!D}\bigl(\{X_i(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)\bigr|\), \(i\in\{1,2\}\), and \(\dot{y}^r=\bigl|\mathcal{F}_{\!D}\bigl(\{Y(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)\bigr|\) represent the magnitudes of the D-point discrete Fourier transforms of the sources and the observation signal, respectively. Let \(\phi^r=\angle\mathcal{F}_{\!D}\bigl(\{X_1(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)-\angle\mathcal{F}_{\!D}\bigl(\{X_2(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)=[\phi^r(0),\ldots,\phi^r(d),\ldots,\phi^r(D-1)]^{\top}\) denote the phase difference between the sources, where \(\angle\) denotes the phase operator. Given Eq. 2, the relation between the log magnitude \(y^r(d)=\log_{10}\dot{y}^r(d)\) and the source magnitudes \(\dot{x}_i^r(d)\), \(i\in\{1,2\}\), is given by
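In sketch form, taking Eq. 2 to be the gain-weighted additive mixture \(Y(t)=g_1X_1(t)+g_2X_2(t)\), the squared mixture magnitude obeys the law-of-cosines relation

\[
\bigl(\dot{y}^r(d)\bigr)^2=\bigl(g_1\dot{x}_1^r(d)\bigr)^2+\bigl(g_2\dot{x}_2^r(d)\bigr)^2+2g_1g_2\,\dot{x}_1^r(d)\,\dot{x}_2^r(d)\cos\phi^r(d),
\]

and \(y^r(d)\) is one half the base-10 logarithm of the right-hand side.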
The goal is to obtain the MMSE estimate of \(y^r(d)\) given \(x_1^r(d)\), \(x_2^r(d)\), \(g_1\), and \(g_2\). Mathematically, the MMSE estimator is expressed as
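In outline, this is the conditional mean

\[
\hat{y}^r(d)=E\bigl[y^r(d)\,\big|\,x_1^r(d),x_2^r(d),g_1,g_2\bigr],
\]

which minimizes the mean square error among all estimators built from these quantities.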
From Eq. 37, and initially assuming that \(\dot{x}_1^r(d)\), \(\dot{x}_2^r(d)\), \(g_1\), and \(g_2\) are given, the only random variable on the right-hand side of Eq. 37 is \(\phi^r(d)\) which, as shown in [42], can be modeled by a uniform distribution over the interval \([-\pi,\pi]\); that is, \(p\bigl(\phi^r(d)\bigr)=\frac{1}{2\pi}\), where \(p\bigl(\phi^r(d)\bigr)\) denotes the PDF of \(\phi^r(d)\). Therefore,
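Written out under the sketch of Eq. 37 above, the expectation becomes

\[
\hat{y}^r(d)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{1}{2}\log_{10}\Bigl[\bigl(g_1\dot{x}_1^r(d)\bigr)^2+\bigl(g_2\dot{x}_2^r(d)\bigr)^2+2g_1g_2\,\dot{x}_1^r(d)\,\dot{x}_2^r(d)\cos\phi^r(d)\Bigr]\,d\phi^r(d).
\]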
The above integration is computed using an integration table (e.g., [63, p. 111]) and the result is
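The key identity here is, in sketch form, the classical result

\[
\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln\bigl(a^2+b^2+2ab\cos\phi\bigr)\,d\phi=2\ln\max(a,b),\qquad a,b>0,
\]

which, with \(a=g_1\dot{x}_1^r(d)\) and \(b=g_2\dot{x}_2^r(d)\), yields

\[
\hat{y}^r(d)=\log_{10}\max\bigl(g_1\dot{x}_1^r(d),\,g_2\dot{x}_2^r(d)\bigr).
\]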
Noting that \(\log_{10}g_1=h(\theta)\), \(\log_{10}g_2=h(-\theta)\), and \(\log_{10}\bigl(\dot{x}_i^r(d)\bigr)=x_i^r(d)\), \(i\in\{1,2\}\), Eq. 40 can be rewritten as
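that is, in sketch form,

\[
\hat{y}^r(d)=\max\bigl(x_1^r(d)+h(\theta),\;x_2^r(d)+h(-\theta)\bigr),
\]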
which is identical to Eq. 10.