Monaural Speech Separation Based on Gain Adapted Minimum Mean Square Error Estimation

Journal of Signal Processing Systems

Abstract

We present a new model-based monaural speech separation technique for separating two speech signals from a single recording of their mixture. This work addresses a fundamental limitation of current model-based monaural speech separation techniques: the assumption that the data used in the training and test phases of the separation model have the same energy level. To overcome this limitation, a gain adapted minimum mean square error estimator is derived which estimates the sources under different signal-to-signal ratios. Specifically, the speakers’ gains are incorporated as unknown parameters into the separation model, and the estimator is then derived in terms of the source distributions and the signal-to-signal ratio. Experimental results show that the proposed system significantly improves separation performance compared with a similar model without gain adaptation, as well as with a maximum likelihood estimator with gain estimation.


References

1. Camastra, F., & Vinciarelli, A. (2007). Machine learning for audio, image and video analysis: Theory and applications (advanced information and knowledge processing). New York: Springer.

2. Wang, D., & Brown, G. J. (Eds.) (2006). Computational auditory scene analysis: Principles, algorithms, and applications. New York: IEEE/Wiley-Interscience.

3. Ellis, D. (2006). Model-based scene analysis. In D. Wang & G. Brown (Eds.), Computational auditory scene analysis: Principles, algorithms, and applications. New York: Wiley/IEEE Press.

4. Roweis, S. (2000). One microphone source separation. In Proc. Neural Inf. Process. Syst. (pp. 793–799).

5. Roweis, S. T. (2003). Factorial models and refiltering for speech separation and denoising. In EUROSPEECH–03 (Vol. 7, pp. 1009–1012), May.

6. Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using soft mask filtering. IEEE Transactions on Audio, Speech and Language Processing, 15(8), 2299–2310, Nov.

7. Radfar, M. H., & Dansereau, R. M. (2007). Long-term gain estimation in model-based single channel speech separation. In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA 2007). New Paltz, New York, October.

8. Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using minimum mean square error estimation of sources’ log spectra. In Proc. IEEE international workshop on machine learning for signal processing (MLSP 2007). Thessaloniki, Greece, Aug.

9. Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2006). Performance evaluation of three features for model-based single channel speech separation problem. In Proc. Interspeech 2006, Intern. Conf. on Spoken Language Processing (ICSLP 2006), Pittsburgh, USA (pp. 17–21), Sept.

10. Radfar, M. H., & Dansereau, R. M. (2007). Single channel speech separation using maximum a posteriori estimation. In Proc. international conference on spoken language processing (Interspeech–ICSLP 07). Antwerp, Belgium, Aug.

11. Schmidt, M. N., & Olsson, R. K. (2007). Linear regression on sparse features for single-channel speech separation. In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA 2007) (pp. 26–29). New Paltz, New York, October.

12. Reddy, A. M., & Raj, B. (2007). Soft mask methods for single-channel speaker separation. IEEE Transactions on Audio, Speech and Language Processing, 15(6), 1766–1776, Aug.

13. Weiss, R., & Ellis, D. (2006). Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking. In Proc. workshop on statistical and perceptual audition (SAPA-06) (pp. 31–36), Oct.

14. Beierholm, T., Pedersen, B. D., & Winther, O. (2004). Low complexity Bayesian single channel source separation. In Proc. ICASSP–04 (Vol. 5, pp. 529–532), May.

15. Kristjansson, T., Attias, H., & Hershey, J. (2004). Single microphone source separation using high resolution signal reconstruction. In Proc. ICASSP–04 (pp. 817–820), May.

16. Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2007). A maximum likelihood estimation of vocal-tract-related filter characteristics for single channel speech separation. EURASIP Journal on Audio, Speech, and Music Processing, 2007, Article ID 84186. doi:10.1155/2007/84186.

17. Radfar, M. H., Dansereau, R. M., & Sayadiyan, A. (2007). Monaural speech segregation based on fusion of source-driven with model-driven techniques. Speech Communication, 49(6), 464–476, June.

18. Reyes-Gomez, M. J., Ellis, D., & Jojic, N. (2004). Multiband audio modeling for single channel acoustic source separation. In Proc. ICASSP–04 (Vol. 5, pp. 641–644), May.

19. Reddy, A. M., & Raj, B. (2004). A minimum mean squared error estimator for single channel speaker separation. In INTERSPEECH–2004 (pp. 2445–2448), Oct.

20. Brown, G. J., & Wang, D. L. (2005). Separation of speech by computational auditory scene analysis. In Speech enhancement (pp. 371–402). New York: Springer.

21. Brown, G. J., & Cooke, M. (1994). Computational auditory scene analysis. Computer Speech and Language, 8(4), 297–336.

22. Cooke, M., & Ellis, D. P. W. (2001). The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35(3), 141–177, October.

23. Ellis, D. P. W. (1999). Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures. Speech Communication, 27(3), 281–298, April.

24. Nakatani, T., & Okuno, H. G. (1999). Harmonic sound stream segregation using localization and its application to speech stream segregation. Speech Communication, 27(3), 209–222, April.

25. Brown, G. J., & Wang, D. L. (2005). Separation of speech by computational auditory scene analysis. In J. Benesty, S. Makino, & J. Chen (Eds.), Speech enhancement (pp. 371–402). New York: Springer.

26. Darwin, C. J., & Carlyon, R. P. (1995). Auditory grouping. In B. C. J. Moore (Ed.), The handbook of perception and cognition (Vol. 6, Hearing, pp. 387–424). London: Academic.

27. Wang, D. L., & Brown, G. J. (1999). Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks, 10(3), 684–697, May.

28. Hu, G., & Wang, D. L. (2004). Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Transactions on Neural Networks, 15(5), 1135–1150, Sept.

29. Bregman, A. S. (1994). Auditory scene analysis: The perceptual organization of sound. Cambridge: MIT Press.

30. Li, Y., Amari, S., Cichocki, A., Ho, D. W. C., & Xie, S. (2006). Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing, 54(2), 423–437, Feb.

31. Theis, F. J., Puntonet, C. G., & Lang, E. W. (2006). Median-based clustering for underdetermined blind signal processing. IEEE Signal Processing Letters, 13(2), 96–99, Feb.

32. Bofill, P., & Zibulevsky, M. (2001). Underdetermined blind source separation using sparse representations. Signal Processing, 81, 2353–2362.

33. Jutten, C., & Herault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.

34. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

35. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

36. Amari, S. I., & Cardoso, J. F. (1997). Blind source separation: Semiparametric statistical approach. IEEE Transactions on Signal Processing, 45(11), 2692–2700.

37. Luo, Y., Wang, W., Chambers, J. A., Lambotharan, S., & Proudler, I. K. (2006). Exploitation of source non-stationarity in underdetermined blind source separation with advanced clustering techniques. IEEE Transactions on Signal Processing, 54(6), 2198–2212, June.

38. Lewicki, M. S., & Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems (Vol. 10). Cambridge: MIT Press.

39. Schmidt, M. N., & Olsson, R. K. (2006). Single-channel speech separation using sparse non-negative matrix factorization. In Proc. Interspeech 2006, Intern. Conf. on Spoken Language Processing (ICSLP 2006), Pittsburgh, Sept.

40. Virtanen, T. (2003). Sound source separation using sparse coding with temporal continuity objective. In Proc. Int. Comput. Music Conference (pp. 231–234).

41. Jang, G. J., & Lee, T. W. (2003). A probabilistic approach to single channel source separation. In Proc. Advances in Neural Inform. Process. Systems (pp. 1173–1180).

42. Radfar, M. H., Banihashemi, A. H., Dansereau, R. M., & Sayadiyan, A. (2006). A non-linear minimum mean square error estimator for the mixture-maximization approximation. Electronics Letters, 42(12), 75–76, June.

43. Ephraim, Y. (1992). Gain-adapted hidden Markov models for recognition of clean and noisy speech. IEEE Transactions on Signal Processing, 40(6), 1303–1316, June.

44. Zhao, D. Y., & Kleijn, W. B. (2007). HMM-based gain modeling for enhancement of speech in noise. IEEE Transactions on Audio, Speech and Language Processing, 15(3), 882–892, March.

45. Benaroya, L., Bimbot, F., & Gribonval, R. (2006). Audio source separation with a single sensor. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 191–199, Jan.

46. Parsons, T. W. (1976). Separation of speech from interfering speech by means of harmonic selection. Journal of the Acoustical Society of America, 60, 911–918, Aug.

47. Kameoka, H., Nishimoto, T., & Sagayama, S. (2004). Multi-pitch trajectory estimation of concurrent speech based on harmonic GMM and nonlinear Kalman filtering. In INTERSPEECH–2004 (Vol. 1, pp. 2433–2436), Oct.

48. de Cheveigné, A., & Kawahara, H. (1999). Multiple period estimation and pitch perception model. Speech Communication, 27, 175–185, April.

49. Kwon, Y. H., Park, D. J., & Ihm, B. C. (2000). Simplified pitch detection algorithm of mixed speech signals. In Proc. ISCAS–2000 (Vol. 3, pp. 722–725), May.

50. Wu, M., Wang, D. L., & Brown, G. J. (2003). A multipitch tracking algorithm for noisy speech. IEEE Transactions on Speech and Audio Processing, 11(3), 229–241, May.

51. Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6), 708–716, Nov.

52. Chazan, D., Stettiner, Y., & Malah, D. (1993). Optimal multi-pitch estimation using the EM algorithm for co-channel speech separation. In Proc. ICASSP–93 (pp. 728–731), April.

53. Weintraub, M. (1986). A computational model for separating two simultaneous talkers. In Proc. ICASSP–86 (Vol. 11, pp. 81–84), April.

54. Hanson, B. A., & Wong, D. Y. (1984). The harmonic magnitude suppression (HMS) technique for intelligibility enhancement in the presence of interfering speech. In Proc. ICASSP–84 (Vol. 9, pp. 65–68), Mar.

55. Morgan, D. P., George, E. B., Lee, L. T., & Kay, S. M. (1997). Cochannel speaker separation by harmonic enhancement and suppression. IEEE Transactions on Speech and Audio Processing, 5(5), 407–424, Sept.

56. Kanjilal, P. P., & Palit, S. (1994). Extraction of multiple periodic waveforms from noisy data. In Proc. ICASSP–94 (Vol. 2, pp. 361–364), April.

57. Ephraim, Y., & Merhav, N. (1992). Lower and upper bounds on the minimum mean-square error in composite source signal estimation. IEEE Transactions on Information Theory, 38(6), 1709–1724, Nov.

58. Nadas, A., Nahamoo, D., & Picheny, M. A. (1989). Speech recognition using noise-adaptive prototypes. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(10), 1495–1503, Oct.

59. Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill.

60. Bradie, B. (2006). A friendly introduction to numerical analysis. Englewood Cliffs: Pearson Prentice Hall.

61. Rabiner, L. R., & Schafer, R. W. (1978). Digital processing of speech signals. Englewood Cliffs: Prentice-Hall.

62. Cooke, M. P., Barker, J., Cunningham, S. P., & Shao, X. (2005). An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, Nov.

63. Spiegel, M. R. (1998). Schaum’s mathematical handbook of formulas and tables (2nd edn). New York: McGraw-Hill, June.


Acknowledgements

The authors wish to thank the Natural Sciences and Engineering Research Council (NSERC) of Canada for funding this project. Also, the authors would like to thank the reviewers for their valuable comments.

Corresponding author

Correspondence to M. H. Radfar.

Additional information

A preliminary version of this paper was presented at the IEEE Workshop on Machine Learning for Signal Processing (MLSP), held in Thessaloniki, Greece, in August 2007.

Appendices

Appendix A: Proof of (24)

In this appendix, we show that \(p(x_1^r(d)|{\mathbf y},\theta,k_1^r,k_2^r)\) is given by Eq. 24. Using Bayes’ theorem and the probability chain rule, and noting that the sources are independent, we can express \(p(x_1^r(d)|{\mathbf y},\theta,k_1^r,k_2^r)\) in terms of the observation probability and the prior probability of the sources. Thus, we have

$$\begin{array}{lll} p\left(x_1^r(d)|{\mathbf y},\theta,k_1^r,k_2^r\right)&=&\displaystyle\frac{p\left(x_1^r(d),{\mathbf y},\theta,k_1^r,k_2^r\right)}{p\left({\mathbf y},\theta,k_1^r,k_2^r\right)}\\[6pt] &=&\displaystyle\frac{\int_{x_2^r(d)} p\left(x_1^r(d),x_2^r(d),{\mathbf y},\theta,k_1^r,k_2^r\right)\,dx_2^r(d)}{\int_{x_1^r(d)}\int_{x_2^r(d)} p\left(x_1^r(d),x_2^r(d),{\mathbf y},\theta,k_1^r,k_2^r\right)\,dx_2^r(d)\,dx_1^r(d)}\\[6pt] &=&\displaystyle\frac{\int_{x_2^r(d)} p\left({\mathbf y}|x_1^r(d),x_2^r(d),\theta,k_1^r,k_2^r\right) p\left(x_1^r(d)|k_1^r\right)p\left(x_2^r(d)|k_2^r\right)\,dx_2^r(d)}{\int_{x_1^r(d)}\int_{x_2^r(d)} p\left({\mathbf y}|x_1^r(d),x_2^r(d),\theta,k_1^r,k_2^r\right)p\left(x_1^r(d)|k_1^r\right)p\left(x_2^r(d)|k_2^r\right)\,dx_2^r(d)\,dx_1^r(d)} \end{array}$$
(27)

Substituting Eqs. 14 and 15 into Eq. 27, we obtain

$$ p\left(x_1^r(d)|{\mathbf y},k_1^r,k_2^r\right) =\frac{\int_{x_2^r(d)} \exp\left(\frac{-\left(y(d)-f\left(x^r_{1}(d),x^r_{2}(d),\theta\right)\right)^2}{2\sigma^2(d)}\right) \exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right) \exp\left(\frac{-\left(x_2^r(d)-\mu_2^{k_2^r}(d)\right)^2}{2\sigma_2^{2k_2^r}(d)}\right)\,dx_2^r(d)}{\int_{x_1^r(d)}\int_{x_2^r(d)} \exp\left(\frac{-\left(y(d)-f\left(x^r_{1}(d),x^r_{2}(d),\theta\right)\right)^2}{2\sigma^2(d)}\right) \exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right) \exp\left(\frac{-\left(x_2^r(d)-\mu_2^{k_2^r}(d)\right)^2}{2\sigma_2^{2k_2^r}(d)}\right)\,dx_2^r(d)\,dx_1^r(d)}. $$
(28)

To evaluate the integrals in Eq. 28, we consider the following two conditions.

$$ f\left(x^r_{1}(d),x^r_{2}(d),\theta\right)= \begin{cases} x_1^r(d)+h(\theta),& \mu^{k_1^r}_{1}(d)+h(\theta)\geq \mu^{k_2^r}_{2}(d)+h(-\theta)\qquad \text{(condition I)}\\[5pt] x_2^r(d)+h(-\theta),& \mu^{k_1^r}_{1}(d)+h(\theta)< \mu^{k_2^r}_{2}(d)+h(-\theta)\qquad \text{(condition II)} \end{cases} $$
(29)

Thus, under condition I, we have

$$ p\left(x_1^r(d)|{\mathbf y},k_1^r,k_2^r\right) =\frac{ \exp\left(\frac{-\left(y(d)-x_1^r(d)-h(\theta)\right)^2}{2\sigma^2(d)} +\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right)}{\int_{x_1^r(d)} \exp\left(\frac{-\left(y(d)-x_1^r(d)-h(\theta)\right)^2}{2\sigma^2(d)}+ \frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right)dx_1^r(d)}. $$
(30)

The exponent in Eq. 30 can be rewritten as

$$ -\frac{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}{2\sigma_1^{2k_1^r}(d)\sigma^2(d)} \left(x_1^r(d)-\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\right)^2 -\frac{\left(y(d)-h(\theta)-\mu_1^{k_1^r}(d)\right)^2}{2\left(\sigma_1^{2k_1^r}(d)+\sigma^2(d)\right)}. $$
(31)

Noting that the last term in Eq. 31 is independent of \(x_1^r(d)\), and hence cancels between the numerator and denominator, we get

$$ p\left(x_1^r(d)|{\mathbf y},k_1^r,k_2^r\right) =\displaystyle \frac{\exp\left(-\frac{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}{2\sigma_1^{2k_1^r}(d)\sigma^2(d)}\bigg(x_1^r(d)-\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\bigg)^2\right) }{\int_{x_1^r(d)}\exp\left(-\frac{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}{2\sigma_1^{2k_1^r}(d)\sigma^2(d)}\bigg(x_1^r(d)-\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\bigg)^2\right)\,dx_1(d)}. $$
(32)

The integral in the denominator is a standard Gaussian integral:

$$\begin{array}{lll} \int_{x_1^r(d)}\exp\left(-\frac{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}{2\sigma_1^{2k_1^r}(d)\sigma^2(d)}\left(x_1^r(d) -\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\right)^2\right)\,dx_1^r(d) =\sqrt{2\pi \frac{\sigma_1^{2k_1^r}(d)\sigma^2(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}}. \end{array}$$
(33)

Substituting Eq. 33 into Eq. 32, we arrive at a Gaussian distribution in the form

$$\begin{array}{ll} p\left(x_1^r(d)|{\mathbf y},k_1^r,k_2^r\right)&= \frac{1}{\sqrt{2\pi \frac{\sigma_1^{2k_1^r}(d)\sigma^2(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}}} \exp\left(-\frac{\bigl(x_1^r(d)-\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\bigr)^2} {2\frac{\sigma_1^{2k_1^r}(d)\sigma^2(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}}\right)\nonumber\\ &= \mathcal{N}\left(\frac{\sigma_1^{2k_1^r}(d)y(d)-\sigma_1^{2k_1^r}(d)h(\theta)+\sigma^2(d)\mu_1^{k_1^r}(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)},\frac{\sigma_1^{2k_1^r}(d)\sigma^2(d)}{\sigma_1^{2k_1^r}(d)+\sigma^2(d)}\right). \end{array}$$
(34)

In a similar fashion, under condition II, Eq. 28 reduces to

$$\begin{aligned} & p\left(x_1^r(d)|{\mathbf y},k_1^r,k_2^r\right) \\ & \quad =\frac{ \int_{x_2^r(d)} \exp\left(\frac{-\left(y(d)-x_2^r(d)-h(-\theta)\right)^2}{2\sigma^2(d)}\right) \exp\left(\frac{-\left(x_2^r(d)-\mu_2^{k_2^r}(d)\right)^2}{2\sigma_2^{2k_2^r}(d)}\right)\, dx_2^r(d)}{\int_{x_2^r(d)} \exp\left(\frac{-\left(y(d)-x_2^r(d)-h(-\theta)\right)^2}{2\sigma^2(d)}\right) \exp\left(\frac{-\left(x_2^r(d)-\mu_2^{k_2^r}(d)\right)^2}{2\sigma_2^{2k_2^r}(d)}\right)\,dx_2^r(d)}\times\frac{ \exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right) }{ \int_{x_1^r(d)} \exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right)\,dx_1^r(d)}\\ & \quad = \frac{\exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right)}{\int_{x_1^r(d)} \exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right)\,dx_1^r(d)}= \frac{1}{\sqrt{2\pi\sigma_1^{2k_1^r}(d)}}\exp\left(\frac{-\left(x_1^r(d)-\mu_1^{k_1^r}(d)\right)^2}{2\sigma_1^{2k_1^r}(d)}\right) =\mathcal{N}\left(\mu_1^{k_1^r}(d),\sigma_1^{2k_1^r}(d)\right). \end{aligned}$$
(35)

Hence, Eqs. 34 and 35 give Eq. 24 in Section 3.
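As a numerical sanity check on Eq. 34 (a sketch, not part of the original paper), the following Python snippet compares the closed-form posterior mean and variance with a brute-force grid evaluation of \(p(x_1^r(d)|{\mathbf y},k_1^r,k_2^r)\) under condition I. All numeric inputs (prior mean and variance, \(h(\theta)\), \(\sigma^2(d)\), and the observation) are made-up test values.

```python
import numpy as np

# Sketch: verify the Gaussian posterior of Eq. 34 under condition I.
# All numeric values are hypothetical test inputs, not values from the paper.
mu1, s1sq = 0.8, 0.5   # prior mean / variance of x1^r(d) for component k_1^r
h_theta = 0.3          # h(theta), the log-gain term of source 1
ssq = 0.2              # sigma^2(d), variance of the modeling error
y = 1.5                # observed log magnitude y(d)

# Closed form (Eq. 34).
post_mean = (s1sq * (y - h_theta) + ssq * mu1) / (s1sq + ssq)
post_var = s1sq * ssq / (s1sq + ssq)

# Brute force: p(x1|y) is proportional to
# exp(-(y - x1 - h)^2 / (2*ssq)) * exp(-(x1 - mu1)^2 / (2*s1sq)).
x1 = np.linspace(-5.0, 5.0, 400001)
logp = -(y - x1 - h_theta) ** 2 / (2 * ssq) - (x1 - mu1) ** 2 / (2 * s1sq)
p = np.exp(logp - logp.max())
p /= p.sum()  # normalize on the grid

num_mean = np.sum(p * x1)
num_var = np.sum(p * (x1 - num_mean) ** 2)

print(f"mean: closed form {post_mean:.6f}, numeric {num_mean:.6f}")
print(f"var : closed form {post_var:.6f}, numeric {num_var:.6f}")
```

The two pairs of numbers agree to several decimal places, which is exactly the conjugate-Gaussian behavior the derivation exploits: a Gaussian likelihood in \(x_1^r(d)\) times a Gaussian prior yields a Gaussian posterior.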

Appendix B: Quadratic Algorithm for Finding θ*

In this appendix, the procedure for finding \(\theta^*\) given in Eq. 22 is described. The goal is to find the value at which the global maximum of \(Q(\theta)\) occurs. Since \(Q(\theta)\) is well approximated by a concave quadratic around its maximum, a quadratic optimization approach can find \(\theta^*\) in a small number of iterations (see Table 1 for the numerical results). Figure 10 shows a typical form of \(Q(\theta)\), which aids in understanding the algorithm presented in Table 2. In brief, the algorithm works as follows. First, the coordinates of three points on \(Q(\theta)\) are determined, say \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\). Then, a quadratic function of the form \(f(x)=ax^2+bx+c\) is fitted to these three points, and the maximizer \(x^*=-\frac{b}{2a}\) of \(f(x)\) is obtained in terms of \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\) using a function called Quadratic(·), that is, \(x^*=Quadratic(\theta_l,A,\theta_r,B,\theta_c,C)\). Next, using \(x^*\) and \(Q(x^*)\), the coordinates \(\{(\theta_l,A),(\theta_c,C),(\theta_r,B)\}\) are updated until a value of \(\theta_c\) is reached such that \(Q(\theta_l)\leq Q(\theta_c)\geq Q(\theta_r)\); a minimal code sketch of this procedure is given after Table 2.

Figure 10 Typical form of \(Q(\theta)\) with three marked points corresponding to \(A=Q(\theta_l)\), \(C=Q(\theta_c)\), and \(B=Q(\theta_r)\).

Table 2 Quadratic optimization algorithm
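To make the search concrete, here is a minimal Python sketch of a bracketed quadratic-fit search in the spirit of Table 2. It is not the paper’s exact listing: the function names (quadratic, maximize_q), the iteration cap, the tolerance, and the toy objective are our own illustrative assumptions.

```python
import numpy as np

def quadratic(tl, A, tr, B, tc, C):
    """Fit f(x) = a*x^2 + b*x + c through (tl, A), (tc, C), (tr, B)
    and return the stationary point x* = -b / (2a) of the parabola."""
    a, b, _ = np.polyfit([tl, tc, tr], [A, C, B], 2)
    return -b / (2.0 * a)

def maximize_q(Q, tl, tc, tr, iters=30, tol=1e-8):
    """Bracketed quadratic search: requires tl < tc < tr with
    Q(tl) <= Q(tc) >= Q(tr), i.e. the three points bracket the peak."""
    for _ in range(iters):
        x = quadratic(tl, Q(tl), tr, Q(tr), tc, Q(tc))
        if not (tl < x < tr) or abs(x - tc) < tol:
            break                  # converged, or the fit left the bracket
        if x > tc:
            if Q(x) >= Q(tc):
                tl, tc = tc, x     # peak lies in [tc, tr]; recenter at x
            else:
                tr = x             # peak lies in [tl, x]; keep center tc
        else:
            if Q(x) >= Q(tc):
                tr, tc = tc, x     # mirror image of the case above
            else:
                tl = x
    return tc

# Toy concave objective standing in for Q(theta); its maximizer is 0.7.
Q = lambda t: 1.0 - (t - 0.7) ** 2
print(maximize_q(Q, -2.0, 0.0, 2.0))  # prints ~0.7
```

Because each update keeps three points that bracket the peak, the loop terminates with \(Q(\theta_l)\leq Q(\theta_c)\geq Q(\theta_r)\), matching the stopping condition described above.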

Appendix C: Derivation of (10)

In this appendix, we show how the computation of the right-hand side of Eq. 9 leads to Eq. 10, that is

$$ E\Bigl(y^r(d)\big|x_1^r(d),x_2^r(d),\theta\Bigr) =\max\Bigl(x_1^r(d)+h(\theta),x_2^r(d)+h(-\theta)\Bigr). $$
(36)

The procedure is similar to that presented in [42], except that the source signal gains, \(g_1\) and \(g_2\), are incorporated. Let

$$ \begin{array}{ll} {\dot{{\mathbf x}}^r_i} & {=\Bigl|\mathcal{F}_D\Bigl(\{X_i(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\Bigr)\Bigr|} \\ {} &{=\left[\dot{x}^r_i(0),\ldots,\dot{x}^r_i(d),\ldots,\dot{x}^r_i(D-1)\right]^{\top},} \end{array}$$

and

$$ \begin{array}{ll} \dot{{\mathbf y}}^r&=\Bigl|\mathcal{F}_D\Bigl(\{Y(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\Bigr)\Bigr| \\ &=\left[\dot{y}^r(0),\ldots,\dot{y}^r(d),\ldots,\dot{y}^r(D-1)\right]^{\top} \end{array}$$

represent the magnitudes of the D-point discrete Fourier transforms of the sources and the observation signal, respectively. Let \(\phi^r=\angle\mathcal{F}_{D}\bigl(\{X_1(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)-\angle\mathcal{F}_{D}\bigl(\{X_2(t)\}_{t=(r-1)M}^{(r-1)M+N-1}\bigr)=[\phi^r(0),\ldots,\phi^r(d),\ldots,\phi^r(D-1)]^{\top}\), where \(\angle\) denotes the phase operator. Given Eq. 2, the relation between the log magnitude of \(\dot{y}^r(d)\) and those of \(\dot{x}^r_i(d)\), \(i\in\{1,2\}\), is given by

$$ \begin{array}{ll} y^r(d)&=\log_{10}\dot{y}^r(d) \\ &=\frac{1}{2}\log_{10}\Bigl[ g_1^2\bigl(\dot{x}^r_1(d)\bigr)^2+g_2^2\bigl(\dot{x}^r_2(d)\bigr)^2 \\ & \qquad\qquad\, +2g_1g_2\dot{x}^r_1(d) \dot{x}^r_2(d)\cos\bigl(\mathrm{\phi}^r(d)\bigr)\Bigr]. \end{array}$$
(37)

The goal is to obtain the MMSE estimate of \(y^r(d)\) given \(x_1^r(d)\), \(x_2^r(d)\), \(g_1\), and \(g_2\). Mathematically, the MMSE estimator is expressed as

$$ \hat{y}^r(d)=E\Bigl(y^r(d)\big|x_1^r(d),x_2^r(d),g_1,g_2\Bigr). $$
(38)

From Eq. 37, and initially assuming that \(\dot{x}_1^r(d)\), \(\dot{x}_2^r(d)\), \(g_1\), and \(g_2\) are given, the only random variable on the right-hand side of Eq. 37 is \(\phi^r(d)\), which, as shown in [42], can be modeled by a uniform distribution over the interval \([-\pi,\pi]\); that is, \(p\bigl(\phi^r(d)\bigr)=\frac{1}{2\pi}\), where \(p\bigl(\phi^r(d)\bigr)\) denotes the PDF of \(\phi^r(d)\). Therefore,

$$\begin{array}{lll} \hat{y}^r(d)&=&\frac{1}{\pi}\int_0^{\pi} \frac{1}{2}\log_{10}\, \Biggl[ g_1^2\bigl(\dot{x}^r_1(d)\bigr)^2+g_2^2\bigl(\dot{x}^r_2(d)\bigr)^2\nonumber\\ && +2g_1g_2\dot{x}^r_1(d) \dot{x}^r_2(d)\cos\bigl(\mathrm{\phi}^r(d)\bigr)\Biggr]d\phi^r(d). \end{array}$$
(39)

The above integral is computed using an integration table (e.g., [63, p. 111]) and the result is

$$\begin{array}{rl} \hat{y}^r(d)&= \max \Bigl(\log_{10}\bigl( g_1\dot{x}_1^r(d)\bigr),\log_{10}\bigl(g_2\dot{x}_2^r(d)\bigr)\Bigr)\nonumber\\ d&=0,\ldots,D-1. \end{array}$$
(40)

Noting that \(\log_{10} g_1 = h(\theta)\), \(\log_{10} g_2 = h(-\theta)\), and \(\log_{10}\bigl(\dot{x}_i^r(d)\bigr)=x_i^r(d)\), \(i\in\{1,2\}\), Eq. 40 can be rewritten as

$$ \hat{y}^r(d)=\max \bigl(x_1^r(d)+h(\theta),x_2^r(d)+h(-\theta)\bigr) $$
(41)

which is identical to Eq. 10.
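As a quick numerical confirmation of the step from Eq. 39 to Eq. 40 (a sketch; the gains and magnitudes below are arbitrary test values, not quantities from the paper), the following Python snippet averages Eq. 37 over uniformly drawn phase differences and compares the result with the max expression:

```python
import numpy as np

# Sketch: Monte Carlo check that the expectation in Eq. 39 reduces to the
# max expression of Eqs. 40-41. All numeric values are hypothetical.
rng = np.random.default_rng(0)

g1, g2 = 1.8, 0.6          # source gains (test values)
x1_mag, x2_mag = 0.9, 1.4  # DFT magnitudes of the sources at bin d (test values)

phi = rng.uniform(-np.pi, np.pi, size=2_000_000)  # uniform phase difference
y = 0.5 * np.log10((g1 * x1_mag) ** 2 + (g2 * x2_mag) ** 2
                   + 2 * g1 * g2 * x1_mag * x2_mag * np.cos(phi))

mc = y.mean()
closed = max(np.log10(g1 * x1_mag), np.log10(g2 * x2_mag))
print(f"Monte Carlo E[y^r(d)] = {mc:.4f}, max form (Eq. 40) = {closed:.4f}")
```

The agreement reflects the classical identity \(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log_{10}\left(a^2+b^2+2ab\cos\phi\right)d\phi=2\log_{10}\max(a,b)\), which is the table integral used above.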


Cite this article

Radfar, M.H., Dansereau, R.M. & Chan, WY. Monaural Speech Separation Based on Gain Adapted Minimum Mean Square Error Estimation. J Sign Process Syst 61, 21–37 (2010). https://doi.org/10.1007/s11265-008-0274-7
