Abstract
This paper presents a front-end enhancement system for automatic speech recognition (ASR) that addresses the cocktail party problem, i.e., recognizing a target speaker when multiple speakers talk simultaneously in noisy real-world environments. Building on conventional techniques, we propose a new framework that integrates blind source separation (BSS) with a minimum variance distortionless response (MVDR) beamformer for the speech enhancement and source separation tasks of the recent CHiME-5 challenge. In our experiments, we found that the time–frequency (T–F) mask estimation strategy based on the BSS algorithm should differ between speech enhancement and source separation; the key difference is whether background noise must be modeled as an additional class during T–F mask estimation. Experimental results show that the proposed framework substantially improves speech recognition performance on the single-array track of CHiME-5: by improving only the front-end speech enhancement, we obtain a 13.5% relative word error rate (WER) reduction over the official baseline system.
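To make the mask-based MVDR step concrete, the following is a minimal NumPy sketch of how estimated target and noise T–F masks can be turned into per-frequency beamforming weights. The function name, array shapes, and the use of the principal eigenvector of the target spatial covariance as the steering vector are illustrative assumptions, not the authors' implementation; the BSS-based mask estimator itself is assumed to be given.

# Illustrative mask-based MVDR beamformer (a sketch, not the paper's code).
import numpy as np

def mvdr_weights(stft, target_mask, noise_mask, eps=1e-8):
    # stft:        (F, T, M) complex multichannel STFT
    # target_mask: (F, T) T-F mask for the target speaker
    # noise_mask:  (F, T) T-F mask for background noise (and, for source
    #              separation, interfering speakers)
    F, T, M = stft.shape
    weights = np.zeros((F, M), dtype=complex)
    for f in range(F):
        X = stft[f]  # (T, M) frames for this frequency bin
        # Mask-weighted spatial covariance matrices of target and noise
        Rs = (target_mask[f, :, None, None]
              * X[:, :, None] * X[:, None, :].conj()).sum(axis=0)
        Rn = (noise_mask[f, :, None, None]
              * X[:, :, None] * X[:, None, :].conj()).sum(axis=0)
        Rn += eps * np.eye(M)  # diagonal loading for numerical stability
        # Steering vector: principal eigenvector of the target covariance
        d = np.linalg.eigh(Rs)[1][:, -1]
        Rn_inv_d = np.linalg.solve(Rn, d)
        # MVDR solution: w = R_n^{-1} d / (d^H R_n^{-1} d)
        weights[f] = Rn_inv_d / (d.conj() @ Rn_inv_d + eps)
    return weights

# The enhanced single-channel STFT is Y[f, t] = w[f]^H x[f, t]:
# enhanced = np.einsum('fm,ftm->ft', weights.conj(), stft)

Whether interfering speakers are absorbed into the noise mask (enhancement) or modeled as separate classes with their own masks (separation) is exactly the design choice the abstract highlights.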
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Project 61701306.
Cite this article
He, R., Long, Y., Li, Y. et al. Mask-based blind source separation and MVDR beamforming in ASR. Int J Speech Technol 23, 133–140 (2020). https://doi.org/10.1007/s10772-019-09666-x