Abstract
Speaker extraction, which separates the target speaker's speech from a mixture, is an important problem in the speech separation field. Because human pronunciation is closely coupled with lip motions and facial expressions during speaking, this paper focuses on the relationship between lip motions and pronunciation and proposes a multi-scale audio-visual association representation network for end-to-end speaker extraction (MAVAR-SE). Multi-scale feature extraction and skip connections are used to mitigate the information loss caused by the limited memory of convolution. The method is not limited by the number of speakers in the mixture and requires no prior knowledge, such as enrolled speech features of the target speaker, thereby realizing speaker-independent, multi-modal, time-domain speaker extraction. Compared with other recent methods on the VoxCeleb2 and LRS2 datasets, the proposed method shows better results and robustness.
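To make the abstract's main ingredients concrete, the following is a minimal PyTorch sketch of a multi-scale time-domain encoder over the raw mixture waveform and a fusion layer with a skip connection that reinjects the audio encoding after the lip-motion embedding is merged in. Layer widths, kernel sizes, and the fusion scheme are illustrative assumptions, not the paper's exact MAVAR-SE configuration.

    # Illustrative sketch only: multi-scale waveform encoding + audio-visual
    # fusion with a skip connection. Hyperparameters are assumed, not from the paper.
    import torch
    import torch.nn as nn

    class MultiScaleEncoder(nn.Module):
        """Encode a waveform at several temporal scales and concatenate the feature maps."""
        def __init__(self, n_filters=64, kernel_sizes=(20, 80, 160), stride=10):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(1, n_filters, k, stride=stride, padding=k // 2)
                for k in kernel_sizes
            )

        def forward(self, wav):                          # wav: (B, 1, T)
            feats = [torch.relu(conv(wav)) for conv in self.convs]
            # Trim to the shortest scale so the maps can be concatenated along channels.
            t = min(f.shape[-1] for f in feats)
            return torch.cat([f[..., :t] for f in feats], dim=1)

    class FusionWithSkip(nn.Module):
        """Fuse audio and lip-motion features; the skip path keeps the unmodified
        audio encoding available to later layers, countering information loss."""
        def __init__(self, audio_dim, visual_dim, hidden=256):
            super().__init__()
            self.proj = nn.Conv1d(audio_dim + visual_dim, hidden, 1)
            self.skip = nn.Conv1d(audio_dim, hidden, 1)

        def forward(self, audio_feat, visual_feat):
            # Upsample the visual (lip-motion) embedding to the audio frame rate.
            visual_feat = nn.functional.interpolate(
                visual_feat, size=audio_feat.shape[-1], mode="nearest")
            fused = self.proj(torch.cat([audio_feat, visual_feat], dim=1))
            return torch.relu(fused) + self.skip(audio_feat)   # skip connection

    # Shape check with dummy tensors (1 s of 8 kHz audio; 25 fps lip embedding).
    enc = MultiScaleEncoder()
    fuse = FusionWithSkip(audio_dim=3 * 64, visual_dim=512)
    a = enc(torch.randn(2, 1, 8000))
    v = torch.randn(2, 512, 25)
    print(fuse(a, v).shape)

In a full extraction pipeline, the fused representation would be passed to a mask estimator and a time-domain decoder to reconstruct the target waveform; those stages are omitted here.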