
MAVAR-SE: Multi-scale Audio-Visual Association Representation Network for End-to-End Speaker Extraction

  • Conference paper
  • First Online:
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)


Abstract

Speaker extraction, which separates a target speaker's speech from mixed audio, is an important problem in the field of speech separation. Since human pronunciation is closely tied to lip motion and facial expression during speaking, this paper focuses on lip motion and its relationship to pronunciation, and proposes a multi-scale audio-visual association representation network for end-to-end speaker extraction (MAVAR-SE). In addition, multi-scale feature extraction and skip connections are used to mitigate the information loss caused by convolution's lack of memory capacity. The method is not limited by the number of speakers in the mixture and requires no prior knowledge, such as speech features of the speakers involved, thereby achieving speaker-independent, multi-modal, time-domain speaker extraction. Compared with other recent methods on the VoxCeleb2 and LRS2 datasets, the proposed method achieves better results and robustness.
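
The abstract describes the approach only at a high level. As a rough illustration (not the authors' implementation), the PyTorch sketch below assembles the ingredients named above: parallel 1-D convolutional waveform encoders with several analysis window lengths (multi-scale feature extraction), a lip-motion feature stream fused with the audio representation, residual skip connections in the separator, and a time-domain decoder, so the whole pipeline operates end to end on waveforms. All module names, layer sizes, and the lip-embedding dimensionality are hypothetical choices made for readability.

```python
# Illustrative sketch only: the exact MAVAR-SE architecture is not reproduced
# here.  This toy model shows multi-scale time-domain audio encoding, a
# lip-motion (visual) stream, audio-visual fusion, skip connections, and a
# waveform decoder, with hypothetical layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAVExtractor(nn.Module):
    def __init__(self, n_channels=64, visual_dim=128):
        super().__init__()
        # Multi-scale 1-D conv encoders over the raw waveform
        # (short, medium, and long analysis windows).
        self.enc_short = nn.Conv1d(1, n_channels, kernel_size=20, stride=10)
        self.enc_mid = nn.Conv1d(1, n_channels, kernel_size=80, stride=10)
        self.enc_long = nn.Conv1d(1, n_channels, kernel_size=160, stride=10)
        # Visual stream: project per-frame lip-motion embeddings.
        self.visual_proj = nn.Conv1d(visual_dim, n_channels, kernel_size=1)
        # Audio-visual fusion followed by a dilated-conv separator,
        # with a skip (residual) connection around each block.
        self.fuse = nn.Conv1d(3 * n_channels + n_channels, n_channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(n_channels, n_channels, kernel_size=3,
                          padding=2 ** i, dilation=2 ** i),
                nn.PReLU(),
            )
            for i in range(4)
        )
        self.mask = nn.Conv1d(n_channels, n_channels, kernel_size=1)
        # Decoder back to the time domain (mirrors the short encoder).
        self.dec = nn.ConvTranspose1d(n_channels, 1, kernel_size=20, stride=10)

    def forward(self, mixture, lip_feats):
        # mixture:   (batch, samples)        raw mixed waveform
        # lip_feats: (batch, visual_dim, T)  precomputed lip-motion embeddings
        x = mixture.unsqueeze(1)
        e1 = F.relu(self.enc_short(x))
        frames = e1.shape[-1]
        # Align the coarser scales and the visual stream to e1's frame rate.
        e2 = F.interpolate(F.relu(self.enc_mid(x)), size=frames)
        e3 = F.interpolate(F.relu(self.enc_long(x)), size=frames)
        v = F.interpolate(self.visual_proj(lip_feats), size=frames)
        h = self.fuse(torch.cat([e1, e2, e3, v], dim=1))
        for block in self.blocks:
            h = h + block(h)                 # skip connection
        m = torch.sigmoid(self.mask(h))      # target-speaker mask
        return self.dec(m * e1).squeeze(1)   # estimated target waveform


if __name__ == "__main__":
    model = MultiScaleAVExtractor()
    mix = torch.randn(2, 16000)    # 1 s of 16 kHz mixed audio
    lips = torch.randn(2, 128, 25)  # 25 video frames of lip features
    print(model(mix, lips).shape)   # (2, 16000)
```

In such a sketch, the residual connections pass earlier representations forward unchanged through each dilated convolutional block, which is the usual remedy for the information loss the abstract attributes to convolution's lack of memory.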



Author information

Corresponding author

Correspondence to Chenhui Yang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, S., Yang, C. (2024). MAVAR-SE: Multi-scale Audio-Visual Association Representation Network for End-to-End Speaker Extraction. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_17


  • DOI: https://doi.org/10.1007/978-3-031-53308-2_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53307-5

  • Online ISBN: 978-3-031-53308-2

  • eBook Packages: Computer Science, Computer Science (R0)
