
See the Silence: Improving Visual-Only Voice Activity Detection by Optical Flow and RGB Fusion

  • Conference paper
Computer Vision Systems (ICVS 2021)

Abstract

In this work, we propose a novel approach to visual voice activity detection (VAD), an important component of audio-visual tasks such as speech enhancement. We focus on optimizing the visual component and propose a two-stream architecture that fuses optical flow and RGB data. Both streams are analyzed by long short-term memory (LSTM) modules to extract dynamic features. We show that this setup clearly improves over the same architecture without optical flow. Additionally, we show that focusing on the lower face area is superior to processing the whole face, or only the mouth region as is usually done. This choice also has practical advantages, since it simplifies data labeling. Our approach especially improves the true negative rate, which means we detect frames without speech more reliably: we see the silence.
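The two-stream design described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature dimensions, hidden sizes, random parameters, and concatenation-based fusion are all illustrative assumptions, and the per-frame RGB and optical-flow feature extractors are assumed to run beforehand.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16                    # hidden units per stream (illustrative)
D_RGB, D_FLOW = 64, 32    # per-frame feature dimensions (illustrative)

def init_lstm(d_in, h):
    """Random LSTM parameters: input weights, recurrent weights, bias."""
    return (rng.standard_normal((4 * h, d_in)) * 0.1,
            rng.standard_normal((4 * h, h)) * 0.1,
            np.zeros(4 * h))

def run_lstm(x_seq, W, U, b, h_dim):
    """Process a (T, d) feature sequence; return the final hidden state."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    for x in x_seq:
        i, f, o, g = np.split(W @ x + U @ h + b, 4)  # gate pre-activations
        c = sig(f) * c + sig(i) * np.tanh(g)         # update cell state
        h = sig(o) * np.tanh(c)                      # emit hidden state
    return h

def two_stream_vad(rgb_seq, flow_seq, p_rgb, p_flow, w_out, b_out):
    """Run each stream through its own LSTM, fuse by concatenation,
    and output a speech-presence probability for the clip."""
    h_rgb = run_lstm(rgb_seq, *p_rgb, H)
    h_flow = run_lstm(flow_seq, *p_flow, H)
    fused = np.concatenate([h_rgb, h_flow])
    return 1.0 / (1.0 + np.exp(-(w_out @ fused + b_out)))

# Example: a 25-frame clip of lower-face crops, already reduced to
# per-frame RGB and optical-flow feature vectors.
p_rgb, p_flow = init_lstm(D_RGB, H), init_lstm(D_FLOW, H)
w_out, b_out = rng.standard_normal(2 * H) * 0.1, 0.0
rgb = rng.standard_normal((25, D_RGB))
flow = rng.standard_normal((25, D_FLOW))
prob = two_stream_vad(rgb, flow, p_rgb, p_flow, w_out, b_out)
```

The point of the sketch is the structure, not the numbers: each modality keeps its own recurrent dynamics, and the decision is made only after both temporal summaries are fused.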

This work was supported by the Alliance of Hamburg Universities for Computer Science (ahoi.digital) as part of the Adaptive Crossmodal Sensor Data Acquisition (ACSDA) research project and by the German Research Foundation (DFG) in the Transregional project Crossmodal Learning (TRR 169).




Author information

Correspondence to Danu Caus.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Caus, D., Carbajal, G., Gerkmann, T., Frintrop, S. (2021). See the Silence: Improving Visual-Only Voice Activity Detection by Optical Flow and RGB Fusion. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science(), vol 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_4


  • DOI: https://doi.org/10.1007/978-3-030-87156-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87155-0

  • Online ISBN: 978-3-030-87156-7

  • eBook Packages: Computer Science; Computer Science (R0)
