Abstract
In this work, we propose a novel approach for visual voice activity detection (VAD), an important component of audio-visual tasks such as speech enhancement. We focus on optimizing the visual component and propose a two-stream approach based on optical flow and RGB data. Both streams are analyzed by long short-term memory (LSTM) modules to extract dynamic features. We show that this setup clearly outperforms an otherwise identical one without optical flow. Additionally, we show that focusing on the lower face area is superior to processing the whole face or, as is usually done, only the mouth region. This choice also has practical advantages, since it simplifies data labeling. Our approach especially improves the true negative rate, meaning we detect frames without speech more reliably: we see the silence.
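The two-stream design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoder layers, feature dimensions, and crop size (a 64×64 lower-face region) are assumptions, since the abstract only specifies that an RGB stream and an optical-flow stream are each processed by an LSTM before fusion and per-frame speech/non-speech classification.

```python
# Hypothetical sketch of a two-stream visual VAD model: RGB and optical-flow
# streams each pass through an LSTM; their dynamic features are concatenated
# and classified per frame as speech vs. silence.
import torch
import torch.nn as nn

class TwoStreamVAD(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # Simple linear stubs stand in for per-frame encoders; the paper's
        # exact encoders and dimensions are not given in the abstract.
        self.rgb_enc = nn.Linear(64 * 64 * 3, feat_dim)   # RGB crop: 3 channels
        self.flow_enc = nn.Linear(64 * 64 * 2, feat_dim)  # dense flow: 2 channels (dx, dy)
        self.rgb_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.flow_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)        # per-frame speech logit

    def forward(self, rgb, flow):
        # rgb:  (batch, time, 64*64*3) flattened lower-face RGB crops
        # flow: (batch, time, 64*64*2) flattened optical-flow fields
        r, _ = self.rgb_lstm(self.rgb_enc(rgb))
        f, _ = self.flow_lstm(self.flow_enc(flow))
        return self.classifier(torch.cat([r, f], dim=-1))  # (batch, time, 1)

model = TwoStreamVAD()
rgb = torch.randn(2, 10, 64 * 64 * 3)
flow = torch.randn(2, 10, 64 * 64 * 2)
logits = model(rgb, flow)
print(logits.shape)
```

Fusing by concatenation is one common choice; the late fusion keeps each stream's temporal dynamics separate until classification, which makes it easy to ablate the flow stream and measure its contribution.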
This work was supported by the Alliance of Hamburg Universities for Computer Science (ahoi.digital) as part of the Adaptive Crossmodal Sensor Data Acquisition (ACSDA) research project and by the German Research Foundation (DFG) in the Transregional project Crossmodal Learning (TRR 169).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Caus, D., Carbajal, G., Gerkmann, T., Frintrop, S. (2021). See the Silence: Improving Visual-Only Voice Activity Detection by Optical Flow and RGB Fusion. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds.) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science, vol. 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87155-0
Online ISBN: 978-3-030-87156-7