Abstract
In this work, we propose a novel approach for visual voice activity detection (VAD), an important component of audio-visual tasks such as speech enhancement. We focus on optimizing the visual component and propose a two-stream approach based on optical flow and RGB data. Both streams are analyzed by long short-term memory (LSTM) modules to extract dynamic features. We show that this setup clearly outperforms an otherwise identical one without optical flow. Additionally, we show that focusing on the lower face area is superior to processing the whole face or, as is usually done, only the mouth region. This choice also has practical advantages, since it simplifies data labeling. Our approach especially improves the true negative rate, meaning we detect frames without speech more reliably: we see the silence.
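The two-stream design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoder layers, feature dimensions, and crop size (a 64×64 lower-face region) are assumptions, since the abstract only specifies that an RGB stream and an optical-flow stream are each processed by an LSTM before fusion and per-frame speech/non-speech classification.

```python
# Hypothetical sketch of a two-stream visual VAD model: RGB and optical-flow
# streams each pass through an LSTM; their dynamic features are concatenated
# and classified per frame as speech vs. silence.
import torch
import torch.nn as nn

class TwoStreamVAD(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # Simple linear stubs stand in for per-frame encoders; the paper's
        # exact encoders and dimensions are not given in the abstract.
        self.rgb_enc = nn.Linear(64 * 64 * 3, feat_dim)   # RGB crop: 3 channels
        self.flow_enc = nn.Linear(64 * 64 * 2, feat_dim)  # dense flow: 2 channels (dx, dy)
        self.rgb_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.flow_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)        # per-frame speech logit

    def forward(self, rgb, flow):
        # rgb:  (batch, time, 64*64*3) flattened lower-face RGB crops
        # flow: (batch, time, 64*64*2) flattened optical-flow fields
        r, _ = self.rgb_lstm(self.rgb_enc(rgb))
        f, _ = self.flow_lstm(self.flow_enc(flow))
        return self.classifier(torch.cat([r, f], dim=-1))  # (batch, time, 1)

model = TwoStreamVAD()
rgb = torch.randn(2, 10, 64 * 64 * 3)
flow = torch.randn(2, 10, 64 * 64 * 2)
logits = model(rgb, flow)
print(logits.shape)
```

Fusing by concatenation is one common choice; the late fusion keeps each stream's temporal dynamics separate until classification, which makes it easy to ablate the flow stream and measure its contribution.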
This work was supported by the Alliance of Hamburg Universities for Computer Science (ahoi.digital) as part of the Adaptive Crossmodal Sensor Data Acquisition (ACSDA) research project and by the German Research Foundation (DFG) in the Transregional project Crossmodal Learning (TRR 169).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Caus, D., Carbajal, G., Gerkmann, T., Frintrop, S. (2021). See the Silence: Improving Visual-Only Voice Activity Detection by Optical Flow and RGB Fusion. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds.) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science, vol. 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87155-0
Online ISBN: 978-3-030-87156-7