Abstract
It remains an interesting and challenging problem to synthesize a vivid, realistic singing face driven by music. In this paper, we present a method for this task that produces natural motion of the lips, facial expression, head pose, and eyes. Because common music audio couples the human voice with the backing music, we design a decouple-and-fuse strategy to tackle this challenge: we first decompose the input music audio into a human voice stream and a backing music stream. Since the correlation between these two streams and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme in which the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation into speed and direction, and eye state generation into short-term blinking and long-term eye closing, modeling each separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method synthesizes vivid singing faces that are qualitatively and quantitatively better than the prior state of the art.
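To make the decouple-and-fuse idea above concrete, the following minimal sketch fuses per-frame features of a separated vocal stream and a backing-music stream with cross-attention and maps the result to a per-frame motion code. It is an illustrative reconstruction only, not the authors' released implementation: the module name, feature dimensions, and output motion code are hypothetical, and the source-separation step (e.g., an off-the-shelf tool such as Spleeter) is assumed to have already produced the two feature streams.

```python
# Illustrative sketch of a two-stream "decouple-and-fuse" conditioning module.
# Assumes pre-extracted per-frame audio features (e.g., mel-spectrogram frames)
# for the vocal and backing-music streams. All names and sizes are hypothetical.
import torch
import torch.nn as nn


class TwoStreamFusion(nn.Module):
    def __init__(self, feat_dim=80, model_dim=256, n_heads=4, out_dim=64):
        super().__init__()
        # Separate encoders keep the two decoupled streams independent before fusion.
        self.vocal_enc = nn.Sequential(nn.Linear(feat_dim, model_dim), nn.ReLU())
        self.music_enc = nn.Sequential(nn.Linear(feat_dim, model_dim), nn.ReLU())
        # Cross-attention: vocal features act as queries, backing-music features
        # as keys/values, so the backing track modulates the voice-driven motion.
        self.cross_attn = nn.MultiheadAttention(model_dim, n_heads, batch_first=True)
        # Head predicting a per-frame motion code (e.g., expression/pose parameters).
        self.head = nn.Linear(model_dim, out_dim)

    def forward(self, vocal_feats, music_feats):
        # vocal_feats, music_feats: (batch, frames, feat_dim)
        q = self.vocal_enc(vocal_feats)
        kv = self.music_enc(music_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # fuse the two streams
        return self.head(fused + q)            # residual keeps the voice as the primary driver


if __name__ == "__main__":
    vocal = torch.randn(2, 100, 80)   # 2 clips, 100 frames, 80-dim features
    music = torch.randn(2, 100, 80)
    motion = TwoStreamFusion()(vocal, music)
    print(motion.shape)               # torch.Size([2, 100, 64])
```

Using the vocal stream as queries reflects the intuition that the voice primarily drives lip and expression motion while the backing track modulates it; the paper's actual attention scheme and feature extractors may differ.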
Availability of data and materials
The SingingFace dataset is publicly available at https://vcg.xmu.edu.cn/datasets/singingface/index.html.
Acknowledgements
This work was supported in part by grants from the National Key R&D Program of China (2021YFC3300403), National Natural Science Foundation of China (62072382), Yango Charitable Foundation, and the National Science Foundation (OAC-2007661).
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Pengfei Liu is currently pursuing a master's degree in the School of Informatics, Xiamen University, where he received his bachelor's degree in 2021. His research interests lie in the generation of digital content, especially virtual humans and video conferencing.
Wenjin Deng is currently a postgraduate in the School of Informatics, Xiamen University, where he received his bachelor's degree in 2020. His research interests include human pose estimation, face synthesis, and avatar animation.
Hengda Li is currently pursuing an M.S. degree in the School of Informatics, Xiamen University. He received his B.S. degree from the School of Computer and Data Science, Fuzhou University. His current research interests include face generation and face editing.
Jintai Wang is pursuing a master's degree at Xiamen University, where he received his B.Eng. degree in 2022. His current research interests include neural radiance fields and computer vision.
Yinglin Zheng is pursuing a master's degree in the School of Informatics, Xiamen University, where he received his bachelor's degree in 2020. His research interests lie in human-related computer vision, especially face understanding and synthesis.
Yiwei Ding is currently a postgraduate in the School of Informatics, Xiamen University. His research interests include human pose estimation, text to speech, and virtual humans.
Xiaohu Guo is a full professor of computer science at the University of Texas at Dallas. He received his Ph.D. degree in computer science from Stony Brook University and his B.S. degree in computer science from the University of Science and Technology of China. His research interests include computer graphics, computer vision, medical imaging, and VR/AR, with an emphasis on geometric modeling and processing, as well as body and face modeling problems. He received a prestigious NSF CAREER Award in 2012. For more information, please visit https://personal.utdallas.edu/~xguo/.
Ming Zeng is currently an associate professor in the School of Informatics, Xiamen University. He was a visiting researcher in the Visual Computing Group, Microsoft Research Asia (MSRA) in 2017 and 2009–2011. He received his Ph.D. degree from the State Key Laboratory of CAD&CG, Zhejiang University. His research interests include computer graphics and computer vision, especially in human-centered analysis, reconstruction, synthesis, and animation.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Liu, P., Deng, W., Li, H. et al. MusicFace: Music-driven expressive singing face synthesis. Comp. Visual Media 10, 119–136 (2024). https://doi.org/10.1007/s41095-023-0343-7