Computer Science > Multimedia
[Submitted on 5 Mar 2024 (v1), last revised 24 Aug 2024 (this version, v3)]
Title: MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model
Abstract: The body movements that accompany speech help speakers express their ideas, and co-speech motion generation is an important approach to synthesizing realistic avatars. Because of the intricate correspondence between speech and motion, generating realistic and diverse motion is challenging. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on a diffusion model that ensures both the authenticity and the diversity of generated motion. We propose a progressive fusion strategy that strengthens both inter-modal and intra-modal interaction, efficiently integrating multi-modal information. Specifically, we employ a masked style matrix built from emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, aiming to learn both inter-modal and intra-modal features. In addition, we propose a geometric loss that enforces coherence of joint velocities and accelerations across frames. Given input speech, and with editable identity and emotion, our framework generates vivid, diverse, and style-controllable motion of arbitrary length. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods on both upper-body and the more challenging full-body generation.
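The abstract does not specify how the masked style matrix is constructed, so the following is only a minimal sketch of one plausible reading: the outer product of one-hot emotion and identity codes yields a per-sample style matrix with a single active entry, and randomly masking it during training gives the model a style-agnostic fallback, which is what makes emotion and identity editable at inference time. The function name, shapes, and `drop_prob` value are all hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_style_matrix(emotion_id, identity_id, n_emotions, n_identities,
                        drop_prob=0.1):
    # emotion_id, identity_id: LongTensors of shape (batch,).
    # Outer product of one-hot codes -> (batch, n_emotions, n_identities),
    # with a single nonzero entry marking the requested style.
    e = F.one_hot(emotion_id, n_emotions).float()
    i = F.one_hot(identity_id, n_identities).float()
    m = torch.einsum('be,bi->bei', e, i)
    # Randomly zero the whole matrix for some samples (a classifier-free-
    # guidance-style dropout; illustrative, not confirmed by the paper).
    keep = (torch.rand(m.shape[0], 1, 1, device=m.device) > drop_prob).float()
    return m * keep

# Usage: style = masked_style_matrix(torch.tensor([2]), torch.tensor([5]), 8, 30)
```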
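Similarly, the geometric loss is only described as enforcing velocity and acceleration coherence across frames. A common way to realize this is with first- and second-order finite differences over joint positions; the sketch below assumes positions stored as (batch, frames, joints, 3) tensors, and the equal weighting of the two terms is illustrative rather than the paper's choice.

```python
import torch
import torch.nn.functional as F

def geometric_loss(pred, target):
    # pred, target: joint positions of shape (batch, frames, joints, 3).
    # First-order temporal difference approximates per-joint velocity.
    vel_pred = pred[:, 1:] - pred[:, :-1]
    vel_tgt = target[:, 1:] - target[:, :-1]
    # Second-order difference approximates acceleration.
    acc_pred = vel_pred[:, 1:] - vel_pred[:, :-1]
    acc_tgt = vel_tgt[:, 1:] - vel_tgt[:, :-1]
    # Matching both quantities discourages jitter between adjacent frames.
    return F.l1_loss(vel_pred, vel_tgt) + F.l1_loss(acc_pred, acc_tgt)
```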
Submission history
From: Sen Wang
[v1] Tue, 5 Mar 2024 12:13:18 UTC (9,011 KB)
[v2] Fri, 17 May 2024 08:55:54 UTC (4,868 KB)
[v3] Sat, 24 Aug 2024 00:29:50 UTC (6,705 KB)
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.