This is the official code for Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation.
Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations.
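As a rough illustration of the implicit classifier-free guidance mentioned above (not the repository's actual implementation; the model signature, conditioning argument, and guidance scale `w` are hypothetical), one guided noise prediction might look like:

```python
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, audio_cond, w=1.0):
    """Classifier-free guidance: blend conditional and unconditional
    noise predictions. The model call and `w` are illustrative only."""
    eps_cond = model(x_t, t, cond=audio_cond)   # audio-conditioned prediction
    eps_uncond = model(x_t, t, cond=None)       # null-conditioned prediction
    # Larger w pushes samples toward the audio-conditioned mode (quality),
    # w = 0 recovers the purely conditional, more diverse behaviour.
    return (1 + w) * eps_cond - w * eps_uncond
```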
<img src='./misc/overview.jpg' width=800>

Clone this repository and install packages:
git clone https://github.com/Advocate99/DiffGesture.git
pip install -r requirements.txt
Download the pretrained fastText model from here and put `crawl-300d-2M-subword.bin` and `crawl-300d-2M-subword.vec` at `data/fasttext/`.
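To sanity-check the download, the embeddings can be loaded with the `fasttext` Python package (a minimal sketch; the training scripts themselves may load the vectors differently):

```python
import fasttext

# Load the subword-aware binary model placed at data/fasttext/.
model = fasttext.load_model("data/fasttext/crawl-300d-2M-subword.bin")
vec = model.get_word_vector("gesture")  # 300-dimensional word embedding
print(vec.shape)                        # (300,)
```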
Download the autoencoders used for FGD (Fréchet Gesture Distance), which include the following:
For the TED Gesture Dataset, we use the pretrained Auto-Encoder model provided by Yoon et al. for better reproducibility, i.e., the checkpoint in the train_h36m_gesture_autoencoder folder.
For the TED Expressive Dataset, the pretrained Auto-Encoder model is provided here.
Save the models as `output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin` for TED Gesture, and `output/TED_Expressive_output/AE-cos1e-3/checkpoint_best.bin` for TED Expressive.
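FGD is computed like FID: the Fréchet distance between Gaussians fitted to the auto-encoder latent features of real and generated gesture clips. A minimal sketch of the distance itself (feature extraction with the checkpoints above is omitted; `real_feats` and `gen_feats` are assumed to be N×D NumPy arrays):

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu1, sigma1 = real_feats.mean(0), np.cov(real_feats, rowvar=False)
    mu2, sigma2 = gen_feats.mean(0), np.cov(gen_feats, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```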
Refer to HA2G to download the two datasets.
The pretrained models can be found here.
While the test metrics may vary slightly across runs, training with the given config files generally yields comparable results and consistently outperforms all the comparison methods.
python scripts/train_ted.py --config=config/pose_diffusion_ted.yml
python scripts/train_expressive.py --config=config/pose_diffusion_expressive.yml
# synthesize short videos
python scripts/test_ted.py short
python scripts/test_expressive.py short
# synthesize long videos
python scripts/test_ted.py long
python scripts/test_expressive.py long
# metrics evaluation
python scripts/test_ted.py eval
python scripts/test_expressive.py eval
If you find our work useful, please kindly cite as:
@inproceedings{zhu2023taming,
title={Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation},
author={Zhu, Lingting and Liu, Xian and Liu, Xuanyu and Qian, Rui and Liu, Ziwei and Yu, Lequan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10544--10553},
year={2023}
}
If you are interested in Audio-Driven Co-Speech Gesture Generation, we also recommend checking out our other related works, such as HA2G.