Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen † (†: corresponding author)

Introduction Video

Release

[2024/07/18] We released the HuggingFace space, code, and checkpoints.
[2024/07/22] We released the Colab demo.

TODO

Support Colab

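Key Features

Uni-directional temporal attention with a warmup mechanism
Multi-timestep KV-cache for temporal attention during inference
Depth prior for better structural consistency
Compatible with DreamBooth and LoRA for a variety of styles
TensorRT support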
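
To make the first feature concrete, here is a minimal sketch of what a uni-directional attention mask with warmup could look like. It assumes that the first `warmup` frames attend to each other in both directions and that every later frame attends to the warmup frames plus a short window of recent frames, never to a future frame. The function name, the `window` parameter, and the sizes are illustrative assumptions, not the repository's API.

```python
def unidirectional_mask(num_frames, warmup, window):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumed scheme (illustrative only):
      - warmup frames attend to all warmup frames (bidirectional);
      - later frames attend to the warmup frames and to the last
        `window` frames up to and including themselves.
    """
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(num_frames):
            if i < warmup:
                mask[i][j] = j < warmup
            else:
                mask[i][j] = j < warmup or (i - window < j <= i)
    return mask


if __name__ == "__main__":
    # Print the mask as rows of 1/0 for a tiny hypothetical stream.
    for row in unidirectional_mask(num_frames=6, warmup=2, window=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because no frame after the warmup phase depends on a future frame, keys and values computed for earlier frames can be cached and reused, which is what makes streaming inference possible.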

Speed was evaluated on Ubuntu 20.04.6 LTS with PyTorch 2.2.2, using an RTX 4090 GPU and an Intel(R) Xeon(R) Platinum 8352V CPU. The number of denoising steps was set to 2.

Resolution	TensorRT	FPS
512 x 512	On	16.43
512 x 512	Off	6.91
768 x 512	On	12.15
768 x 512	Off	6.29
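
The table can be turned into a speedup factor and a per-frame latency with a few lines of arithmetic. The sketch below only reuses the FPS figures reported above; the helper name `summarize` is made up for illustration.

```python
# FPS figures copied from the table above:
# (resolution, FPS with TensorRT, FPS without TensorRT)
RESULTS = [
    ("512 x 512", 16.43, 6.91),
    ("768 x 512", 12.15, 6.29),
]


def summarize(fps_on, fps_off):
    """Return the TensorRT speedup and the per-frame latency in milliseconds."""
    return {
        "speedup": fps_on / fps_off,
        "ms_on": 1000.0 / fps_on,
        "ms_off": 1000.0 / fps_off,
    }


if __name__ == "__main__":
    for resolution, fps_on, fps_off in RESULTS:
        s = summarize(fps_on, fps_off)
        print(
            f"{resolution}: {s['speedup']:.2f}x speedup, "
            f"{s['ms_on']:.1f} ms vs {s['ms_off']:.1f} ms per frame"
        )
```

On the reported numbers, TensorRT gives roughly a 2.4x speedup at 512 x 512 and roughly 1.9x at 768 x 512.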

Installation

Step 0: Clone this repository and its submodules

git clone https://github.com/open-mmlab/Live2Diff.git
# or via SSH
git clone git@github.com:open-mmlab/Live2Diff.git

cd Live2Diff
git submodule update --init --recursive

Step 1: Create the environment

Create a virtual environment with conda:

conda create -n live2diff python=3.10
conda activate live2diff

Step 2: Install PyTorch and xformers

Select the version that matches your system.

# CUDA 11.8
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu121

For more details, see https://pytorch.org/.
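
After this step, it can be useful to confirm that the packages are importable and that the installed PyTorch build matches your CUDA version. The following is a small sanity-check sketch, not part of the repository; the function names are made up for illustration, and it only relies on `importlib` plus the standard `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` attributes.

```python
import importlib.util


def check_packages(names=("torch", "torchvision", "xformers")):
    """Map each package name to True if it can be located in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def cuda_summary():
    """Describe the installed PyTorch build, or return None if PyTorch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # e.g. "11.8" or "12.1"
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(check_packages())
    print(cuda_summary())
```

If `built_with_cuda` is `None` or `gpu_available` is `False`, the wheel was most likely installed from the wrong index URL for your system.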

Step 3: Install the project

To use TensorRT acceleration (recommended), install with one of the following commands:

# for CUDA 11.x
pip install ."[tensorrt_cu11]"
# for CUDA 12.x
pip install ."[tensorrt_cu12]"

Otherwise, install with:

pip install .

To install in development mode (an "editable install"), add the -e option.

# for CUDA 11.x
pip install -e ."[tensorrt_cu11]"
# for CUDA 12.x
pip install -e ."[tensorrt_cu12]"
# or, without TensorRT
pip install -e .

Step 4: Download checkpoints and demo data

Download StableDiffusion-v1-5:

huggingface-cli download runwayml/stable-diffusion-v1-5 --local-dir ./models/Model/stable-diffusion-v1-5

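Download the checkpoint from HuggingFace and put it in the models folder.
Download the depth detector from the official MiDaS release and put it in the models folder.
Apply for a download token from civitAI, then download the DreamBooth and LoRA models with the script: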

# download all DreamBooth/LoRA models
bash scripts/download.sh all YOUR_TOKEN
# or download only the model you want to use
bash scripts/download.sh disney YOUR_TOKEN

Download the demo data from OneDrive.

The models and data folders should then be organized as follows:

./
|-- models
|   |-- LoRA
|   |   |-- MoXinV1.safetensors
|   |   `-- ...
|   |-- Model
|   |   |-- 3Guofeng3_v34.safetensors
|   |   |-- ...
|   |   `-- stable-diffusion-v1-5
|   |-- live2diff.ckpt
|   `-- dpt_hybrid_384.pt
`-- data
    |-- 1.mp4
    |-- 2.mp4
    |-- 3.mp4
    `-- 4.mp4
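
Before running anything, the layout can be checked with a short script. This is a sketch rather than a tool shipped with the repository: the `EXPECTED` list is taken from the tree above, and `missing_paths` is a hypothetical helper name.

```python
from pathlib import Path

# Paths taken from the directory tree above. LoRA and DreamBooth file names
# depend on which models you downloaded, so only the folders are checked.
EXPECTED = [
    "models/live2diff.ckpt",
    "models/dpt_hybrid_384.pt",
    "models/Model/stable-diffusion-v1-5",
    "models/LoRA",
    "data",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files and folders are in place.")
```

Run it from the repository root; an empty result means the downloads landed where the tree above expects them.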

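Notice

The installation steps above (e.g., the download script) target Linux and have not been well tested on Windows. If you run into any difficulty, feel free to open an issue 🤗.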

Quick Start

You can try the examples in the data directory. For example:

# with TensorRT acceleration; please be patient on the first run, which may take more than 20 minutes
python test.py ./data/1.mp4 ./configs/disneyPixar.yaml --max-frames -1 --prompt "1man is talking" --output work_dirs/1-disneyPixar.mp4 --height 512 --width 512 --acceleration tensorrt

# without TensorRT acceleration
python test.py ./data/2.mp4 ./configs/disneyPixar.yaml --max-frames -1 --prompt "1man is talking" --output work_dirs/2-disneyPixar.mp4 --height 512 --width 512 --acceleration none

You can adjust the denoising strength via --num-inference-steps, --strength, and --t-index-list. See test.py for details.
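
To run the quick-start command over all four demo videos, the command line can be assembled in Python. The sketch below uses only the flags that appear in the commands above; `build_command` is a hypothetical helper, the shared prompt is a placeholder you will probably want to change per video, and the denoising flags are left out because their value formats are documented in test.py rather than here.

```python
import shlex


def build_command(video, config, output, prompt,
                  height=512, width=512, acceleration="none", max_frames=-1):
    """Assemble the test.py command line as an argument list."""
    return [
        "python", "test.py", video, config,
        "--max-frames", str(max_frames),
        "--prompt", prompt,
        "--output", output,
        "--height", str(height),
        "--width", str(width),
        "--acceleration", acceleration,
    ]


if __name__ == "__main__":
    for i in range(1, 5):
        cmd = build_command(
            video=f"./data/{i}.mp4",
            config="./configs/disneyPixar.yaml",
            output=f"work_dirs/{i}-disneyPixar.mp4",
            prompt="1man is talking",  # placeholder prompt
        )
        # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Giving each input its own output file avoids overwriting earlier results.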

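Troubleshooting

If you hit a CUDA out-of-memory error with TensorRT, try reducing t-index-list or strength. When running inference with TensorRT, we maintain a set of buffers for the KV-cache, which consumes additional memory. Reducing t-index-list or strength shrinks the KV-cache and therefore saves GPU memory.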
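
As a rough illustration of why fewer denoising timesteps save memory, the sketch below estimates the size of a KV-cache that stores keys and values for every cached timestep. It is a simplified model: the formula assumes identical attention layers, and every number in the example is a hypothetical placeholder, not a value measured from Live2Diff.

```python
def kv_cache_bytes(num_timesteps, num_layers, window, tokens, channels,
                   bytes_per_value=2):
    """Estimate the KV-cache size in bytes under a simplified model.

    Assumes every temporal-attention layer caches one key and one value
    tensor of shape (window, tokens, channels) per denoising timestep.
    bytes_per_value=2 corresponds to fp16.
    """
    tensors_per_layer = 2  # one key tensor and one value tensor
    return (tensors_per_layer * num_timesteps * num_layers
            * window * tokens * channels * bytes_per_value)


if __name__ == "__main__":
    # All sizes below are made-up placeholders for illustration.
    shape = dict(num_layers=16, window=16, tokens=64 * 64, channels=320)
    for steps in (4, 2, 1):
        size_mib = kv_cache_bytes(num_timesteps=steps, **shape) / 2**20
        print(f"{steps} timestep(s): {size_mib:.0f} MiB")
```

Under this model the cache grows linearly with the number of cached timesteps, so halving the number of timesteps halves the cache.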

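Real-Time Video2Video Demo

An interactive video-to-video demo is available in the demo directory!

See demo/README.md for more details.

Human face (webcam input)	Anime character (screen video input)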

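Acknowledgements

The video and image demos in this GitHub repository were generated using LCM-LoRA. Stream batch from StreamDiffusion is used for model acceleration. The design of the video diffusion model is adopted from AnimateDiff. We use a third-party implementation of MiDaS that supports ONNX export. Our online demo is modified from Real-Time-Latent-Consistency-Model.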

BibTeX

If you find this work helpful, please consider citing it:

@article{xing2024live2diff,
  title={Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models},
  author={Zhening Xing and Gereon Fox and Yanhong Zeng and Xingang Pan and Mohamed Elgharib and Christian Theobalt and Kai Chen},
  journal={arXiv preprint arXiv:2407.08701},
  year={2024}
}