MotionDirector - Motion Customization for Text-to-Video Models

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Rui Zhao · Yuchao Gu · Jay Zhangjie Wu · David Junhao Zhang · Jia-Wei Liu · Weijia Wu · Jussi Keppo · Mike Zheng Shou

Show Lab, National University of Singapore | Zhejiang University

MotionDirector can customize text-to-video diffusion models to generate videos with desired motions.

Task Definition

Motion Customization of Text-to-Video Diffusion Models:
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate diverse videos with this motion.

Demos

Customize both Appearance and Motion:

Reference images or videos	Videos generated by MotionDirector

Reference images for appearance customization: "A Terracotta Warrior on a pure color background."	"A Terracotta Warrior is riding a horse through an ancient battlefield." seed: 1455028	"A Terracotta Warrior is playing golf in front of the Great Wall." seed: 5804477	"A Terracotta Warrior is walking cross the ancient army captured with a reverse follow cinematic shot." seed: 653658

Reference videos for motion customization: "A person is riding a bicycle."	"A Terracotta Warrior is riding a bicycle past an ancient Chinese palace." seed: 166357	"A Terracotta Warrior is lifting weights in front of the Great Wall." seed: 5635982	"A Terracotta Warrior is skateboarding." seed: 9033688

News

[2024.02.03] MotionDirector for AnimateDiff is available. Thanks to ExponentialML.
[2023.12.27] MotionDirector with Customized Appearance released. Now, you can customize both appearance and motion in video generation.
[2023.12.27] MotionDirector for Image Animation released.
[2023.12.23] MotionDirector has been featured in Hugging Face's 'Spaces of the Week 🔥' trending list!
[2023.12.13] Online Gradio demo released on Hugging Face Spaces! Give it a try.
[2023.12.06] MotionDirector for Sports released! Lifting weights, riding a horse, playing golf, etc.
[2023.12.05] Colab demo is available. Thanks to Camenduru.
[2023.12.04] MotionDirector for Cinematic Shots released. Now, you can make AI films with professional cinematic shots!
[2023.12.02] Code and model weights released!
[2023.10.12] Paper and project page released.

ToDo

Gradio Demo
More trained weights of MotionDirector

Model List

Type	Training Data	Descriptions	Link
MotionDirector for Sports	Multiple videos for each model.	Learns sports motion concepts, e.g. lifting weights, riding a horse, playing golf.	Link
MotionDirector for Cinematic Shots	A single video for each model.	Learns cinematic-shot motion concepts, e.g. dolly zoom, zoom in, zoom out.	Link
MotionDirector for Image Animation	A single image for the spatial path, and a single video or multiple videos for the temporal path.	Animates the given image with learned motions.	Link
MotionDirector with Customized Appearance	A single image or multiple images for the spatial path, and a single video or multiple videos for the temporal path.	Customizes both appearance and motion in video generation.	Link

Setup

Requirements

# create virtual environment
conda create -n motiondirector python=3.8
conda activate motiondirector
# install packages
pip install -r requirements.txt

Weights of Foundation Models

git lfs install
## You can choose the ModelScopeT2V or ZeroScope, etc., as the foundation model.
## ZeroScope
git clone https://huggingface.co/cerspense/zeroscope_v2_576w ./models/zeroscope_v2_576w/
## ModelScopeT2V
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope/

Weights of trained MotionDirector

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ruizhaocv/MotionDirector_weights ./outputs

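# More and better-trained MotionDirector weights are available in a new repo.
# Note: git clone needs an empty target directory, so use a different folder if ./outputs already exists.
git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs
# Its usage differs slightly; instructions will be updated later.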

Usage

Training

Train MotionDirector on multiple videos:

python MotionDirector_train.py --config ./configs/config_multi_videos.yaml

Train MotionDirector on a single video:

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Note:

Before running either command above, replace the paths to the foundation model weights and the training data with your own in the config file config_multi_videos.yaml or config_single_video.yaml.
Training on multiple 16-frame videos usually takes 300~500 steps, about 9~16 minutes on one A5000 GPU. Training on a single video takes 50~150 steps, about 1.5~4.5 minutes on one A5000 GPU. Training requires around 14GB of VRAM.
Reduce n_sample_frames if your GPU memory is limited.
Reduce the learning rate and increase the training steps for better performance.
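
The step counts and timings quoted in the notes above imply a roughly constant cost per training step. The following is a minimal sketch of that arithmetic, which may help when budgeting a longer run. The helper names `seconds_per_step` and `estimate_minutes` are hypothetical and not part of the repository, and the rate is derived only from the A5000 figures above, so it will likely differ on other GPUs or frame counts.

```python
# (steps, minutes) pairs taken from the training notes above (one A5000 GPU).
REPORTED = [(300, 9.0), (500, 16.0), (50, 1.5), (150, 4.5)]


def seconds_per_step(steps, minutes):
    """Average wall-clock seconds per training step."""
    return minutes * 60.0 / steps


def estimate_minutes(steps, sec_per_step):
    """Hypothetical helper: linear extrapolation of total training time."""
    return steps * sec_per_step / 60.0


rates = [seconds_per_step(s, m) for s, m in REPORTED]
low, high = min(rates), max(rates)
print(f"Implied cost: {low:.2f}-{high:.2f} s/step")

# Rough budget for a longer run, e.g. after lowering the learning rate.
print(f"1000 steps: about {estimate_minutes(1000, low):.0f}-"
      f"{estimate_minutes(1000, high):.0f} minutes")
```

This assumes time scales linearly with step count, which ignores model loading and checkpoint-saving overhead.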

Inference

python MotionDirector_inference.py --model /path/to/the/foundation/model  --prompt "Your prompt" --checkpoint_folder /path/to/the/trained/MotionDirector --checkpoint_index 300 --noise_prior 0.

Note:

Replace /path/to/the/foundation/model with your own path to the foundation model, like ZeroScope.
checkpoint_index selects which saved checkpoint to load, identified by the training step at which it was saved.
noise_prior controls how strongly the inversion noise of the reference video influences generation. We recommend 0 for MotionDirector trained on multiple videos, which gives the most diverse generations, and 0.1~0.5 for MotionDirector trained on a single video, for faster convergence and better alignment with the reference video.
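
The noise_prior recommendation above can be captured in a small wrapper that assembles the inference command. This is a sketch only: `build_inference_cmd` is a hypothetical helper, not part of the repository, and it uses only flags that appear in this README. The single-video default of 0.5 is one value from the recommended 0.1~0.5 range, chosen here as an assumption.

```python
import shlex


def build_inference_cmd(model, prompt, checkpoint_folder, checkpoint_index,
                        single_video, seed=None):
    """Hypothetical helper: build the argv list for MotionDirector_inference.py."""
    # 0 for multi-video training (most diverse output); 0.5 is an assumed
    # pick from the recommended 0.1~0.5 range for single-video training.
    noise_prior = 0.5 if single_video else 0.0
    cmd = [
        "python", "MotionDirector_inference.py",
        "--model", model,
        "--prompt", prompt,
        "--checkpoint_folder", checkpoint_folder,
        "--checkpoint_index", str(checkpoint_index),
        "--noise_prior", str(noise_prior),
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_inference_cmd(
    model="/path/to/the/foundation/model",
    prompt="Your prompt",
    checkpoint_folder="/path/to/the/trained/MotionDirector",
    checkpoint_index=300,
    single_video=False,
)
# Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list avoids quoting problems with prompts that contain spaces or punctuation.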

Inference with pre-trained MotionDirector

All available weights are in the official Hugging Face repo. Run the download command to fetch the weights into the outputs folder, then run the following inference commands to generate videos.

MotionDirector trained on multiple videos:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A person is riding a bicycle past the Eiffel Tower." --checkpoint_folder ./outputs/train/riding_bicycle/ --checkpoint_index 300 --noise_prior 0. --seed 7192280

Note:

Replace /path/to/the/ZeroScope with your own path to the foundation model, e.g. ZeroScope.
Change the prompt to generate different videos.
The seed is set to a random value by default. Setting it to a specific value reproduces a particular result, such as those listed in the table below.

Results:

Reference Videos	Videos Generated by MotionDirector

"A person is riding a bicycle."	"A person is riding a bicycle past the Eiffel Tower." seed: 7192280	"A panda is riding a bicycle in a garden." seed: ~~2178639~~	"An alien is riding a bicycle on Mars." seed: 2390886
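
To regenerate several of the rows above in one go, the prompt and seed pairs can be looped over. The sketch below only prints the commands; the `commands` generator and the `RUNS` list are hypothetical names introduced for illustration. The panda row is left out because its seed is struck through in the table.

```python
import shlex

# Prompt/seed pairs copied from the results table above.
RUNS = [
    ("A person is riding a bicycle past the Eiffel Tower.", 7192280),
    ("An alien is riding a bicycle on Mars.", 2390886),
]


def commands(model_path, runs):
    """Hypothetical helper: yield one inference argv list per (prompt, seed)."""
    for prompt, seed in runs:
        yield [
            "python", "MotionDirector_inference.py",
            "--model", model_path,
            "--prompt", prompt,
            "--checkpoint_folder", "./outputs/train/riding_bicycle/",
            "--checkpoint_index", "300",
            "--noise_prior", "0.",
            "--seed", str(seed),
        ]


cmds = list(commands("/path/to/the/ZeroScope", RUNS))
for cmd in cmds:
    # Shell-safe rendering; to execute, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```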

MotionDirector trained on a single video:

16 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A tank is running on the moon." --checkpoint_folder ./outputs/train/car_16/ --checkpoint_index 150 --noise_prior 0.5 --seed 8551187

Reference Video	Videos Generated by MotionDirector

"A car is running on the road."	"A tank is running on the moon." seed: 8551187	"A lion is running past the pyramids." seed: 431554	"A spaceship is flying past Mars." seed: 8808231

24 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A truck is running past the Arc de Triomphe." --checkpoint_folder ./outputs/train/car_24/ --checkpoint_index 150 --noise_prior 0.5 --width 576 --height 320 --num-frames 24 --seed 34543

Reference Video	Videos Generated by MotionDirector

"A car is running on the road."	"A truck is running past the Arc de Triomphe." seed: 34543	"An elephant is running in a forest." seed: 2171736