Open-Sora: Democratizing Efficient Video Production for All
We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.
[Chinese documentation] [Luchen Cloud | OpenSora image | Video tutorial]
📰 News
- [2024.06.17] 🔥 We released Open-Sora 1.2, which includes 3D-VAE, rectified flow, and score condition. The video quality is greatly improved. [checkpoints] [report] [blog]
- [2024.04.25] 🤗 We released the Gradio demo for Open-Sora on Hugging Face Spaces.
- [2024.04.25] We released Open-Sora 1.1, which supports text-to-image, text-to-video, image-to-video, video-to-video, and infinite-time generation at 2s~15s, 144p to 720p, and any aspect ratio. In addition, a full video processing pipeline is released. [checkpoints] [report]
- [2024.03.18] We released Open-Sora 1.0, a fully open-source project for video generation. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with acceleration, inference, and more. Our model can produce 2s 512x512 videos with only 3 days of training. [checkpoints] [blog] [report]
- [2024.03.04] Open-Sora provides training with 46% cost reduction. [blog]
🎥 Latest Demo
🔥 You can experience Open-Sora on our 🤗 Gradio application on Hugging Face. More samples and corresponding prompts are available in our Gallery.
OpenSora 1.0 Demo
Videos are downsampled to .gif for display. Click for original videos. Prompts are trimmed for display; see here for full prompts.
🔆 New Features/Updates
- 📍 Open-Sora 1.2 released. Model weights are available here. See our report 1.2 for more details.
- ✅ Support rectified flow scheduling.
- ✅ Support more conditioning including fps, aesthetic score, motion strength and camera motion.
- ✅ Trained our 3D-VAE for temporal dimension compression.
- 📍 Open-Sora 1.1 released. Model weights are available here. It is trained on videos of 0s~15s, 144p to 720p, and various aspect ratios. See our report 1.1 for more discussions.
- 🔧 Data processing pipeline v1.1 is released. An automatic processing pipeline from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering (aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
- ✅ The improved ST-DiT architecture includes RoPE positional encoding, QK norm, longer text length, etc.
- ✅ Support training with any resolution, aspect ratio, and duration (including images).
- ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
- 📍 Open-Sora 1.0 released. Model weights are available here. With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos. See our report 1.0 for more discussions.
- ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
- ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64x512x512 videos. Details are in acceleration.md.
- 🔧 Data preprocessing pipeline v1.0, including downloading, video cutting, and captioning tools. Our data collection plan can be found at datasets.md.
- ✅ We find that the VQ-VAE from VideoGPT has low quality and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension deteriorates the quality. See our report for more discussions.
- ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
- ✅ Support CLIP and T5 text conditioning.
- ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See commands.md for more instructions.
- ✅ Support inference with official weights from DiT,