Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
📣 News
- [2024/04/05] We have corrected the temporal evaluation of video understanding: the actual Temporal score is 47.9, not the previously reported 57.8. We sincerely apologize for any inconvenience our oversight may have caused.
- [2024/04/05] Chat-UniVi has been selected as a Highlight paper at CVPR 2024! (Top 3% of 11532 submissions).
- [2024/02/27] Our Chat-UniVi has been accepted by CVPR 2024!
- [2024/01/05] We enhanced the video loading code by adding support for variable-length videos, eliminating the previous zero-filling (padding) operation on videos. We find that this updated loading method significantly boosts performance (Results); a sketch of the idea appears after this news list.
- [2023/12/05] The visualization script is available at VISUALIZATION.md.
- [2023/11/22] ⚡ The online demo is available at Hugging Face Demo. Welcome to try!
- [2023/11/22] The processed data is available at DATA.md.
- [2023/11/21] 💡 We release Chat-UniVi-13B. Our proposed unified visual representation framework greatly reduces the number of visual tokens, so you can train a 13B unified image and video understanding model with full-parameter tuning directly on 8 A100 GPUs within 3 days. Chat-UniVi-13B achieves better performance (Results). The training code for Chat-UniVi-13B has been updated (TRAIN_AND_VALIDATE.md).
- [2023/11/21] We provide inference code for video understanding and image understanding.
- [2023/11/15] Code is available now! Welcome to watch 👀 this repository for the latest updates.
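As referenced in the [2024/01/05] news item, the gist of variable-length video loading is to sample frames uniformly from however many frames a clip actually has, rather than padding every clip to a fixed length with zero frames. Below is a minimal sketch of that idea, assuming the `decord` video reader; the function name and `max_frames` parameter are illustrative, not the repository's actual API.

```python
# Illustrative sketch of variable-length video loading (hypothetical names,
# not the repository's actual API). Instead of zero-padding every clip to a
# fixed frame count, sample at most `max_frames` frames spread uniformly
# over the frames the video actually contains.
import numpy as np
from decord import VideoReader, cpu


def load_video_frames(path: str, max_frames: int = 64) -> np.ndarray:
    vr = VideoReader(path, ctx=cpu(0))
    total = len(vr)
    num = min(total, max_frames)  # short clips keep all of their frames
    indices = np.linspace(0, total - 1, num=num, dtype=int)
    frames = vr.get_batch(indices).asnumpy()  # (num, H, W, 3), uint8
    return frames  # no zero-filled padding frames are appended
```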
😮 Highlights
💡 Unified visual representation for image and video
We employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationships required for videos.
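For a rough intuition of how a fixed token budget can cover both images and videos, here is a simplified stand-in for the merging step. The paper itself merges tokens with a parameter-free clustering step; the nearest-seed average-pooling below is our own illustrative simplification, and all names are hypothetical.

```python
# Simplified stand-in for dynamic visual tokens (not the paper's actual
# clustering): assign each ViT patch token to its nearest of k seed tokens
# by cosine similarity, then mean-pool each cluster so k tokens summarize
# the whole image or frame. Assumes k <= N.
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (N, D) patch features -> (k, D) merged dynamic tokens."""
    n, _ = tokens.shape
    seeds = tokens[torch.randperm(n)[:k]]                             # (k, D)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(seeds, dim=-1).T  # (N, k)
    assign = sim.argmax(dim=-1)                     # nearest seed per token
    merged = torch.stack([
        tokens[assign == c].mean(dim=0) if (assign == c).any() else seeds[c]
        for c in range(k)
    ])
    return merged
```

Applied within a frame and then across frames, merging of this kind is what allows a small, fixed token budget to carry both spatial detail and temporal structure.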
🔥 Joint training strategy, making LLMs understand both image and video
Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications.
🤗 High performance, complementary learning with image and video
Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.
⚡ Demo
Please first change the model path on line 15 of the corresponding demo script (main_demo_7B.py or main_demo_13B.py). Then run the demo:
```bash
# For Chat-UniVi-7B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_7B:app --host 0.0.0.0 --port 8888

# For Chat-UniVi-13B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_13B:app --host 0.0.0.0 --port 8888
```
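Once the server starts, the demo is reachable at the host and port set above, i.e., http://localhost:8888 when running on the local machine.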
A conversation with both image and video
A conversation with multiple videos
A conversation with multiple images
A conversation with a video
A conversation in Chinese
With a translation API, our model can also support Chinese conversations. We will add code for Chinese conversations in a future update.
🚀 Main Results
Image understanding
Following LLaVA, we report scores relative to GPT-4 on instruction-following questions (a sketch of how such relative scores are computed follows the table).
Methods | LLM | Conversation | Detail Description | Complex Reasoning | All |
---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 84.1 | 74.2 | 93.7 | 84.2 |
Chat-UniVi-13B | Vicuna-13B | 84.1 | 79.4 | 94.7 | 86.1 |
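For clarity on how these numbers are produced (our understanding of the LLaVA protocol, not code from this repository): GPT-4 rates both the model's answer and a GPT-4 reference answer to the same question on the same scale, and the table reports their ratio as a percentage.

```python
# Hedged sketch of a LLaVA-style relative score: GPT-4 rates both answers
# on the same scale, and the model's rating is reported relative to the
# GPT-4 reference rating.
def relative_score(model_rating: float, gpt4_rating: float) -> float:
    return 100.0 * model_rating / gpt4_rating
```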
Video understanding
Following Video-ChatGPT, we use GPT to rate how well the model output matches the ground truth. It is worth noting that the scores reported in Video-ChatGPT range from 0 to 5; to standardize the metrics, we normalize all scores to a scale of 0 to 100 (a worked example follows the table).
Methods | LLM | Correct | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 57.8 | 58.2 | 69.2 | 47.9 | 56.2 |
Chat-UniVi-13B | Vicuna-13B | 59.4 | 59.8 | 70.5 | - | 60.6 |
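As a concrete instance of the normalization described above (the arithmetic is ours; the 0-5 rating shown is back-computed from the reported 57.8):

```python
# Map a 0-5 GPT rating onto the 0-100 scale used in the table above.
def normalize(rating_0_to_5: float) -> float:
    return rating_0_to_5 / 5 * 100

print(round(normalize(2.89), 1))  # 57.8 -- the "Correct" score of Chat-UniVi-7B
```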
ScienceQA
We report both zero-shot and fine-tuning results on the ScienceQA test set.
The Subject columns are NAT, SOC, and LAN; the Context Modality columns are TXT, IMG, and NO; the Grade columns are G1-6 and G7-12.

Methods | LLM | Average | NAT | SOC | LAN | TXT | IMG | NO | G1-6 | G7-12 |
---|---|---|---|---|---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 88.78 | 88.50 | 93.03 | 85.91 | 88.51 | 85.97 | 88.15 | 88.88 | 88.60 |
Chat-UniVi-13B | Vicuna-13B | 90.99 | 90.41 | 95.05 | 88.91 | 89.64 | 88.05 | 90.94 | 91.19 | 90.64 |
VideoQA
We follow the evaluation protocol in Video-ChatGPT, i.e., employing GPT-assisted evaluation to assess the capabilities of models (a sketch of this style of evaluation follows the table below).
Methods | LLM |
---|---|
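As noted above, GPT-assisted evaluation asks GPT to compare a model's answer with the ground truth and emit a judgment plus a rating. The sketch below shows the general shape of such a judge; the prompt, model name, and JSON format are simplified placeholders rather than the official evaluation script, and it assumes the `openai` Python client with `OPENAI_API_KEY` set in the environment.

```python
# Minimal sketch of GPT-assisted QA evaluation in the style of Video-ChatGPT.
# The prompt and output parsing are simplified placeholders, not the official
# evaluation script.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer: str, prediction: str) -> dict:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        'Reply with JSON only: {"pred": "yes" or "no", "score": integer 0-5}.'
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; use whichever judge model you prefer
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```

In this style of protocol, accuracy is typically the fraction of "yes" judgments over the test set, and the mean rating gives the 0-5 score that the tables above normalize to a 0-100 scale.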