Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
📣 News
- [2024/04/05] We have corrected the temporal evaluation of video understanding: the actual Temporal score is 47.9, not the previously reported 57.8. We sincerely apologize for any inconvenience our oversight may have caused.
- [2024/04/05] Chat-UniVi has been selected as a Highlight paper at CVPR 2024! (Top 3% of 11532 submissions).
- [2024/02/27] Our Chat-UniVi has been accepted by CVPR 2024!
- [2024/01/05] We enhanced the video loading code by adding support for variable-length videos, eliminating the previous zero-filling (padding) operation on videos. We find that this updated loading method significantly boosts performance (Results); a sketch of the idea appears after this news list.
- [2023/12/05] The visualization script is available at VISUALIZATION.md.
- [2023/11/22] ⚡ The online demo is available at Hugging Face Demo. Welcome to try!
- [2023/11/22] The processed data is available at DATA.md.
- [2023/11/21] 💡 We release Chat-UniVi-13B. Our proposed unified visual representation framework greatly reduces the number of visual tokens, so you can train a 13B unified image and video understanding model with full-parameter tuning directly on 8 A100 GPUs within 3 days. Chat-UniVi-13B achieves better performance (Results). The training code for Chat-UniVi-13B has been updated (TRAIN_AND_VALIDATE.md).
- [2023/11/21] We provide inference code for video understanding and image understanding.
- [2023/11/15] Code is available now! Welcome to watch 👀 this repository for the latest updates.
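As referenced in the [2024/01/05] news item, the gist of variable-length video loading is to sample frames uniformly from however many frames a clip actually has, rather than padding every clip to a fixed length with zero frames. Below is a minimal sketch of that idea, assuming the `decord` video reader; the function name and `max_frames` parameter are illustrative, not the repository's actual API.

```python
# Illustrative sketch of variable-length video loading (hypothetical names,
# not the repository's actual API). Instead of zero-padding every clip to a
# fixed frame count, sample at most `max_frames` frames spread uniformly
# over the frames the video actually contains.
import numpy as np
from decord import VideoReader, cpu


def load_video_frames(path: str, max_frames: int = 64) -> np.ndarray:
    vr = VideoReader(path, ctx=cpu(0))
    total = len(vr)
    num = min(total, max_frames)  # short clips keep all of their frames
    indices = np.linspace(0, total - 1, num=num, dtype=int)
    frames = vr.get_batch(indices).asnumpy()  # (num, H, W, 3), uint8
    return frames  # no zero-filled padding frames are appended
```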
😮 Highlights
💡 Unified visual representation for image and video
We employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationships required for videos.
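For a rough intuition of how a fixed token budget can cover both images and videos, here is a simplified stand-in for the merging step. The paper itself merges tokens with a parameter-free clustering step; the nearest-seed average-pooling below is our own illustrative simplification, and all names are hypothetical.

```python
# Simplified stand-in for dynamic visual tokens (not the paper's actual
# clustering): assign each ViT patch token to its nearest of k seed tokens
# by cosine similarity, then mean-pool each cluster so k tokens summarize
# the whole image or frame. Assumes k <= N.
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (N, D) patch features -> (k, D) merged dynamic tokens."""
    n, _ = tokens.shape
    seeds = tokens[torch.randperm(n)[:k]]                             # (k, D)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(seeds, dim=-1).T  # (N, k)
    assign = sim.argmax(dim=-1)                     # nearest seed per token
    merged = torch.stack([
        tokens[assign == c].mean(dim=0) if (assign == c).any() else seeds[c]
        for c in range(k)
    ])
    return merged
```

Applied within a frame and then across frames, merging of this kind is what allows a small, fixed token budget to carry both spatial detail and temporal structure.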
🔥 Joint training strategy, making LLMs understand both image and video
Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications.
🤗 High performance, complementary learning with image and video
Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.
⚡ Demo
Please first change the model path on line 15 of the corresponding demo script (main_demo_7B.py or main_demo_13B.py). Then run the demo:
```bash
# For Chat-UniVi-7B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_7B:app --host 0.0.0.0 --port 8888

# For Chat-UniVi-13B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_13B:app --host 0.0.0.0 --port 8888
```
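Once the server starts, the demo is reachable at the host and port set above, i.e., http://localhost:8888 when running on the local machine.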
A conversation with both image and video
A conversation with multiple videos
A conversation with multiple images
A conversation with a video
A conversation in Chinese
With a translation API, our model can also support Chinese conversations. We will add code for Chinese conversations in a future update.
🚀 Main Results
Image understanding
Following LLaVA, we report scores relative to GPT-4 on instruction-following questions (a sketch of how such relative scores are computed follows the table).
Methods | LLM | Conversation | Detail Description | Complex Reasoning | All |
---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 84.1 | 74.2 | 93.7 | 84.2 |
Chat-UniVi-13B | Vicuna-13B | 84.1 | 79.4 | 94.7 | 86.1 |
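For clarity on how these numbers are produced (our understanding of the LLaVA protocol, not code from this repository): GPT-4 rates both the model's answer and a GPT-4 reference answer to the same question on the same scale, and the table reports their ratio as a percentage.

```python
# Hedged sketch of a LLaVA-style relative score: GPT-4 rates both answers
# on the same scale, and the model's rating is reported relative to the
# GPT-4 reference rating.
def relative_score(model_rating: float, gpt4_rating: float) -> float:
    return 100.0 * model_rating / gpt4_rating
```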
Video understanding
Following Video-ChatGPT, we use GPT to rate how well the model output matches the ground truth. It is worth noting that the scores reported in Video-ChatGPT range from 0 to 5; to standardize the metrics, we normalize all scores to a scale of 0 to 100 (a worked example follows the table).
Methods | LLM | Correct | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 57.8 | 58.2 | 69.2 | 47.9 | 56.2 |
Chat-UniVi-13B | Vicuna-13B | 59.4 | 59.8 | 70.5 | - | 60.6 |
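As a concrete instance of the normalization described above (the arithmetic is ours; the 0-5 rating shown is back-computed from the reported 57.8):

```python
# Map a 0-5 GPT rating onto the 0-100 scale used in the table above.
def normalize(rating_0_to_5: float) -> float:
    return rating_0_to_5 / 5 * 100

print(round(normalize(2.89), 1))  # 57.8 -- the "Correct" score of Chat-UniVi-7B
```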
ScienceQA
We report both zero-shot and fine-tuning results on the ScienceQA test set.
The Subject columns are NAT, SOC, and LAN; the Context Modality columns are TXT, IMG, and NO; the Grade columns are G1-6 and G7-12.

Methods | LLM | Average | NAT | SOC | LAN | TXT | IMG | NO | G1-6 | G7-12 |
---|---|---|---|---|---|---|---|---|---|---|
Chat-UniVi-7B | Vicuna-7B | 88.78 | 88.50 | 93.03 | 85.91 | 88.51 | 85.97 | 88.15 | 88.88 | 88.60 |
Chat-UniVi-13B | Vicuna-13B | 90.99 | 90.41 | 95.05 | 88.91 | 89.64 | 88.05 | 90.94 | 91.19 | 90.64 |
VideoQA
We follow the evaluation protocol in Video-ChatGPT, i.e., employing GPT-assisted evaluation to assess the capabilities of models (a sketch of this style of evaluation follows the table below).
Methods | LLM |
---|---|
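As noted above, GPT-assisted evaluation asks GPT to compare a model's answer with the ground truth and emit a judgment plus a rating. The sketch below shows the general shape of such a judge; the prompt, model name, and JSON format are simplified placeholders rather than the official evaluation script, and it assumes the `openai` Python client with `OPENAI_API_KEY` set in the environment.

```python
# Minimal sketch of GPT-assisted QA evaluation in the style of Video-ChatGPT.
# The prompt and output parsing are simplified placeholders, not the official
# evaluation script.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer: str, prediction: str) -> dict:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        'Reply with JSON only: {"pred": "yes" or "no", "score": integer 0-5}.'
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; use whichever judge model you prefer
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```

In this style of protocol, accuracy is typically the fraction of "yes" judgments over the test set, and the mean rating gives the 0-5 score that the tables above normalize to a 0-100 scale.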