llava-v1.6-vicuna-13b-hf

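Project Overview: LLaVa-v1.6-vicuna-13b-hf

LLaVa-v1.6-vicuna-13b-hf is a multimodal chatbot that combines a large language model with a vision encoder. This version was proposed by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee, and aims to improve the accuracy of image understanding and text generation.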

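Background and Improvements

LLaVA-NeXT (also known as LLaVA-1.6) improves on its predecessor, LLaVA-1.5, by increasing the input image resolution and by training on a higher-quality visual instruction dataset. The goal is better performance on optical character recognition (OCR) and common-sense reasoning.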

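Model Description

LLaVa combines a pre-trained language model with a vision encoder for multimodal chatbot use cases. LLaVA 1.6 improves on version 1.5 in the following ways:

It is trained on a more diverse, higher-quality dataset.
It uses a dynamic high-resolution technique to improve how input images are processed.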
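
The dynamic high-resolution idea can be illustrated with a minimal sketch. It assumes the image is fitted onto one of a fixed set of candidate canvas sizes, choosing the canvas that keeps the most image detail with the least padding, before the canvas is cut into tiles for the vision encoder. The function name `select_best_resolution` and the candidate list (multiples of a 336-pixel tile) are assumptions made for illustration; they are not stated in this document.

```python
from typing import List, Optional, Tuple

Size = Tuple[int, int]  # (width, height)

# Hypothetical candidate canvas sizes, built from 336-pixel tiles (assumption).
CANDIDATE_RESOLUTIONS: List[Size] = [
    (336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008),
]


def select_best_resolution(original_size: Size, candidates: List[Size]) -> Optional[Size]:
    """Pick the canvas that preserves the most pixels, then wastes the least area."""
    orig_w, orig_h = original_size
    best: Optional[Size] = None
    best_effective = -1
    least_wasted = float("inf")
    for cand_w, cand_h in candidates:
        # Scale the image to fit inside the candidate while keeping aspect ratio.
        scale = min(cand_w / orig_w, cand_h / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Upscaling adds no real detail, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = cand_w * cand_h - effective
        if effective > best_effective or (
            effective == best_effective and wasted < least_wasted
        ):
            best, best_effective, least_wasted = (cand_w, cand_h), effective, wasted
    return best


print(select_best_resolution((1000, 500), CANDIDATE_RESOLUTIONS))  # (672, 336)
```

A wide 1000x500 image lands on the 672x336 canvas because it fits with no padding, whereas the larger canvases hold the same scaled image and only add empty area.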

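Intended Uses and Limitations

The model is suited to tasks such as image captioning, visual question answering, and multimodal chatbot use cases. See the model hub for other versions tuned to a task that interests you.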

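How to Use

The model expects the following prompt template. Keep the template in English, including the USER and ASSISTANT role markers, because that is the format the model was trained on:

"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is shown in this image? ASSISTANT:"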

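When the prompt is assembled by hand rather than through the processor's chat template, the template can be filled in with plain string formatting. The sketch below assumes the English Vicuna-style wording of the template; `SYSTEM_PROMPT` and `build_prompt` are hypothetical names introduced here for illustration.

```python
# English wording of the prompt template (assumed to match the model's training format).
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(question: str, with_image: bool = True) -> str:
    """Fill the template with a single user turn (hypothetical helper)."""
    # The <image> placeholder marks where the image features are inserted.
    image_part = "<image>\n" if with_image else ""
    return f"{SYSTEM_PROMPT} USER: {image_part}{question} ASSISTANT:"


print(build_prompt("What is shown in this image?"))
```

The returned string ends with "ASSISTANT:" so that generation continues with the model's reply.
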
A usage example follows:

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Load the processor (tokenizer plus image preprocessing) and the model in half precision.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-13b-hf")

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-vicuna-13b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Download the example image.
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What is shown in this image?"},
          {"type": "image"},
        ],
    },
]
# Render the conversation into the model's prompt format.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")

# Generate up to 100 new tokens, then decode prompt and reply back to text.
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
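
The decoded string usually contains the prompt followed by the reply, since the generated sequence starts with the input tokens. A minimal sketch for keeping only the reply, assuming the "ASSISTANT:" role marker from the prompt template appears in the decoded text; `extract_answer` is a hypothetical helper, not part of transformers.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last role marker, or the whole string if it is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Illustrative input only; the reply text here is made up, not real model output.
example = "USER: \nWhat is shown in this image? ASSISTANT: A radar chart."
print(extract_answer(example))  # A radar chart.
```

Splitting on the last occurrence of the marker keeps the helper usable for multi-turn conversations, where earlier assistant turns also appear in the text.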

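Model Optimization

4-bit quantization with the `bitsandbytes` library

First install bitsandbytes (`pip install bitsandbytes`) and make sure you have a CUDA-compatible GPU. Then change the model loading code as follows: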

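model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,  # added line: load the weights in 4-bit via bitsandbytes
)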
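
A back-of-the-envelope calculation shows why 4-bit loading matters for a model of this size. The sketch below assumes a nominal 13 billion parameters (taken from "13b" in the model name) and counts weight storage only; activations, the KV cache, the vision encoder's exact size, and quantization overhead are ignored, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # nominal parameter count implied by the model name (assumption)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"float16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
# float16: ~26.0 GB, 4-bit: ~6.5 GB
```

Under these assumptions the weights shrink to roughly a quarter of their float16 size.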

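Speeding up generation with Flash-Attention 2

First make sure flash-attn is installed; see the original Flash Attention repository for installation steps. Then change the model loading code as follows: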

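model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    # added line: replaces the deprecated use_flash_attention_2=True flag
    attn_implementation="flash_attention_2",
).to(0)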
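
Requesting Flash-Attention 2 when the flash-attn package is missing makes model loading fail, so it can help to check for the package first. The sketch below is one way to do that under stated assumptions: `pick_attn_implementation` is a hypothetical helper, and falling back to "sdpa" (PyTorch's built-in scaled dot-product attention) is an assumption about what is available in your transformers version. The check only confirms that the package can be found; it does not verify GPU support.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn can be found, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())
```

In recent transformers releases the result can be passed to `from_pretrained` through the `attn_implementation` argument.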

Citation

To cite this project, use the following BibTeX entry:

@misc{liu2023improved,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Haotian Liu and Chunyuan Li and Yuheng Li and Yong Jae Lee},
      year={2023},
      eprint={2310.03744},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}