bakLlava-v1-hf

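Introduction to the bakLlava-v1-hf project

Project background

bakLlava-v1-hf is an AI model built on the original Llava architecture, with Mistral-7b as its text base. Its main goal is to connect images and text: given an image and a text prompt, the model generates text about that image, such as a description or an answer to a question.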

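Model overview

bakLlava is a 7B-class model: a Mistral 7B base augmented with the LLaVA 1.5 architecture. In this version, the Mistral 7B base outperforms Llama 2 13B on several benchmarks. The project is open source and under active update, with the aim of giving users a more convenient environment for fine-tuning and inference. Note, however, that some of the data used for bakLlava-1 comes from the LLaVA corpus, which is not commercially licensed; this is to be addressed in a future release.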

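Future development of bakLlava

The development team is working on bakLlava 2, which is planned to use a larger (commercially licensed) dataset and a new architecture. The aim is to move past the current limitations of bakLlava-1 so that the model can be used more flexibly, and legally, in commercial applications.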

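How to use the model

The bakLlava model supports multi-image and multi-prompt generation. Make sure that transformers >= 4.35.3 is installed, and follow the expected prompt template: use the format `USER: xxx\nASSISTANT:` and add the `<image>` token at the position where the image should be queried.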
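As an illustration of the template described above, the following is a minimal sketch of a prompt builder. The helper name `build_prompt` is hypothetical, and placing the `<image>` tokens on their own lines before the question is an assumption; the source only states that `<image>` goes wherever the image should be queried.

```python
def build_prompt(question: str, num_images: int = 1) -> str:
    """Format one query using the USER/ASSISTANT template.

    Assumption: each image gets one "<image>" token, placed before the
    question. Move the tokens if the images should be referenced elsewhere.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    image_tokens = "\n".join(["<image>"] * num_images)
    return f"USER: {image_tokens}\n{question}\nASSISTANT:"


# One prompt per query; a list like this can be used for multi-prompt generation.
prompts = [
    build_prompt("What are these?"),
    build_prompt("What is different between the two pictures?", num_images=2),
]
for prompt in prompts:
    print(repr(prompt))
```

The number of `<image>` tokens in a prompt should match the number of images passed alongside it.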

You can run an example of the model in the Google Colab demo, or try the Spaces demo.
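When running locally rather than in a hosted demo, the version requirement mentioned above can be checked before loading the model. The sketch below is one possible way to do this with the standard library only; `parse_version` and `meets_minimum` are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes such as `.dev0` or `rc1`.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string.

    "4.36.0.dev0" becomes (4, 36, 0); parsing stops at the first
    component that does not start with a digit.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.35.3") -> bool:
    """Check whether an installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("4.35.2"))
print(meets_minimum("4.36.0.dev0"))
```

In practice the installed version would come from `transformers.__version__`, for example `meets_minimum(transformers.__version__)`.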

Usage

Using the `pipeline` API

Use `pipeline` as follows for image-to-text generation:

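from transformers import AutoProcessor, pipeline
from PIL import Image
import requests

model_id = "llava-hf/bakLlava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)
# The processor is needed below to turn the conversation into a prompt string.
processor = AutoProcessor.from_pretrained(model_id)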

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
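The `image-to-text` pipeline returns a list of dictionaries with a `generated_text` field, and that text may still contain the prompt. As a rough sketch, the reply can be separated from the prompt by splitting on the `ASSISTANT:` marker. The helper `extract_answer` is hypothetical, and the sample below is hand-written for illustration, not real model output.

```python
def extract_answer(outputs, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker in the first pipeline result.

    If the marker is absent, the full generated text is returned instead.
    """
    text = outputs[0]["generated_text"]
    _, found, answer = text.rpartition(marker)
    return answer.strip() if found else text.strip()


# Hand-written stand-in for a pipeline result (not actual model output).
sample_outputs = [
    {"generated_text": "USER: What are these?\nASSISTANT: Two cats on a couch."}
]
print(extract_answer(sample_outputs))
```

Splitting on the last marker keeps the helper usable when earlier turns of a conversation also contain `ASSISTANT:`.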

Using pure `transformers`

The following script runs the model on a GPU device:

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/bakLlava-v1-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))