InternVL2-2B-AWQ - 跨多语言多图像任务的高效视觉语言模型

InternVL2-2B-AWQ项目介绍

项目背景

InternVL2-2B-AWQ是一个提供图片到文字转换的多模态模型项目。它基于OpenGVLab的InternVL2-2B基础模型，利用了先进的量化技术来提升模型的推理速度。该项目不仅支持图像和视频的识别和描述，还能通过自定义代码实现更复杂的视觉和语言结合任务。

技术细节

InternVL2-2B-AWQ项目采用了一种称为AWQ的量化算法，这是INT4的权重量化方法。通过高性能的CUDA内核支持，4bit量化模型的推理速度比传统FP16计算快了2.4倍。这对于需要快速处理大规模数据的应用场景来说，极具吸引力。

支持的GPU型号

该项目支持以下NVIDIA的GPU型号进行W4A16推理：

Turing (sm75): 20系列, T4
Ampere (sm80, sm86): 30系列, A10, A16, A30, A100
Ada Lovelace (sm90): 40系列

在进行量化和推理之前，需要确保已经安装lmdeploy软件包。

pip install lmdeploy==0.5.3

推理功能

InternVL2-2B-AWQ提供支持批量离线推理的功能，可以通过以下示例代码进行尝试：

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-2B-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
backend_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline(model, backend_config=backend_config, log_level='INFO')
response = pipe(('describe this image', image))
print(response.text)

有关更多管道参数的信息，请参阅官方的文档。

服务部署

使用LMDeploy的api_server，可以通过一行命令轻松将模型打包成服务。它提供的RESTful API与OpenAI的接口兼容。以下是服务启动的示例：

lmdeploy serve api_server OpenGVLab/InternVL2-2B-AWQ --backend turbomind --server-port 23333 --model-format awq

为了使用OpenAI样式的接口，需要安装OpenAI库：

pip install openai

然后，通过以下代码进行API调用：

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

许可和引用

该项目根据MIT许可协议开放，而InternLM2则根据Apache-2.0许可协议开放。如果您在研究中发现该项目有用，请考虑引用相关论文：

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}

总之，InternVL2-2B-AWQ项目在视觉基础模型的扩展和通用视觉语言任务的对齐上做出了显著进展，为研究和商业应用提供了强有力的支持。