InternViT-6B-448px-V1-5 - Higher Input Resolution and Better Multilingual OCR Accuracy for a Vision Model

InternViT-6B-448px-V1-5 Project Introduction

Project Overview

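InternViT-6B-448px-V1-5 is built on the strong pretrained foundation of InternViT-6B-448px-V1-2. This update extends the training image resolution from a fixed 448×448 to a dynamic range: the basic tile size is 448×448, and an image can be split into 1 to 12 such tiles. The scale, quality, and diversity of the pretraining dataset were also improved, giving InternViT-6B-448px-V1-5 greater robustness, stronger optical character recognition (OCR) capability, and better high-resolution processing.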
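
To make the dynamic-resolution idea concrete, here is a minimal sketch of one way to pick a tile grid for an image. Only the 448×448 tile size and the 12-tile limit come from the text above; the function names `choose_grid` and `tile_boxes` and the selection rule (closest aspect ratio, then closest pixel area) are illustrative assumptions, not necessarily the procedure used to train the model.

```python
from typing import List, Tuple

TILE = 448       # base tile size, from the project description
MAX_TILES = 12   # maximum number of tiles, from the project description


def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> Tuple[int, int]:
    """Pick (cols, rows) with cols * rows <= max_tiles.

    Assumed rule (hypothetical): prefer the grid whose aspect ratio is
    closest to the image's; break ties by closest total pixel area.
    """
    target_ratio = width / height
    image_area = width * height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (
            abs(g[0] / g[1] - target_ratio),
            abs(g[0] * g[1] * TILE * TILE - image_area),
        ),
    )


def tile_boxes(cols: int, rows: int, tile: int = TILE) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom), row by row, for an image
    that has already been resized to (cols * tile, rows * tile)."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


grid = choose_grid(1920, 1080)
boxes = tile_boxes(*grid)
print(grid, len(boxes))
```

Each box could then be passed to `Image.crop` after resizing, with every resulting 448×448 tile encoded as its own image.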

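Model Details

Model type: vision foundation model, feature backbone
Model statistics:
- Parameters (M): 5540 (the last three blocks are discarded)
- Image size: 448 x 448, with 1 to 12 tiles used during training
Pretraining datasets: LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and other OCR-related datasets. To strengthen the model's OCR capability, additional OCR data was included alongside the general caption datasets. Specifically, PaddleOCR was used to run Chinese OCR on images taken from Wukong and English OCR on images from LAION-COCO.
Note: InternViT-6B originally had 48 blocks, and we found that the output of the fourth-to-last block worked best. For ease of use and to reduce GPU memory usage, only 45 blocks are kept, which cuts the parameter count from 5.9B to 5.5B. The last layer of the released model therefore corresponds to that fourth-to-last block, so if you build a multimodal large language model (MLLM) on top of this model, be sure to use the features from the last layer.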
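
As a rough sanity check on the 5.9B to 5.5B reduction, the sketch below estimates how many weights three transformer blocks hold. The hidden width (3200) and MLP width (12800) are assumptions that are not stated in this document; verify them against the model's `config.json` before relying on the numbers. Biases, normalization layers, and embeddings are ignored.

```python
# Back-of-envelope estimate of the parameters removed by dropping 3 blocks.
# ASSUMPTION: HIDDEN and MLP are not given in this document; check config.json.
HIDDEN = 3200
MLP = 12800
TOTAL_BLOCKS = 48  # original block count, from the note above
KEPT_BLOCKS = 45   # retained block count, from the note above


def block_params(hidden: int, mlp: int) -> int:
    """Approximate weight count of one transformer block
    (biases and normalization parameters are ignored)."""
    attention = 4 * hidden * hidden  # Q, K, V, and output projections
    feed_forward = 2 * hidden * mlp  # up and down projections
    return attention + feed_forward


removed = (TOTAL_BLOCKS - KEPT_BLOCKS) * block_params(HIDDEN, MLP)
print(f"approx. parameters removed: {removed / 1e6:.0f}M")
```

Under these assumptions the three dropped blocks account for roughly 0.37B parameters, which is in line with the stated drop from 5.9B to about 5.5B.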

Usage Example (Image Embeddings)

The following Python example shows how to load InternViT-6B-448px-V1-5 and apply it to an image to obtain embeddings:

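import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the model in bfloat16, move it to the GPU, and switch to inference mode.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

# Read the input image and make sure it has three RGB channels.
image = Image.open('./examples/image1.jpg').convert('RGB')

# The image processor handles resizing and normalization for the model.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')

# Convert the image to a tensor that matches the model's dtype and device.
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Forward pass; `outputs` holds the image features.
outputs = model(pixel_values)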

Acknowledgements and Citation

If you find this project useful in your research, please consider citing the following papers:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}

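With dynamic-resolution training and a larger, more diverse, OCR-enriched pretraining dataset, InternViT-6B-448px-V1-5 improves on V1-2 in robustness, OCR, and high-resolution processing, making it a capable backbone for image feature extraction and vision-language tasks.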