beit_base_patch16_384.in22k_ft_in22k_in1k - A BEiT image classification and embedding model with self-supervised pre-training

Project introduction: beit_base_patch16_384.in22k_ft_in22k_in1k

Overview

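beit_base_patch16_384.in22k_ft_in22k_in1k is a BEiT image classification model. It was pre-trained on ImageNet-22k with self-supervised masked image modelling (MIM): some image patches are masked and the model learns to predict the visual tokens of the masked patches, with the discrete variational autoencoder (dVAE) from DALL-E serving as the visual tokenizer. The pre-trained model was then fine-tuned on ImageNet-22k and subsequently on ImageNet-1k.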
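
The masking step of masked image modelling can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `sample_patch_mask`, the 14 x 14 grid, the 40% mask ratio, and the use of uniform random masking are all hypothetical choices made for this example, not a description of how this checkpoint was actually trained.

```python
import random


def sample_patch_mask(grid_size, mask_ratio, seed=0):
    """Return a flat list of booleans, True where a patch is masked.

    Uniform random masking is an assumption of this sketch; the real
    pre-training recipe may select masked patches differently.
    """
    num_patches = grid_size * grid_size
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked_indices = set(rng.sample(range(num_patches), num_masked))
    return [index in masked_indices for index in range(num_patches)]


# Hypothetical settings: a 14 x 14 patch grid with 40% of the patches masked.
mask = sample_patch_mask(grid_size=14, mask_ratio=0.4)
print(sum(mask), "of", len(mask), "patches masked")
```

During pre-training, the model would see the image with the masked patches hidden and be asked to predict the tokenizer's visual token at each masked position.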

Model Details

Model type: Image classification / feature backbone
Model stats:
- Params (M): 86.7
- GMACs: 55.5
- Activations (M): 101.6
- Image size: 384 x 384
Papers:
- BEiT: BERT Pre-Training of Image Transformers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dataset: ImageNet-1k
Pretraining dataset: ImageNet-22k
Code repository: GitHub project page
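
The image size and patch size determine how many tokens the transformer processes. The short calculation below is a sketch of that arithmetic: the patch size of 16 is read from "patch16" in the model name, and the extra class token is assumed from the standard ViT design. The result agrees with the (1, 577, 768) unpooled shape noted in the embedding example in this document.

```python
image_size = 384  # from the model details
patch_size = 16   # read from "patch16" in the model name

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
num_tokens = num_patches + 1  # assumption: one extra class token, as in ViT

print(patches_per_side, num_patches, num_tokens)  # 24 576 577
```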

Model Usage

Image Classification

The model can be loaded and run with the timm library. The following example classifies a single image:

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('beit_base_patch16_384.in22k_ft_in22k_in1k', pretrained=True)
model = model.eval()

data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
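
To make the last line concrete, the sketch below reproduces in plain Python what `torch.topk(output.softmax(dim=1) * 100, k=5)` computes for one image: a softmax over the logits, scaled to percentages, followed by selecting the k largest entries. The helper `top_k_percentages` and the six example logits are hypothetical and exist only for illustration; the real model outputs one logit per ImageNet-1k class.

```python
import math


def top_k_percentages(logits, k=5):
    """Softmax the logits, scale to percentages, and return the top k as (index, percent)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    percents = [100.0 * e / total for e in exps]
    ranked = sorted(enumerate(percents), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for a 6-class toy problem.
example_logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]
for class_index, percent in top_k_percentages(example_logits, k=3):
    print(f"class {class_index}: {percent:.1f}%")
```

With the real model, the returned indices would still need to be mapped to ImageNet-1k label names to be human-readable.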

Image Embeddings

The following code shows how to obtain a feature vector for an image:

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'beit_base_patch16_384.in22k_ft_in22k_in1k',
    pretrained=True,
    num_classes=0,  # remove the classifier layer
)
model = model.eval()

data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is a (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is an unpooled (1, 577, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
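
One common way to use such feature vectors is to compare two images by the cosine similarity of their embeddings. The sketch below is an assumption about downstream usage rather than part of the timm API: `cosine_similarity` is a hypothetical helper, and the short 4-dimensional lists stand in for the much longer vectors the model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for real embeddings.
emb_a = [0.2, 0.1, 0.9, 0.4]
emb_b = [0.25, 0.05, 0.8, 0.5]   # similar direction to emb_a
emb_c = [-0.7, 0.6, 0.1, -0.3]   # different direction from emb_a

print(cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c))  # True
```

With real model outputs, a tensor row can be converted with `output[0].tolist()` before being passed to the helper.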

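Model Comparison

Dataset and runtime metrics for this model can be compared against other models in the timm model results.

Citation

If you use this model in academic research, please cite the following works: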

@article{bao2021beit,
  title={Beit: Bert pre-training of image transformers},
  author={Bao, Hangbo and Dong, Li and Piao, Songhao and Wei, Furu},
  journal={arXiv preprint arXiv:2106.08254},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}