vit_base_patch32_224.augreg_in21k_ft_in1k - A ViT-based image classification model compatible with PyTorch

Project introduction: vit_base_patch32_224.augreg_in21k_ft_in1k

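Overview

vit_base_patch32_224.augreg_in21k_ft_in1k is a Vision Transformer (ViT) model for image classification. It was pretrained on ImageNet-21k and fine-tuned on ImageNet-1k, with additional data augmentation and regularization applied during training. The model was originally trained in JAX by the paper authors and ported to PyTorch by Ross Wightman.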

Model details

Model type: Image classification / feature backbone
Model stats:
- Params (M): 88.2
- GMACs: 4.4
- Activations (M): 4.2
- Image size: 224 x 224
Papers:
- "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers". Link: paper link
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". Link: paper link
Dataset: ImageNet-1k
Pretraining dataset: ImageNet-21k
Original project: GitHub link
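
The reported parameter count can be sanity-checked with some arithmetic. The sketch below assumes the standard ViT-Base configuration (embedding width 768, 12 blocks, MLP ratio 4, a class token, and learned position embeddings). None of these values are stated in this card, so treat them as assumptions rather than a description of the exact timm implementation.

```python
# Back-of-the-envelope parameter count for a ViT-Base/32 classifier.
# The hyperparameters marked "assumed" are the standard ViT-Base values;
# the model card itself only reports the total.
IMAGE_SIZE = 224
PATCH_SIZE = 32
IN_CHANNELS = 3
EMBED_DIM = 768      # assumed ViT-Base width
DEPTH = 12           # assumed number of transformer blocks
MLP_RATIO = 4        # assumed hidden-layer expansion in each MLP
NUM_CLASSES = 1000   # ImageNet-1k


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def vit_param_count() -> int:
    num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2   # 7 * 7 = 49 patches
    num_tokens = num_patches + 1                    # plus one class token

    # A strided convolution over patches has the same size as a linear layer
    # applied to each flattened patch.
    patch_embed = linear_params(IN_CHANNELS * PATCH_SIZE * PATCH_SIZE, EMBED_DIM)
    cls_token = EMBED_DIM
    pos_embed = num_tokens * EMBED_DIM

    hidden = MLP_RATIO * EMBED_DIM
    block = (
        layer_norm_params(EMBED_DIM)
        + linear_params(EMBED_DIM, 3 * EMBED_DIM)   # fused query/key/value
        + linear_params(EMBED_DIM, EMBED_DIM)       # attention output projection
        + layer_norm_params(EMBED_DIM)
        + linear_params(EMBED_DIM, hidden)          # MLP expand
        + linear_params(hidden, EMBED_DIM)          # MLP contract
    )

    final_norm = layer_norm_params(EMBED_DIM)
    head = linear_params(EMBED_DIM, NUM_CLASSES)

    return patch_embed + cls_token + pos_embed + DEPTH * block + final_norm + head


total = vit_param_count()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the total comes to 88,224,232, which rounds to the 88.2 million listed above.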

Model usage

Image classification

To classify an image with this model:

from urllib.request import urlopen
from PIL import Image
import torch  # required for torch.topk below
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch32_224.augreg_in21k_ft_in1k', pretrained=True)
model = model.eval()

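# Get the model-specific transforms (normalization, resizing)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# Add a batch dimension of 1, then run the forward pass
output = model(transforms(img).unsqueeze(0))

# Convert logits to percentages and keep the five most likely classes
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)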
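
To make the last line of the snippet above concrete, here is a minimal pure-Python sketch of the same two steps: a softmax scaled to percentages, followed by a top-k selection. The `softmax` and `top_k` helpers and the six logits are made up for illustration; the real model emits one logit per ImageNet-1k class, and `torch.topk` does the selection on tensors.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


# Hypothetical logits for a toy 6-class problem.
logits = [0.5, 2.0, -1.0, 3.5, 0.0, 1.0]

percentages = [p * 100 for p in softmax(logits)]
top_values, top_indices = top_k(percentages, k=3)

for value, index in zip(top_values, top_indices):
    print(f"class {index}: {value:.1f}%")
```

The returned indices are positions in the model's output vector, so mapping them to human-readable labels requires the ImageNet-1k class list.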

Image embeddings

To obtain image embeddings:

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch32_224.augreg_in21k_ft_in1k',
    pretrained=True,
    num_classes=0
)
model = model.eval()

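# Get the model-specific transforms (normalization, resizing)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# With num_classes=0 the classifier is removed, so the output is a pooled (batch_size, num_features) tensor
output = model(transforms(img).unsqueeze(0))

# Or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))  # unpooled token features
output = model.forward_head(output, pre_logits=True)  # pooled (batch_size, num_features) tensor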
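
One common way to use embeddings like these is to compare two images by the cosine similarity of their vectors. The sketch below is illustrative only: `cosine_similarity` is a hypothetical helper, and the four-element lists are made-up stand-ins for real embeddings, which are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for embeddings of three images.
embedding_a = [0.2, -1.3, 0.7, 0.0]
embedding_b = [0.4, -2.6, 1.4, 0.0]   # same direction as embedding_a, twice the length
embedding_c = [1.3, 0.2, 0.0, 0.7]    # orthogonal to embedding_a

sim_ab = cosine_similarity(embedding_a, embedding_b)
sim_ac = cosine_similarity(embedding_a, embedding_c)

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions. With real model outputs, each tensor row would first be converted to a list or the computation done directly on tensors.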

Model comparison

Performance data and runtime metrics for this model can be explored on the timm model results page.

Citation

To cite the related papers, use the following BibTeX entries:

@article{steiner2021augreg,
  title={How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers},
  author={Steiner, Andreas and Kolesnikov, Alexander and Zhai, Xiaohua and Wightman, Ross and Uszkoreit, Jakob and Beyer, Lucas},
  journal={arXiv preprint arXiv:2106.10270},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}