vit_tiny_patch16_224.augreg_in21k - a ViT image classification model with augmentation and regularization

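Project introduction: vit_tiny_patch16_224.augreg_in21k

Background

vit_tiny_patch16_224.augreg_in21k is a Vision Transformer (ViT) model for image classification. The paper authors originally trained it in JAX on the ImageNet-21k dataset, with additional data augmentation and regularization. Ross Wightman later ported it to PyTorch.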

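Model details

Model type: image classification / feature backbone
Model stats:
- Parameters (millions): 9.7
- GMACs: 1.1
- Activations (millions): 4.1
- Image size: 224 x 224 pixels
Papers:
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dataset: ImageNet-21k
Original implementation: available on GitHub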
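
The "patch16_224" part of the model name ties the patch size to the image size listed above. As a minimal sketch of that arithmetic, the helper below (vit_token_count is a hypothetical name, not part of timm) counts the tokens the transformer processes, assuming non-overlapping 16 x 16 patches and one extra class token as in the standard ViT design:

```python
def vit_token_count(image_size: int = 224, patch_size: int = 16, class_tokens: int = 1) -> int:
    """Count the tokens a ViT sees for a square image cut into square patches.

    Assumes non-overlapping patches and `class_tokens` extra learned tokens
    (one class token in the standard ViT design).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    return patches_per_side * patches_per_side + class_tokens


print(vit_token_count())  # 14 * 14 patches + 1 class token = 197
```

Under these assumptions a 224 x 224 input yields 196 patch tokens plus the class token.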

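Model usage

Image classification

For image classification, the model can be loaded and applied through the PyTorch Image Models library (timm). The pretrained model takes an input image and returns per-class scores, which softmax converts into class probabilities. A basic usage example: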

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))

model = timm.create_model('vit_tiny_patch16_224.augreg_in21k', pretrained=True)
model = model.eval()

# get model-specific transforms (e.g. normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze the single image into a batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
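
To make the last line of the example concrete, here is a dependency-free sketch of what the softmax-then-top-k step computes. The logits are made-up values for six hypothetical classes, and softmax_percent and top_k are illustrative helpers, not timm or PyTorch functions:

```python
import math


def softmax_percent(logits):
    """Convert raw logits into percentages that sum to 100."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


# Made-up scores for 6 hypothetical classes (not real model output).
logits = [0.5, 2.0, -1.0, 3.0, 0.0, 1.0]
probs, indices = top_k(softmax_percent(logits), k=5)

for p, i in zip(probs, indices):
    print(f"class {i}: {p:.1f}%")
```

The returned indices are positions in the model's output vector; turning them into readable names requires the label list that matches the classifier head.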

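Image embeddings

Beyond classification, the model can produce image embeddings, which are commonly used for feature extraction in computer vision tasks. Removing the classifier layer (nn.Linear) makes the model return the image's feature vector directly.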

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))

model = timm.create_model('vit_tiny_patch16_224.augreg_in21k', pretrained=True, num_classes=0)  # remove the classifier layer
model = model.eval()

# get model-specific transforms (e.g. normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is a (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))  # output is an unpooled tensor

output = model.forward_head(output, pre_logits=True)  # output is a (1, num_features) shaped tensor
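
One common use of such feature vectors is comparing images by cosine similarity. The sketch below is a minimal, dependency-free illustration: the three short vectors are made-up stand-ins for real embeddings (which would have num_features entries each), and cosine_similarity is an illustrative helper, not a timm function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional stand-ins for image embeddings.
emb_a = [0.2, -0.1, 0.7, 0.4]
emb_b = [0.4, -0.2, 1.4, 0.8]   # same direction as emb_a, twice the length
emb_c = [-0.7, 0.4, 0.2, 0.1]   # orthogonal to emb_a

print(f"a vs b: {cosine_similarity(emb_a, emb_b):.3f}")
print(f"a vs c: {cosine_similarity(emb_a, emb_c):.3f}")
```

With real model output, each (1, num_features) tensor would first be flattened to a list (for example via output[0].tolist()) before being passed to a helper like this.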

Model comparison

Detailed dataset performance and runtime metrics for this model can be explored on the timm model results page.

Citation

To cite this model or the related literature in research, use the following references:

@article{steiner2021augreg,
  title={How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers},
  author={Steiner, Andreas and Kolesnikov, Alexander and Zhai, Xiaohua and Wightman, Ross and Uszkoreit, Jakob and Beyer, Lucas},
  journal={arXiv preprint arXiv:2106.10270},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}