vit_small_patch8_224.dino - A Transformer for Image Feature Extraction Trained with Self-Supervised DINO

Introduction to vit_small_patch8_224.dino

Background

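vit_small_patch8_224.dino is a Vision Transformer (ViT) model for image feature extraction. It was trained with the self-supervised DINO method and can be used for image classification and as a feature backbone. A ViT splits an image into fixed-size patches (8 x 8 pixels here, per the model name) and processes the resulting sequence with the Transformer architecture used in natural language processing.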
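To make the patch splitting concrete, the sketch below computes how many tokens a square input produces for a given patch size. The helper name `count_patch_tokens` is hypothetical, and the single extra class token is an assumption based on the standard ViT design rather than something read from this checkpoint.

```python
def count_patch_tokens(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Return the sequence length a ViT sees for a square image.

    extra_tokens=1 assumes one class token is prepended, as in the
    standard ViT design (an assumption, not read from the checkpoint).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


# 224 / 8 = 28 patches per side, so 28 * 28 = 784 patches plus 1 class token
print(count_patch_tokens(224, 8))   # 785
# For comparison, 16 x 16 patches give 14 * 14 = 196 patches plus 1 class token
print(count_patch_tokens(224, 16))  # 197
```

Under these assumptions, the patch-8 model processes about four times as many tokens as a patch-16 model at the same resolution, which raises the compute cost per image.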

Model Details

This model is mainly used for image classification and as a feature extraction backbone. Key statistics:

Parameters (M): 21.7
GMACs: 16.8
Activations (M): 32.9
Image size: 224 x 224

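The model builds on two papers:

Emerging Properties in Self-Supervised Vision Transformers: studies the properties that emerge in self-supervised Vision Transformers.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: applies Transformers to large-scale image recognition.

The model was pretrained on ImageNet-1k. The original code is hosted in the DINO GitHub repository.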

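Model Usage

Image Classification

To classify an image with vit_small_patch8_224.dino, import the required libraries and load the pretrained model. The code below preprocesses a single image, runs it through the model, and returns the top five predicted classes with their probabilities.

Example code: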

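from urllib.request import urlopen
from PIL import Image
import timm
import torch

# Load the input image; replace the placeholder with a real image URL
img = Image.open(urlopen('URL_TO_IMAGE'))

# Load the pretrained weights and switch to inference mode
model = timm.create_model('vit_small_patch8_224.dino', pretrained=True)
model = model.eval()

# Build the preprocessing pipeline that matches the model's data config
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# unsqueeze(0) adds a batch dimension so the input is a batch of one image
output = model(transforms(img).unsqueeze(0))

# Convert the outputs to percentages and keep the five highest
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)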
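
To show what the final softmax and top-k step computes, here is a dependency-free sketch on a toy list of scores. The function `softmax_topk` and the sample scores are illustrative assumptions; the example code above works on a batched torch tensor instead.

```python
import math


def softmax_topk(logits, k=5):
    """Return (percentages, indices) of the k largest softmax probabilities."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    percents = [100.0 * e / total for e in exps]
    # Sort indices by probability, highest first, and keep the first k.
    order = sorted(range(len(percents)), key=lambda i: percents[i], reverse=True)[:k]
    return [percents[i] for i in order], order


# Toy scores for six hypothetical classes (not real model output).
toy_logits = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]
top_percents, top_indices = softmax_topk(toy_logits, k=5)
print(top_indices)  # [3, 1, 5, 4, 0]
```

The returned percentages are in descending order, and across all classes they sum to 100, which is why the values can be read directly as per-class confidence.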

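Image Embeddings

The model can also produce image embeddings: feature vectors that can be used for downstream tasks such as clustering or image retrieval.

Example code: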

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen('URL_TO_IMAGE'))

model = timm.create_model(
    'vit_small_patch8_224.dino',
    pretrained=True,
    num_classes=0,  # remove the classifier head
)
model = model.eval()

data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))
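
As one way embeddings might be used for retrieval, the sketch below ranks a small gallery by cosine similarity to a query vector. The helper names and the 3-dimensional toy vectors are hypothetical; in practice each vector would be the model output for one image converted to a list of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors standing in for real embeddings (hypothetical data).
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "dog_1": [0.1, 0.9, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]
print(rank_by_similarity(query, gallery))  # ['cat_1', 'dog_1', 'car_1']
```

Cosine similarity ignores vector length and compares direction only, which is a common choice for comparing learned features; other distance measures could be substituted.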

Model Comparison

The timm model results page lists this model's performance alongside that of other models.

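Citation

The development of vit_small_patch8_224.dino drew on several academic papers that are well regarded in the field, covering self-supervised Vision Transformers and large-scale image recognition. Their citation information offers a way into current work in this area.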