mobilevitv2-1.0-imagenet1k-256 - Efficient image classification with separable self-attention in MobileViTv2

Project overview: MobileViTv2-1.0-ImageNet1k-256

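Background

MobileViTv2 is the second version of MobileViT. It was proposed by Sachin Mehta and Mohammad Rastegari in the paper "Separable Self-attention for Mobile Vision Transformers" and first released in Apple's ml-cvnets repository. The project is distributed under the Apple sample code license.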

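Model overview

MobileViTv2 introduces separable self-attention to replace the multi-head self-attention used in MobileViT. The design aims to keep the model efficient while improving its performance on image recognition tasks.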
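
To make the idea concrete, here is a minimal pure-Python sketch of a separable self-attention layer, based on a reading of the paper's description rather than on the ml-cvnets code. The function and variable names (separable_self_attention, w_i, w_k, w_v, w_o, random_matrix) are illustrative assumptions, and the sketch leaves out biases, batching, and the patch unfolding used in the real model. Each token gets a single scalar score, the scores weight a sum of projected tokens into one context vector, and that vector modulates every ReLU-activated value token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(tokens, weight):
    """Apply a d_in x d_out weight matrix to every token (no bias)."""
    d_out = len(weight[0])
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(d_out)]
        for t in tokens
    ]


def separable_self_attention(tokens, w_i, w_k, w_v, w_o):
    """tokens: k x d list of lists; w_i: d x 1; w_k, w_v, w_o: d x d."""
    # Branch I: one scalar score per token, normalised over the k tokens.
    scores = softmax([row[0] for row in linear(tokens, w_i)])
    # Branch K: the score-weighted sum of projected tokens is one context vector.
    keys = linear(tokens, w_k)
    d = len(keys[0])
    context = [sum(s * key[j] for s, key in zip(scores, keys)) for j in range(d)]
    # Branch V: ReLU-activated projection, scaled element-wise by the context vector.
    values = [[max(0.0, v) for v in row] for row in linear(tokens, w_v)]
    mixed = [[c * v for c, v in zip(context, row)] for row in values]
    # Output projection back to d dimensions per token.
    return linear(mixed, w_o)


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights for the demo only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
k, d = 6, 4  # 6 tokens with 4 features each
tokens = random_matrix(k, d, rng)
out = separable_self_attention(
    tokens,
    random_matrix(d, 1, rng),
    random_matrix(d, d, rng),
    random_matrix(d, d, rng),
    random_matrix(d, d, rng),
)
print(len(out), len(out[0]))  # output keeps the k x d shape of the input
```

Note that no k x k attention matrix is ever built: the work grows linearly with the number of tokens, which is the efficiency argument behind the design.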

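Usage and applications

The MobileViTv2 model is designed for image classification. The raw model can be used without fine-tuning, for example to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes.

The following short example shows how to classify an image with this model: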

from transformers import MobileViTImageProcessor, MobileViTV2ForImageClassification
from PIL import Image
import requests

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing configuration and the pretrained classifier
feature_extractor = MobileViTImageProcessor.from_pretrained("shehan97/mobilevitv2-1.0-imagenet1k-256")
model = MobileViTV2ForImageClassification.from_pretrained("shehan97/mobilevitv2-1.0-imagenet1k-256")

# Preprocess the image and return PyTorch tensors
inputs = feature_extractor(images=image, return_tensors="pt")

# Run the forward pass; logits hold one score per ImageNet class
outputs = model(**inputs)
logits = outputs.logits

# Pick the highest-scoring class and map its index to a label
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

Currently, both the model and its feature extractor support PyTorch.
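
The classification example reports only the single best class. As a hedged follow-up, the sketch below turns a list of logits into softmax probabilities and returns the k most likely labels. The helper name top_k_probabilities and the toy logits and labels are hypothetical stand-ins so that the snippet runs without downloading the model.

```python
import math


def top_k_probabilities(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs from raw logits."""
    # Numerically stable softmax over all classes.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Hypothetical stand-in values; a real run would have 1,000 logits.
toy_logits = [2.0, 0.5, 3.0, -1.0]
toy_labels = {0: "tabby", 1: "remote control", 2: "Egyptian cat", 3: "sofa"}

for label, prob in top_k_probabilities(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, something like top_k_probabilities(logits[0].detach().tolist(), model.config.id2label) should produce the same kind of ranking, assuming logits has shape (1, 1000) as in the example.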

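Training data

The MobileViTv2 model was pretrained on the ImageNet-1k dataset, which contains roughly 1.28 million training images across 1,000 classes. A dataset of this scale and variety supports the model's generalization and classification accuracy.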

Citation

If you use MobileViTv2 in academic research, you can cite it with the following BibTeX entry:

@inproceedings{vision-transformer,
title = {Separable Self-attention for Mobile Vision Transformers},
author = {Sachin Mehta and Mohammad Rastegari},
year = {2022},
URL = {https://arxiv.org/abs/2206.02680}
}

This article has outlined the background, technical approach, and usage of MobileViTv2, a model aimed at efficient and accurate image processing on mobile devices.