Object Recognition as Next Token Prediction

arXiv | Colab | Documentation | Hugging Face

Our model's top-30 predictions and their probabilities for an image from "The Legend of Zelda: Tears of the Kingdom" ¹.

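Introduction

This is the official PyTorch implementation of the paper Object Recognition as Next Token Prediction, which was accepted at CVPR 2024 (Highlight).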

@inproceedings{nxtp,
  title     = {{Object Recognition as Next Token Prediction}},
  author    = {Kaiyu Yue and Bor-Chun Chen and Jonas Geiping and Hengduo Li and Tom Goldstein and Ser-Nam Lim},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2024}
}

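Updates

May 26, 2024

Added ImageNet experiments: see src/imagenet
Added visualization of decoder-layer attention maps during inference: see Examples

March 17, 2024

Released the best 1.78B model, trained on G70M
Added ONNX model export: docs/onnx-export

March 3, 2024

Added examples to this README showing the top-20 predictions
Added CLIP ViT-L/14 as a text embedding model for the evaluation metric (Table A.8 in the paper)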

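Method

This project delves into a fundamental problem in computer vision, object recognition: translating an image into object labels.

Linear models (such as ResNet) and contrastive models (such as CLIP) require the labels to be predefined before inference, which limits their flexibility in real-world applications.

We extend W, the set of label embeddings that such models match image features against, to the entire textual space by using the token embeddings of a language model, such as LLaMA's 32K token embeddings. Our model predicts labels auto-regressively, in a truly open manner.

In addition, our one-shot sampling technique enables efficient large-scale discriminative prediction, such as the top-100 labels.

The released models have 1.78B parameters. Truncating the model to 0.77B parameters, which leaves only one transformer block in the decoder, still gives competitive performance (Table 3 in the paper).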
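To make the idea of scoring an image feature against a vocabulary-wide W concrete, here is a minimal toy sketch. It is not the repository's decoder and not the one-shot sampling technique: it performs a single dot-product scoring step followed by a softmax, and every name and number in it (`top_k_labels`, the 2-D embeddings, the 5-token vocabulary) is a hypothetical chosen for illustration. In the real model a label can also span several tokens, which this sketch ignores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(image_feature, W, vocab, k):
    """Score every row of W against the image feature; return the k most probable tokens."""
    logits = [sum(w * f for w, f in zip(row, image_feature)) for row in W]
    probs = softmax(logits)
    ranked = sorted(zip(probs, vocab), reverse=True)
    return ranked[:k]


# Hypothetical 2-D token embeddings for a 5-token vocabulary (illustrative values only).
vocab = ["coffee", "shop", "rocket", "dog", "sky"]
W = [[2.0, 0.1], [1.5, 0.2], [-1.0, 2.0], [0.0, -1.0], [-0.5, 1.5]]
image_feature = [1.0, 0.0]  # hypothetical feature of a coffee-shop photo

for prob, label in top_k_labels(image_feature, W, vocab, k=3):
    print(f"| prob: {prob:.5f} - {label}")
```

Because W covers the whole vocabulary rather than a fixed label list, no label set has to be chosen before inference; the ranking simply runs over every token.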

Examples

Image and top-20 predictions	Attention map	Image and top-20 predictions	Attention map
<p align="left"><img width="256" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/60bcd385-72c4-4bf3-9e18-70be17d2f54a.jpg"><br/></p><details><summary>click to expand ¹</summary>`prob: 0.13949 - legend`<br/>`prob: 0.12399 - sky`<br/>`prob: 0.04723 - cloud`<br/>`prob: 0.04642 - game`<br/>`prob: 0.04500 - screenshot`<br/>`prob: 0.03189 - top`<br/>`prob: 0.03024 - mountain`<br/>`prob: 0.02262 - cliff`<br/>`prob: 0.01790 - world`<br/>`prob: 0.01483 - Wii`<br/>`prob: 0.01440 - video`<br/>`prob: 0.01310 - breath`<br/>`prob: 0.01087 - zeo`<br/>`prob: 0.00982 - zelda`<br/>`prob: 0.00959 - character`<br/>`prob: 0.00865 - rock`<br/>`prob: 0.00816 - link`<br/>`prob: 0.00788 - island`<br/>`prob: 0.00624 - adventure`<br/>`prob: 0.00591 - woman`</details>	<p align="left"><img width="164" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/5babdb5a-489b-4c77-82ca-289d23ca0d92.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 0: head 25`</details>	<p align="left"><img width="256" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/ce330477-7d4a-4913-9ddf-8fe6e258ed2b.jpg"></p> <details><summary>click to expand ²</summary>`prob: 0.23237 - rocket`<br/>`prob: 0.10435 - launch`<br/>`prob: 0.06144 - soyuz`<br/>`prob: 0.04314 - space`<br/>`prob: 0.03541 - smoke`<br/>`prob: 0.03249 - sky`<br/>`prob: 0.01971 - shuttle`<br/>`prob: 0.01566 - tower`<br/>`prob: 0.01551 - paris`<br/>`prob: 0.01229 - cloud`<br/>`prob: 0.01067 - launchpad`<br/>`prob: 0.01050 - cape`<br/>`prob: 0.00983 - falcon`<br/>`prob: 0.00956 - photo`<br/>`prob: 0.00834 - liftoff`<br/>`prob: 0.00814 - air`<br/>`prob: 0.00779 - mission`<br/>`prob: 0.00710 - station`<br/>`prob: 0.00688 - july`<br/>`prob: 0.00647 - satellite`</details>	<p align="left"><img width="164" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/61422ffa-ce2d-4972-9d37-69504825c2ba.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 0: head 0`</details>
<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/b5f3e4b0-2ef4-4fb4-9c8d-e5e440f84fd0.png"></p> <details><summary>click to expand ³</summary>`prob: 0.30731 - dog`<br/>`prob: 0.13647 - sweater`<br/>`prob: 0.11870 - hat`<br/>`prob: 0.06812 - scarf`<br/>`prob: 0.04131 - brick`<br/>`prob: 0.03114 - wall`<br/>`prob: 0.01796 - shirt`<br/>`prob: 0.01471 - cute`<br/>`prob: 0.01156 - cap`<br/>`prob: 0.00982 - neck`<br/>`prob: 0.00929 - top`<br/>`prob: 0.00797 - head`<br/>`prob: 0.00777 - beanie`<br/>`prob: 0.00658 - man`<br/>`prob: 0.00588 - sitting`<br/>`prob: 0.00582 - coat`<br/>`prob: 0.00524 - jacket`<br/>`prob: 0.00476 - collar`<br/>`prob: 0.00460 - face`<br/>`prob: 0.00119 - bone`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/08d0e93a-6c2b-48a9-b0d5-1afd8d29cef9.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 0: head 25`</details>	<p align="left"><img width="256" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/e13616f0-1a15-4098-a6e2-914fc14393f4.jpg"></p> <details><summary>click to expand ⁴</summary>`prob: 0.14861 - coffee`<br/>`prob: 0.10409 - shop`<br/>`prob: 0.08065 - counter`<br/>`prob: 0.04603 - bar`<br/>`prob: 0.04055 - restaurant`<br/>`prob: 0.03691 - inside`<br/>`prob: 0.03468 - area`<br/>`prob: 0.02638 - store`<br/>`prob: 0.02219 - table`<br/>`prob: 0.01930 - interior`<br/>`prob: 0.01347 - many`<br/>`prob: 0.01156 - food`<br/>`prob: 0.01058 - customer`<br/>`prob: 0.01001 - room`<br/>`prob: 0.00923 - starbucks`<br/>`prob: 0.00853 - bakery`<br/>`prob: 0.00738 - view`<br/>`prob: 0.00738 - floor`<br/>`prob: 0.00733 - cafe`<br/>`prob: 0.00633 - shelf`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/a5957c1b-0135-420f-bb53-f399b1c663b1.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 0: head 8`</details>
<p align="left"><img width="256" src="https://yellow-cdn.veclightyear.com/835a84d5/86a579ff-1c8a-4f05-af1e-f507a72e3df8.png"></p> <details><summary>click to expand ³</summary>`prob: 0.47652 - monster`<br/>`prob: 0.09664 - cartoon`<br/>`prob: 0.03812 - character`<br/>`prob: 0.03724 - group`<br/>`prob: 0.03312 - creature`<br/>`prob: 0.02111 - cute`<br/>`prob: 0.01929 - vector`<br/>`prob: 0.01481 - animal`<br/>`prob: 0.00955 - art`<br/>`prob: 0.00924 - alien`<br/>`prob: 0.00837 - pose`<br/>`prob: 0.00604 - bubble`<br/>`prob: 0.00553 - eye`<br/>`prob: 0.00533 - color`<br/>`prob: 0.00528 - hand`<br/>`prob: 0.00477 - design`<br/>`prob: 0.00474 - wallpaper`<br/>`prob: 0.00462 - child`<br/>`prob: 0.00445 - people`<br/>`prob: 0.00445 - family`</details>	<p align="left"><img width="164" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/739f55a9-760b-48fb-86e4-61e3aafccc92.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 2: head 7`</details>	<p align="left"><img width="256" src="https://yellow-cdn.veclightyear.com/835a84d5/ed07b0d4-3adb-47a3-8954-54ed6eb31a52.png"></p> <details><summary>click to expand ³</summary>`prob: 0.54375 - cloud`<br/>`prob: 0.09932 - word`<br/>`prob: 0.07571 - sky`<br/>`prob: 0.03153 - letter`<br/>`prob: 0.01862 - sora`<br/>`prob: 0.01380 - logo`<br/>`prob: 0.00995 - text`<br/>`prob: 0.00715 - top`<br/>`prob: 0.00715 - blue`<br/>`prob: 0.00677 - title`<br/>`prob: 0.00608 - photo`<br/>`prob: 0.00427 - picture`<br/>`prob: 0.00288 - sonora`<br/>`prob: 0.00269 - middle`<br/>`prob: 0.00257 - storm`<br/>`prob: 0.00202 - cloudscape`<br/>`prob: 0.00190 - sun`<br/>`prob: 0.00189 - art`<br/>`prob: 0.00156 - soar`<br/>`prob: 0.00041 - icy`</details>	<p align="left"><img width="164" height="164" src="https://yellow-cdn.veclightyear.com/835a84d5/a7e0c8ef-ede3-4bf5-9a4d-d66231b0deaf.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 1: head 13`</details>
<p align="left"><img width="256" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/8f241a06-a81c-4665-8f06-581d3cc4373a.png"></p> <details><summary>click to expand ³</summary>`prob: 0.15317 - building`<br/>`prob: 0.13619 - wave`<br/>`prob: 0.04782 - room`<br/>`prob: 0.03498 - middle`<br/>`prob: 0.03188 - hall`<br/>`prob: 0.02367 - crowd`<br/>`prob: 0.02135 - ocean`<br/>`prob: 0.02087 - floor`<br/>`prob: 0.01867 - world`<br/>`prob: 0.01773 - inside`<br/>`prob: 0.01548 - man`<br/>`prob: 0.01380 - water`<br/>`prob: 0.01205 - view`<br/>`prob: 0.01200 - surfer`<br/>`prob: 0.01109 - photo`<br/>`prob: 0.00798 - hotel`<br/>`prob: 0.00734 - city`<br/>`prob: 0.00662 - pool`<br/>`prob: 0.00566 - art`<br/>`prob: 0.00319 - mural`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/a89d95aa-6e70-4b25-acbb-41ae031d0214.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 1: head 16`</details>	<p align="left"><img height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/678756c7-2cb4-421a-bc03-dc177f65fd44.png"></p> <details><summary>click to expand ³</summary>`prob: 0.25673 - bird`<br/>`prob: 0.21676 - feather`<br/>`prob: 0.18550 - peacock`<br/>`prob: 0.04251 - head`<br/>`prob: 0.03240 - blue`<br/>`prob: 0.02507 - pigeon`<br/>`prob: 0.02183 - tail`<br/>`prob: 0.01339 - hair`<br/>`prob: 0.01187 - top`<br/>`prob: 0.00677 - face`<br/>`prob: 0.00631 - camera`<br/>`prob: 0.00463 - beak`<br/>`prob: 0.00451 - eye`<br/>`prob: 0.00419 - fence`<br/>`prob: 0.00370 - sitting`<br/>`prob: 0.00333 - perched`<br/>`prob: 0.00330 - photo`<br/>`prob: 0.00318 - wall`<br/>`prob: 0.00269 - animal`<br/>`prob: 0.00106 - jay`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/46a99a52-68f7-4104-aafc-ef4f58bc7865.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 1: head 25`</details>
<p align="left"><img width="256" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/4e3b3eda-e273-482e-87b2-593b1d2adc21.jpg"></p> <details><summary>click to expand ⁵</summary>`prob: 0.07247 - tablet`<br/>`prob: 0.06770 - coffee`<br/>`prob: 0.06562 - window`<br/>`prob: 0.05829 - controller`<br/>`prob: 0.05668 - game`<br/>`prob: 0.04802 - switch`<br/>`prob: 0.04043 - Wii`<br/>`prob: 0.03798 - console`<br/>`prob: 0.03563 - cup`<br/>`prob: 0.02570 - top`<br/>`prob: 0.02067 - mug`<br/>`prob: 0.01808 - screen`<br/>`prob: 0.01344 - video`<br/>`prob: 0.01105 - star`<br/>`prob: 0.01092 - nintendo`<br/>`prob: 0.01055 - computer`<br/>`prob: 0.00819 - mario`<br/>`prob: 0.00815 - remote`<br/>`prob: 0.00736 - control`<br/>`prob: 0.00393 - windowsill`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/883f6cda-53ad-4c32-b45e-f1c270a008cb.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 0: head 12`</details>	<p align="left"><img width="256" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/5754728a-1bcd-4e96-921e-99afa92dd394.jpg"></p> <details><summary>click to expand ⁶</summary>`prob: 0.36523 - airplane`<br/>`prob: 0.09151 - cargo`<br/>`prob: 0.07531 - plane`<br/>`prob: 0.05538 - ship`<br/>`prob: 0.04223 - container`<br/>`prob: 0.03105 - water`<br/>`prob: 0.03040 - view`<br/>`prob: 0.02277 - dock`<br/>`prob: 0.01685 - port`<br/>`prob: 0.01434 - sky`<br/>`prob: 0.01328 - shipping`<br/>`prob: 0.00788 - middle`<br/>`prob: 0.00751 - fuselage`<br/>`prob: 0.00717 - photo`<br/>`prob: 0.00715 - jet`<br/>`prob: 0.00714 - city`<br/>`prob: 0.00621 - ocean`<br/>`prob: 0.00615 - freight`<br/>`prob: 0.00609 - boat`<br/>`prob: 0.00320 - transport`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/e0771041-f0b1-4f3c-ab16-debd6ab7a6ac.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 2: head 14`</details>
<p align="left"><img height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/7a2e986c-8ee4-4b82-b5c9-74ba309bd5ee.jpg"></p> <details><summary>click to expand ⁶</summary>`prob: 0.15236 - candy`<br/>`prob: 0.12271 - sweater`<br/>`prob: 0.11457 - glasses`<br/>`prob: 0.10593 - dog`<br/>`prob: 0.08311 - chair`<br/>`prob: 0.07111 - cane`<br/>`prob: 0.04701 - sunglasses`<br/>`prob: 0.04589 - christmas`<br/>`prob: 0.02361 - costume`<br/>`prob: 0.02085 - wearing`<br/>`prob: 0.01870 - hat`<br/>`prob: 0.00734 - head`<br/>`prob: 0.00636 - top`<br/>`prob: 0.00577 - outfit`<br/>`prob: 0.00520 - chocolate`<br/>`prob: 0.00437 - holly`<br/>`prob: 0.00362 - suit`<br/>`prob: 0.00344 - shirt`<br/>`prob: 0.00322 - strawberry`<br/>`prob: 0.00211 - wig`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/eae0c03d-d137-4662-b434-52f08b97676e.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 1: head 16`</details>	<p align="left"><img width="256" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/df98118b-40da-4a62-b695-dfcb26d80770.jpg"></p> <details><summary>click to expand ⁶</summary>`prob: 0.19960 - living`<br/>`prob: 0.16291 - room`<br/>`prob: 0.11353 - sofa`<br/>`prob: 0.06036 - couch`<br/>`prob: 0.04741 - rug`<br/>`prob: 0.04704 - coffee`<br/>`prob: 0.03795 - dog`<br/>`prob: 0.03659 - wall`<br/>`prob: 0.02980 - table`<br/>`prob: 0.01611 - floor`<br/>`prob: 0.01594 - gray`<br/>`prob: 0.01472 - wood`<br/>`prob: 0.01353 - furniture`<br/>`prob: 0.01314 - plant`<br/>`prob: 0.01274 - fireplace`<br/>`prob: 0.01161 - pillow`<br/>`prob: 0.00941 - chair`<br/>`prob: 0.00512 - home`<br/>`prob: 0.00434 - blanket`<br/>`prob: 0.00351 - art`</details>	<p align="left"><img width="196" height="196" src="https://yellow-cdn.veclightyear.com/835a84d5/f07d37f9-df62-4667-8269-7e1bec967181.png"><br/></p><details><summary>attention map info</summary>`decoder: layer 1: head 16`</details>

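Models

The table below shows the reproduced recall results on the validation splits, computed from the top-10 predictions (column R in Table 1 of the paper).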

<table> <tbody> <tr> <th valign="bottom"># params</th> <th valign="bottom">training group</th> <th valign="bottom">checkpoint</th> <th valign="bottom">md5</th> <th valign="bottom">CC3M</th> <th valign="bottom">COCO</th> <th valign="bottom">OpenImages</th> </tr> <tr> <td align="center">1.78B</td> <td align="center">G3M</td> <td align="center"><a href="https://huggingface.co/kaiyuyue/nxtp/blob/main/ckpt_epoch_03_iter_0021360.pth">Hugging Face</a></td> <td align="center"><tt>b2a69b</tt></td> <td align="center">0.740</td> <td align="center">0.703</td> <td align="center">0.616</td> </tr> <tr> <td align="center">1.78B</td> <td align="center">G70M</td> <td align="center"><a href="https://huggingface.co/kaiyuyue/nxtp/blob/main/ckpt_epoch_03_iter_1656549.pth">Hugging Face</a></td> <td align="center"><tt>e177c7</tt></td> <td align="center">0.721</td> <td align="center">0.765</td> <td align="center">0.662</td> </tr> </tbody> </table>
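As a rough illustration of what a recall over the top-10 predictions measures, the sketch below uses a deliberately simplified exact-match version. This is an assumption made for illustration only: the paper's metric relies on a text embedding model to compare labels rather than on exact string matches, and the function name `recall_at_k` and the reference labels are hypothetical.

```python
def recall_at_k(predictions, references, k=10):
    """Fraction of reference labels found among the top-k predicted labels (exact match)."""
    if not references:
        return 0.0
    top_k = set(predictions[:k])
    hits = sum(1 for label in references if label in top_k)
    return hits / len(references)


# Ranked predictions (most probable first) and hypothetical reference labels.
preds = ["coffee", "shop", "counter", "interior", "bar", "restaurant",
         "table", "store", "area", "inside", "starbucks", "cafe"]
refs = ["coffee", "counter", "cafe", "chair"]

print(recall_at_k(preds, refs, k=10))
```

Only labels ranked within the first k positions count, so a correct label at rank 12 does not contribute when k is 10.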

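Downloading

Checkpoints can be downloaded from the links in the table above. To download from Hugging Face, one option is to use git-lfs:

# install git lfs
git lfs install

# download the checkpoints in a terminal
git clone https://huggingface.co/kaiyuyue/nxtp

Alternatively, the checkpoints can be downloaded from the model page in a web browser.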
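After downloading, it may be worth checking a checkpoint against the md5 column of the table. The sketch below assumes that the six-character values listed there are the leading hex characters of the file's MD5 digest, which the README does not state explicitly; the helper names and the local path are hypothetical.

```python
import hashlib
import os


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """True if the file's MD5 digest starts with the given short prefix."""
    return md5_hex(path).startswith(expected_prefix.lower())


if __name__ == "__main__":
    # Hypothetical location: the repository cloned into ./nxtp by the git command above.
    ckpt = "nxtp/ckpt_epoch_03_iter_0021360.pth"
    if os.path.exists(ckpt):
        print("md5 prefix matches:", matches_prefix(ckpt, "b2a69b"))
    else:
        print("checkpoint not found at", ckpt)
```

Reading in chunks matters here because the checkpoints are large; loading a multi-gigabyte file into memory just to hash it would be wasteful.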

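Inference

An image, assets/starbux.jpg, is provided for a quick test. First, prepare the environment by following the instructions in Dependencies.

To run inference on an image: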

python src/infer.py \
  --ckpt-path path/to/model/checkpoint \
  --img-path assets/starbux.jpg \
  --num-labels 20

The output from the model trained on G3M will be

top-20 predictions:
| prob: 0.05742 - coffee
| prob: 0.05525 - restaurant
| prob: 0.04402 - shop
| prob: 0.02528 - room
| prob: 0.02468 - store
| prob: 0.02381 - interior
| prob: 0.01732 - area
| prob: 0.01640 - building
| prob: 0.01616 - food
| prob: 0.01408 - bar
| prob: 0.01247 - customer
| prob: 0.01134 - view
| prob: 0.01059 - floor
| prob: 0.01045 - table
| prob: 0.00933 - kitchen
| prob: 0.00926 - home
| prob: 0.00872 - look
| prob: 0.00841 - people
| prob: 0.00693 - cup
| prob: 0.00665 - counter

The output from the model trained on G70M will be

top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
| prob: 0.03848 - interior
| prob: 0.03389 - bar
| prob: 0.03215 - restaurant
| prob: 0.02440 - table
| prob: 0.02245 - store
| prob: 0.01950 - area
| prob: 0.01905 - inside
| prob: 0.01590 - starbucks
| prob: 0.01313 - cafe
| prob: 0.01220 - chair
| prob: 0.01172 - floor
| prob: 0.01020 - cup
| prob: 0.00879 - drink
| prob: 0.00794 - room
| prob: 0.00746 - customer
| prob: 0.00635 - wood
| prob: 0.00345 - bakery
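For downstream use, the printed predictions can be turned into structured data. The following is a minimal sketch, assuming each prediction line has the shape `| prob: 0.15203 - coffee`; the function name `parse_predictions` is hypothetical and not part of the repository, and the regular expression would need adjusting if the script's output format differs.

```python
import re

# Assumed line shape: "| prob: 0.15203 - coffee" (a pipe, a probability, a dash, a label).
LINE_RE = re.compile(r"^\|\s*prob:\s*([0-9]*\.?[0-9]+)\s*-\s*(.+?)\s*$")


def parse_predictions(text):
    """Extract (label, probability) pairs from the printed output, preserving order."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            pairs.append((match.group(2), float(match.group(1))))
    return pairs


sample = """top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
"""

for label, prob in parse_predictions(sample):
    print(label, prob)
```

Lines that do not match the assumed shape, such as the `top-20 predictions:` header, are skipped rather than raising an error.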

License

This project is released under the CC-BY-NC 4.0 license. See LICENSE for details.

¹ Image source: <a href="https://www.nintendo.com/jp/zelda/totk/index.html">The Legend of Zelda: Tears of the Kingdom</a>.
² Image source: <a href="https://www.spacex.com/vehicles/falcon-9/">Space-X</a>.
³ Image source: <a href="https://openai.com/sora">OpenAI Sora</a>.
⁴ Image source: a photo taken by the author inside a Starbucks store.
⁵ Image source: <a href="https://www.instagram.com/p/C027tvEhz7J/">Super Mario Bros. Wonder</a>.
⁶ Image source: <a href="https://segment-anything.com/demo">Segment Anything Demo | Meta AI</a>.