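Compact and Mighty - Image Tokenization with Only 32 Tokens for Both Reconstruction and Generation!

We present a compact 1D tokenizer that can represent an image with as few as 32 discrete tokens. As a result, it substantially speeds up sampling (e.g., 410× faster than DiT-XL/2) while maintaining competitive generation quality.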

[Figure: teaser]

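Updates

2024/08/09: Improved support for loading pretrained weights from Hugging Face models; thanks to @NielsRogge for the help!
2024/07/03: Released the evaluation scripts for reproducing the results reported in the paper; checkpoints for TiTok-B-64 and TiTok-S-128 are available.
2024/06/21: Released the demo code and the TiTok-L-32 checkpoint.
2024/06/11: The technical report for this project is available.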

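🚀 Contributions

We introduce a novel 1D image tokenization framework that breaks the grid constraints of 2D tokenization methods, enabling a more flexible and compact latent representation of images.

The proposed 1D tokenizer can tokenize a 256 × 256 image into as few as 32 discrete tokens, which significantly speeds up generation (hundreds of times faster than diffusion models) while maintaining state-of-the-art generation quality.

We conduct a series of experiments to explore the rarely studied properties of 1D image tokenization, paving the way toward compact latent spaces for efficient and effective image representation.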
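
To put the 32-token figure in perspective, the sketch below compares it with the token count of a conventional 2D grid tokenizer on the same 256 × 256 input. The downsampling factor of 16 and the codebook size of 4096 are illustrative assumptions, not values stated in this README, and the helper names are hypothetical.

```python
import math


def grid_token_count(image_size: int, downsample_factor: int) -> int:
    """Number of tokens a 2D tokenizer emits: one per cell of the latent grid."""
    side = image_size // downsample_factor
    return side * side


def bits_per_image(num_tokens: int, codebook_size: int) -> float:
    """Bits needed to store the token indices of one image."""
    return num_tokens * math.log2(codebook_size)


IMAGE_SIZE = 256
DOWNSAMPLE_FACTOR = 16  # assumption: a common setting for 2D tokenizers
CODEBOOK_SIZE = 4096    # assumption: hypothetical codebook size

tokens_2d = grid_token_count(IMAGE_SIZE, DOWNSAMPLE_FACTOR)
tokens_1d = 32  # from the text: 32 discrete tokens per 256 x 256 image

print(f"2D grid tokens: {tokens_2d}")
print(f"1D tokens: {tokens_1d}")
print(f"sequence length reduction: {tokens_2d / tokens_1d:.0f}x")
print(f"bits per image (1D): {bits_per_image(tokens_1d, CODEBOOK_SIZE):.0f}")
```

Under these assumptions the generator has to predict a sequence that is several times shorter, which is one plausible reason why sampling becomes cheaper.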

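Model Zoo

Dataset	Model	Link	FID
ImageNet	TiTok-L-32 tokenizer	checkpoint	2.21 (reconstruction)
ImageNet	TiTok-B-64 tokenizer	checkpoint	1.70 (reconstruction)
ImageNet	TiTok-S-128 tokenizer	checkpoint	1.71 (reconstruction)
ImageNet	TiTok-L-32 generator	checkpoint	2.77 (generation)
ImageNet	TiTok-B-64 generator	checkpoint	2.48 (generation)
ImageNet	TiTok-S-128 generator	checkpoint	1.97 (generation)

Please note that these models were trained only on the limited academic dataset ImageNet and are intended for research purposes only.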

Installation

pip3 install -r requirements.txt
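
After installing, a quick sanity check can confirm that the modules imported by the quick-start snippet are visible to the interpreter. This is a minimal sketch: the module list is taken from the imports in this README rather than from requirements.txt, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Top-level module names imported by the quick-start snippet in this README.
REQUIRED_MODULES = ["torch", "numpy", "PIL", "huggingface_hub"]


def missing_modules(names):
    """Return the top-level module names that the interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```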

Quick Start

import torch
from PIL import Image
import numpy as np
import demo_util
from huggingface_hub import hf_hub_download
from modeling.maskgit import ImageBert
from modeling.titok import TiTok

titok_tokenizer = TiTok.from_pretrained("yucornetto/tokenizer_titok_l32_imagenet")
titok_tokenizer.eval()
titok_tokenizer.requires_grad_(False)
titok_generator = ImageBert.from_pretrained("yucornetto/generator_titok_l32_imagenet")
titok_generator.eval()
titok_generator.requires_grad_(False)

# Alternatively, download the weights from Hugging Face
# hf_hub_download(repo_id="fun-research/TiTok", filename="tokenizer_titok_l32.bin", local_dir="./")
# hf_hub_download(repo_id="fun-research/TiTok", filename="generator_titok_l32.bin", local_dir="./")

# and load them through the config
# config = demo_util.get_config("configs/titok_l32.yaml")
# titok_tokenizer = demo_util.get_titok_tokenizer(config)
# titok_generator = demo_util.get_titok_generator(config)

device = "cuda"
titok_tokenizer = titok_tokenizer.to(device)
titok_generator = titok_generator.to(device)
# Reconstruct an image, i.e., image -> 32 tokens -> image
img_path = "assets/ILSVRC2012_val_00010240.png"
image = torch.from_numpy(np.array(Image.open(img_path)).astype(np.float32)).permute(2, 0, 1).unsqueeze(0) / 255.0
# Tokenization
encoded_tokens = titok_tokenizer.encode(image.to(device))[1]["min_encoding_indices"]
# The image assets/ILSVRC2012_val_00010240.png is encoded into tokens tensor([[[ 887, 3979,  349,  720, 2809, 2743, 2101,  603, 2205, 1508, 1891, 4015, 1317, 2956, 3774, 2296,  484, 2612, 3472, 2330, 3140, 3113, 1056, 3779,  654, 2360, 1901, 2908, 2169,  953, 1326, 2598]]], device='cuda:0'), with shape torch.Size([1, 1, 32])
print(f"image {img_path} is encoded into tokens {encoded_tokens}, with shape {encoded_tokens.shape}")
# De-tokenization
reconstructed_image = titok_tokenizer.decode_tokens(encoded_tokens)
reconstructed_image = torch.clamp(reconstructed_image, 0.0, 1.0)
reconstructed_image = (reconstructed_image * 255.0).permute(0, 2, 3, 1).to("cpu", dtype=torch.uint8).numpy()[0]
# Image.save() returns None, so its result is not assigned back to the variable
Image.fromarray(reconstructed_image).save("assets/ILSVRC2012_val_00010240_recon.png")

# Generate an image
sample_labels = [torch.randint(0, 1000, size=(1,)).item()]  # random ImageNet-1K class; the upper bound is exclusive
generated_image = demo_util.sample_fn(
    generator=titok_generator,
    tokenizer=titok_tokenizer,
    labels=sample_labels,
    guidance_scale=4.5,
    randomize_temperature=1.0,
    num_sample_steps=8,
    device=device
)
Image.fromarray(generated_image[0]).save(f"assets/generated_{sample_labels[0]}.png")
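
The layout conversions in the snippet above (H × W × 3 uint8 image to 1 × 3 × H × W float input, and back) can be checked in isolation without loading any model. The following is a minimal NumPy sketch; `to_model_input` and `to_uint8_image` are hypothetical helper names, not part of this repository, and the sketch rounds to the nearest integer when converting back, whereas the snippet above casts directly to uint8.

```python
import numpy as np


def to_model_input(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 uint8 image to a 1 x 3 x H x W float32 array in [0, 1]."""
    if image_hwc_uint8.ndim != 3 or image_hwc_uint8.shape[2] != 3:
        raise ValueError("expected an RGB image with shape (H, W, 3)")
    chw = image_hwc_uint8.astype(np.float32).transpose(2, 0, 1)
    return chw[np.newaxis] / 255.0


def to_uint8_image(batch_chw_float: np.ndarray) -> np.ndarray:
    """Convert a 1 x 3 x H x W float array back to an H x W x 3 uint8 image."""
    clipped = np.clip(batch_chw_float, 0.0, 1.0)
    # Round to the nearest integer so 8-bit values survive the round trip.
    scaled = np.rint(clipped * 255.0).astype(np.uint8)
    return scaled.transpose(0, 2, 3, 1)[0]


# A random stand-in for a 256 x 256 RGB image.
rng = np.random.default_rng(0)
dummy = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

batch = to_model_input(dummy)
restored = to_uint8_image(batch)
print(batch.shape, batch.dtype)
print("round trip preserved the image:", np.array_equal(dummy, restored))
```

The shape check also makes explicit that a grayscale or RGBA file would need to be converted to RGB before it is passed to the tokenizer.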

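We also provide a Jupyter notebook as a quick tutorial on reconstructing and generating images with TiTok-L-32.

We also support a Hugging Face 🤗 demo for TiTok!

Testing on the ImageNet-1K Benchmark

We provide a sampling script for reproducing the generation results on the ImageNet-1K benchmark.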

# Prepare the ADM evaluation script
git clone https://github.com/openai/guided-diffusion.git

wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz

# Reproduce TiTok-L-32
torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet.py config=configs/titok_l32.yaml experiment.output_dir="titok_l_32"
# Run the evaluation script. The FID should be about 2.77
python3 guided-diffusion/evaluations/evaluator.py VIRTUAL_imagenet256_labeled.npz titok_l_32.npz

# Reproduce TiTok-B-64
torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet.py config=configs/titok_b64.yaml experiment.output_dir="titok_b_64"
# Run the evaluation script. The FID should be about 2.48
python3 guided-diffusion/evaluations/evaluator.py VIRTUAL_imagenet256_labeled.npz titok_b_64.npz

# Reproduce TiTok-S-128
torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet.py config=configs/titok_s128.yaml experiment.output_dir="titok_s_128"
# Run the evaluation script. The FID should be about 1.97
python3 guided-diffusion/evaluations/evaluator.py VIRTUAL_imagenet256_labeled.npz titok_s_128.npz
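
The three runs above differ only in the config name, the output directory, and the expected FID, so they can be generated from a small table. The sketch below only builds and prints the command lines and does not execute anything; `build_commands` and `RUNS` are hypothetical names introduced for illustration, while the paths and arguments mirror the commands above.

```python
import shlex


def build_commands(config_name: str, output_dir: str, nproc_per_node: int = 8):
    """Return the sampling and evaluation commands as argument lists."""
    sample_cmd = [
        "torchrun",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv-endpoint=localhost:9999",
        "sample_imagenet.py",
        f"config=configs/{config_name}.yaml",
        f"experiment.output_dir={output_dir}",
    ]
    eval_cmd = [
        "python3",
        "guided-diffusion/evaluations/evaluator.py",
        "VIRTUAL_imagenet256_labeled.npz",
        f"{output_dir}.npz",
    ]
    return sample_cmd, eval_cmd


# (config name, output directory, generation FID listed in this README)
RUNS = [
    ("titok_l32", "titok_l_32", 2.77),
    ("titok_b64", "titok_b_64", 2.48),
    ("titok_s128", "titok_s_128", 1.97),
]

for config_name, output_dir, expected_fid in RUNS:
    sample_cmd, eval_cmd = build_commands(config_name, output_dir)
    print(shlex.join(sample_cmd))
    print(shlex.join(eval_cmd), f"  # FID should be about {expected_fid}")
```

Passing the resulting lists to `subprocess.run` would be one way to execute them; the default of 8 processes per node matches the commands above and may need adjusting for machines with fewer GPUs.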

Visualizations

[Figure: visualization teaser]

Citing

If you use our work in your research, please use the following BibTeX entry.

@article{yu2024an,
  author    = {Qihang Yu and Mark Weber and Xueqing Deng and Xiaohui Shen and Daniel Cremers and Liang-Chieh Chen},
  title     = {An Image is Worth 32 Tokens for Reconstruction and Generation},
  journal   = {arxiv: 2406.07550},
  year      = {2024}
}