混合潜在扩散 [SIGGRAPH 2023]

混合潜在扩散

Omri Avrahami, Ohad Fried, Dani Lischinski

摘要：神经图像生成的巨大进步，加上看似无所不能的视觉语言模型的出现，终于实现了用于创建和编辑图像的基于文本的界面。处理通用图像需要一个多样化的底层生成模型，因此最新的工作利用了扩散模型，这些模型在多样性方面已被证明超越了GANs。然而，扩散模型的一个主要缺点是推理时间相对较慢。在本文中，我们为局部文本驱动的通用图像编辑任务提出了一个加速解决方案，其中所需的编辑局限于用户提供的蒙版内。我们的解决方案利用了最近的文本到图像潜在扩散模型（LDM），该模型通过在低维潜在空间中操作来加速扩散。我们首先通过将混合扩散融入其中，将LDM转换为局部图像编辑器。接下来，我们提出了一个基于优化的解决方案，以解决这个LDM无法准确重建图像的固有问题。最后，我们解决了使用细蒙版进行局部编辑的场景。我们通过定性和定量方法评估了我们的方法与现有基线的比较，并证明除了更快之外，我们的方法还实现了比基线更好的精度，同时缓解了一些基线的伪影。

应用

背景编辑

文本生成

多重预测

改变现有对象

添加新对象

涂鸦编辑

安装

安装conda虚拟环境：

$ conda env create -f environment.yaml
$ conda activate ldm

使用方法

新功能 :fire: - Stable Diffusion 实现

你可以使用基于Diffusers库的较新Stable Diffusion实现。为此，你需要通过以下命令安装PyTorch 2.1和Diffusers：

$ conda install pytorch==2.1.0 torchvision==0.16.0  pytorch-cuda=11.8 -c pytorch -c nvidia
$ pip install -U diffusers==0.19.3

要使用Stable Diffusion XL（需要更强大的GPU），请使用以下脚本：

$ python scripts/text_editing_SDXL.py --prompt "a stone" --init_image "inputs/img.png" --mask "inputs/mask.png"

你可以使用更小的--batch_size以节省GPU内存。

要使用Stable Diffusion v2.1，请使用以下脚本：

$ python scripts/text_editing_SD2.py --prompt "a stone" --init_image "inputs/img.png" --mask "inputs/mask.png"

旧版 - 潜在扩散模型实现

要使用基于潜在扩散模型（LDM）的旧实现，你首先需要下载预训练权重（5.7GB）：

$ mkdir -p models/ldm/text2img-large/
$ wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt

如果上述链接无法使用，你可以使用这个镜像链接。

然后，编辑图像可能需要两个步骤：

步骤 1 - 生成初始预测

$ python scripts/text_editing_LDM.py --prompt "a pink yarn ball" --init_image "inputs/img.png" --mask "inputs/mask.png"

预测结果将保存在outputs/edit_results/samples中。

你可以通过指定--n_samples为使你的GPU饱和的最大数值来使用更大的批量大小。

步骤 2（可选） - 重建原始背景

如果你想重建原始图像背景，可以运行以下命令：

$ python scripts/reconstruct.py --init_image "inputs/img.png" --mask "inputs/mask.png" --selected_indices 0 1

你可以选择要重建的特定图像索引。结果将保存在outputs/edit_results/samples/reconstructed_optimization中。

引用

如果你发现这个项目对你的研究有用，请引用以下内容：

@article{avrahami2023blendedlatent,
        author = {Avrahami, Omri and Fried, Ohad and Lischinski, Dani},
        title = {Blended Latent Diffusion},
        year = {2023},
        issue_date = {August 2023},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        volume = {42},
        number = {4},
        issn = {0730-0301},
        url = {https://doi.org/10.1145/3592450},
        doi = {10.1145/3592450},
        journal = {ACM Trans. Graph.},
        month = {jul},
        articleno = {149},
        numpages = {11},
        keywords = {zero-shot text-driven local image editing}
}

@InProceedings{Avrahami_2022_CVPR,
        author    = {Avrahami, Omri and Lischinski, Dani and Fried, Ohad},
        title     = {Blended Diffusion for Text-Driven Editing of Natural Images},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2022},
        pages     = {18208-18218}
}