Project Icon

HunyuanDiT

实现多分辨率扩散和细粒度中英文理解

HunyuanDiT是一个多分辨率扩散变换器模型,具有细粒度的中英文理解能力。该模型采用优化的变换器结构、文本编码器和位置编码,通过迭代数据流程提升性能。HunyuanDiT支持多轮多模态对话,可根据上下文生成和优化图像。经专业评估,该模型在中文到图像生成方面达到开源模型的先进水平。

Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding


This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring Hunyuan-DiT. You can find more visualizations on our project page.

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

🔥🔥🔥 News!!

  • Jul 15, 2024: 🚀 HunYuanDiT and Shakker.Ai have jointly launched a fine-tuning event based on the HunYuanDiT 1.2 model. By publishing a lora or fine-tuned model based on HunYuanDiT, you can earn up to $230 bonus from Shakker.Ai. See Shakker.Ai for more details.
  • Jul 15, 2024: :tada: Update ComfyUI to support standardized workflows and compatibility with weights from t2i module and Lora training for versions 1.1/1.2, as well as those trained by Kohya or the official script. See ComfyUI for details.
  • Jul 15, 2024: :zap: We offer Docker environments for CUDA 11/12, allowing you to bypass complex installations and play with a single click! See dockers for details.
  • Jul 08, 2024: :tada: HYDiT-v1.2 version is released. Please check HunyuanDiT-v1.2 and Distillation-v1.2 for more details.
  • Jul 03, 2024: :tada: Kohya-hydit version now available for v1.1 and v1.2 models, with GUI for inference. Official Kohya version is under review. See kohya for details.
  • Jun 27, 2024: :art: Hunyuan-Captioner is released, providing fine-grained caption for training data. See mllm for details.
  • Jun 27, 2024: :tada: Support LoRa and ControlNet in diffusers. See diffusers for details.
  • Jun 27, 2024: :tada: 6GB GPU VRAM Inference scripts are released. See lite for details.
  • Jun 19, 2024: :tada: ControlNet is released, supporting canny, pose and depth control. See training/inference codes for details.
  • Jun 13, 2024: :zap: HYDiT-v1.1 version is released, which mitigates the issue of image oversaturation and alleviates the watermark issue. Please check HunyuanDiT-v1.1 and Distillation-v1.1 for more details.
  • Jun 13, 2024: :truck: The training code is released, offering full-parameter training and LoRA training.
  • Jun 06, 2024: :tada: Hunyuan-DiT is now available in ComfyUI. Please check ComfyUI for more details.
  • Jun 06, 2024: 🚀 We introduce Distillation version for Hunyuan-DiT acceleration, which achieves 50% acceleration on NVIDIA GPUs. Please check Distillation for more details.
  • Jun 05, 2024: 🤗 Hunyuan-DiT is now available in 🤗 Diffusers! Please check the example below.
  • Jun 04, 2024: :globe_with_meridians: Support Tencent Cloud links to download the pretrained models! Please check the links below.
  • May 22, 2024: 🚀 We introduce TensorRT version for Hunyuan-DiT acceleration, which achieves 47% acceleration on NVIDIA GPUs. Please check TensorRT-libs for instructions.
  • May 22, 2024: 💬 We support demo running multi-turn text2image generation now. Please check the script below.

🤖 Try it on the web

Welcome to our web-based Tencent Hunyuan Bot, where you can explore our innovative products! Just input the suggested prompts below or any other imaginative prompts containing drawing-related keywords to activate the Hunyuan text-to-image generation feature. Unleash your creativity and create any picture you desire, all for free!

You can use simple prompts similar to natural language text

画一只穿着西装的猪

draw a pig in a suit

生成一幅画,赛博朋克风,跑车

generate a painting, cyberpunk style, sports car

or multi-turn language interactions to create the picture.

画一个木制的鸟

draw a wooden bird

变成玻璃的

turn into glass

📑 Open-source Plan

  • Hunyuan-DiT (Text-to-Image Model)
    • Inference
    • Checkpoints
    • Distillation Version
    • TensorRT Version
    • Training
    • Lora
    • Controlnet (Pose, Canny, Depth)
    • 6GB GPU VRAM Inference
    • IP-adapter
    • Hunyuan-DiT-S checkpoints (0.7B model)
  • Mllm
    • Hunyuan-Captioner (Re-caption the raw image-text pairs)
      • Inference
    • Hunyuan-DialogGen (Prompt Enhancement Model)
      • Inference
  • Web Demo (Gradio)
  • Multi-turn T2I Demo (Gradio)
  • Cli Demo
  • ComfyUI
  • Diffusers
  • Kohya
  • WebUI

Contents

Abstract

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully designed the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-round multi-modal dialogue with users, generating and refining images according to the context. Through our carefully designed holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.

🎉 Hunyuan-DiT Key Features

Chinese-English Bilingual DiT Architecture

Hunyuan-DiT is a diffusion model in the latent space, as depicted in figure below. Following the Latent Diffusion Model, we use a pre-trained Variational Autoencoder (VAE) to compress the images into low-dimensional latent spaces and train a diffusion model to learn the data distribution with diffusion models. Our diffusion model is parameterized with a transformer. To encode the text prompts, we leverage a combination of pre-trained bilingual (English and Chinese) CLIP and multilingual T5 encoder.

Multi-turn Text2Image Generation

Understanding natural language instructions and performing multi-turn interaction with users are important for a text-to-image system. It can help build a dynamic and iterative creation process that bring the user’s idea into reality step by step. In this section, we will detail how we empower Hunyuan-DiT with the ability to perform multi-round conversations and image generation. We train MLLM to understand the multi-round user dialogue and output the new text prompt for image generation.

📈 Comparisons

In order to comprehensively compare the generation capabilities of HunyuanDiT and other models, we constructed a 4-dimensional test set, including Text-Image Consistency, Excluding AI Artifacts, Subject Clarity, Aesthetic. More than 50 professional evaluators performs the evaluation.

ModelOpen SourceText-Image Consistency (%)Excluding AI Artifacts (%)Subject Clarity (%)Aesthetics (%)Overall (%)
SDXL64.360.691.176.342.7
PixArt-α68.360.993.277.545.5
Playground 2.571.970.894.983.354.3
SD 377.169.394.682.556.7
MidJourney v673.580.293.587.263.3
DALL-E 383.980.396.589.471.0
Hunyuan-DiT74.274.395.486.659.0

🎥 Visualization

  • Chinese Elements

  • Long Text Input

  • **Multi-turn Text2Image
项目侧边栏1项目侧边栏2
推荐项目
Project Cover

豆包MarsCode

豆包 MarsCode 是一款革命性的编程助手,通过AI技术提供代码补全、单测生成、代码解释和智能问答等功能,支持100+编程语言,与主流编辑器无缝集成,显著提升开发效率和代码质量。

Project Cover

AI写歌

Suno AI是一个革命性的AI音乐创作平台,能在短短30秒内帮助用户创作出一首完整的歌曲。无论是寻找创作灵感还是需要快速制作音乐,Suno AI都是音乐爱好者和专业人士的理想选择。

Project Cover

有言AI

有言平台提供一站式AIGC视频创作解决方案,通过智能技术简化视频制作流程。无论是企业宣传还是个人分享,有言都能帮助用户快速、轻松地制作出专业级别的视频内容。

Project Cover

Kimi

Kimi AI助手提供多语言对话支持,能够阅读和理解用户上传的文件内容,解析网页信息,并结合搜索结果为用户提供详尽的答案。无论是日常咨询还是专业问题,Kimi都能以友好、专业的方式提供帮助。

Project Cover

阿里绘蛙

绘蛙是阿里巴巴集团推出的革命性AI电商营销平台。利用尖端人工智能技术,为商家提供一键生成商品图和营销文案的服务,显著提升内容创作效率和营销效果。适用于淘宝、天猫等电商平台,让商品第一时间被种草。

Project Cover

吐司

探索Tensor.Art平台的独特AI模型,免费访问各种图像生成与AI训练工具,从Stable Diffusion等基础模型开始,轻松实现创新图像生成。体验前沿的AI技术,推动个人和企业的创新发展。

Project Cover

SubCat字幕猫

SubCat字幕猫APP是一款创新的视频播放器,它将改变您观看视频的方式!SubCat结合了先进的人工智能技术,为您提供即时视频字幕翻译,无论是本地视频还是网络流媒体,让您轻松享受各种语言的内容。

Project Cover

美间AI

美间AI创意设计平台,利用前沿AI技术,为设计师和营销人员提供一站式设计解决方案。从智能海报到3D效果图,再到文案生成,美间让创意设计更简单、更高效。

Project Cover

AIWritePaper论文写作

AIWritePaper论文写作是一站式AI论文写作辅助工具,简化了选题、文献检索至论文撰写的整个过程。通过简单设定,平台可快速生成高质量论文大纲和全文,配合图表、参考文献等一应俱全,同时提供开题报告和答辩PPT等增值服务,保障数据安全,有效提升写作效率和论文质量。

投诉举报邮箱: service@vectorlightyear.com
@2024 懂AI·鲁ICP备2024100362号-6·鲁公网安备37021002001498号