unsloth

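Finetune Llama 3.1, Mistral, Phi-3 and Gemma 2-5x faster with 80% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model that can be exported to GGUF, Ollama, vLLM, or uploaded to Hugging Face.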

Unsloth supports	Free Notebooks	Performance	Memory use
Llama 3.1 (8B)	▶️ Start for free	2x faster	60% less
Mistral Nemo (12B)	▶️ Start for free	2x faster	60% less
Gemma 2 (9B)	▶️ Start for free	2x faster	63% less
Phi-3 (mini)	▶️ Start for free	2x faster	50% less
Ollama	▶️ Start for free	1.9x faster	43% less
Mistral v0.3 (7B)	▶️ Start for free	2.2x faster	73% less
ORPO	▶️ Start for free	1.9x faster	43% less
DPO Zephyr	▶️ Start for free	1.9x faster	43% less
TinyLlama	▶️ Start for free	3.9x faster	74% less

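Kaggle notebooks for Llama 3.1 (8B), Gemma 2 (9B), Mistral (7B)
Run the Llama 3 conversational notebook and the Mistral v0.3 ChatML notebook
This text completion notebook is for continued pretraining / raw text
This continued pretraining notebook is for learning another language
Click here for detailed documentation for Unsloth.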

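🦥 Unsloth.ai News

📣 NEW! pip install unsloth now works! Head over to pypi to check it out! This allows non git pull installs. Use pip install unsloth[colab-new] for non dependency installs.
📣 NEW! Gemma-2-2b is now supported! Gemma-2-9b and Gemma-2-27b are already supported! GGUF quants have also been uploaded. Try out the chat interface for Gemma-2-2b Instruct!
📣 NEW! Llama 3.1 8b, 70b Base and Instruct are now supported
📣 NEW! Mistral Nemo-12b Base and Instruct are now supported
📣 NEW! Gemma-2-9b and Gemma-2-27b are now supported
📣 UPDATE! [Phi-3 mini](https://colab.research.google.com/drive/1hhdhBa1j_hsymi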

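⭐ Key Features

All kernels are written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU! GTX 1070, 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!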
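
The hardware rule in the list above can be written as a small check. This is a minimal sketch: `classify_gpu` is a hypothetical helper, not part of Unsloth, and the only rule it encodes is the 7.0 minimum stated above. The `(major, minor)` tuple has the shape returned by PyTorch's `torch.cuda.get_device_capability()`; here it is passed in as a plain argument so the sketch runs without a GPU.

```python
def classify_gpu(capability):
    """Compare a CUDA compute capability (major, minor) with the 7.0 minimum.

    Hypothetical helper for illustration; not part of the Unsloth API.
    """
    if tuple(capability) >= (7, 0):
        return "supported"        # V100, T4, RTX 20/30/40, A100, H100, ...
    # The list above only says that GTX 1070 / 1080 (capability 6.1)
    # still work, slowly; nothing is claimed about other older GPUs.
    return "below minimum"


# On a machine with PyTorch and an NVIDIA GPU the tuple would come from:
#   import torch; torch.cuda.get_device_capability()
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("GTX 1080", (6, 1))]:
    print(f"{name}: {classify_gpu(cap)}")
```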

🥇 Performance Benchmarking

For the full list of reproducible benchmarking tables, go to our website

1 A100 40GB	🤗Hugging Face	Flash Attention	🦥Unsloth Open Source	🦥Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

The benchmarking table below was produced by 🤗Hugging Face.

Free Colab T4	Dataset	🤗Hugging Face	Pytorch 2.1.1	🦥Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

💾 Installation Instructions

If you have Pytorch 2.3 and CUDA 12.1, install Unsloth with pip install unsloth[colab-new], then run pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes.

Conda Installation

Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. See this Github issue for help on debugging Conda installs.

conda create --name unsloth_env \
    python=3.10 \
    pytorch-cuda=<11.8/12.1> \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

Pip Installation

Do NOT use this method if you have Anaconda. You must use the Conda install method, or things will break.

Find your CUDA version via:

import torch; torch.version.cuda

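For Pytorch 2.1.0: you can update Pytorch via Pip (cu121 and cu118 are interchangeable in the commands below). Go to https://pytorch.org/ to learn more. Select cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the "ampere" path. For Pytorch 2.1.1: go to step 3 (the Pytorch 2.1.1 commands below). For Pytorch 2.2.0: go to step 4 (the Pytorch 2.2.0 commands below).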

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"

For Pytorch 2.1.1: use the "ampere" path for newer RTX 30xx GPUs or higher.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"

For Pytorch 2.2.0: use the "ampere" path for newer RTX 30xx GPUs or higher.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
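
The commands above follow one naming pattern: a CUDA tag, an optional "-ampere" suffix, and an optional "-torchXYZ" suffix. The sketch below only reproduces that pattern for the combinations listed in this section. `unsloth_pip_command` is a hypothetical helper written for illustration; it is not shipped with Unsloth, and it refuses combinations this section does not list.

```python
def unsloth_pip_command(cuda, torch_version, ampere):
    """Return the install command from this section for a given setup.

    cuda:          string as reported by torch.version.cuda, e.g. "12.1"
    torch_version: "2.1.0", "2.1.1" or "2.2.0"
    ampere:        True for RTX 30xx or newer (A100, H100, ...)
    """
    cuda_tags = {"11.8": "cu118", "12.1": "cu121"}
    torch_tags = {"2.1.0": "", "2.1.1": "-torch211", "2.2.0": "-torch220"}
    if cuda not in cuda_tags or torch_version not in torch_tags:
        raise ValueError(
            f"no command listed for CUDA {cuda} with torch {torch_version}")
    extra = cuda_tags[cuda] + ("-ampere" if ampere else "") + torch_tags[torch_version]
    return (f'pip install "unsloth[{extra}] '
            '@ git+https://github.com/unslothai/unsloth.git"')


print(unsloth_pip_command("12.1", "2.2.0", ampere=True))
print(unsloth_pip_command("11.8", "2.1.0", ampere=False))
```

Only one of the four commands in each group is needed; the helper makes that choice explicit.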

If you get errors, try the following command first, then go back to step 1 (finding your CUDA version):

pip install --upgrade pip

For Pytorch 2.2.1:

# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes

# Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/uns

```python
# NOTE: the beginning of this example is cut off in this copy. The imports
# below are the ones the visible code needs; `max_seq_length` and `dataset`
# must already be defined (their values are not shown here).
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
```
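
To get a feel for what `r = 16` over the seven `target_modules` in the example above means, the sketch below counts the LoRA weights that get trained. Each adapted linear layer with input width d_in and output width d_out gains two low-rank matrices holding r * (d_in + d_out) weights in total. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); they are not stated anywhere in this document, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama 3 8B dimensions (not stated in this document).
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (input width, output width) of each linear layer named in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, layers=LAYERS):
    """LoRA adds A (r x d_in) and B (d_out x r) to every adapted layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers


print(f"{lora_param_count(16):,}")  # 41,943,040
```

Under these assumptions, r = 16 trains about 42 million weights, roughly half a percent of an 8B-parameter model, and the count scales linearly with r.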

<a name="DPO"></a>
## DPO Support
DPO (Direct Preference Optimization), PPO, and Reward Modelling all seem to work as per third-party independent testing from [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory). We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: [notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing).

We're in 🤗Hugging Face's official docs! We're mentioned in both the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
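
For readers new to DPO, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It illustrates the published objective only and is not how Unsloth or TRL implement it; the function name and the example log-probabilities are made up for illustration. `beta` plays the same role as the `beta` argument given to `DPOTrainer`.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each response.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    Illustrative only: not hardened against overflow for extreme margins.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference model: margin 0, so loss = log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2))  # True
```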

```python
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
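dpo_trainer.train()
```

🥇 Detailed Benchmarking Tables

Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with the code stripped out. All settings and the loss curve remain identical.
For the full list of benchmarking tables, go to our website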

1 A100 40GB	🤗Hugging Face	Flash Attention 2	🦥Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.04x	1.98x	2.48x	5.32x	15.64x
Code	Code	Code	Code	Code
seconds	1040	1001	525	419	196	67
memory MB	18235	15365	9631	8525
% saved		15.74	47.18	53.25
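
The speedup and "% saved" rows can be recomputed from the raw seconds and memory figures. The sketch below does that for the four columns that list both values; the function names are illustrative. The Pro and Max columns are left out because the table gives no memory figures for them.

```python
# Raw measurements copied from the Alpaca rows above (1 A100 40GB).
seconds = {"Hugging Face": 1040, "Flash Attention 2": 1001,
           "Unsloth Open": 525, "Unsloth Equal": 419}
memory_mb = {"Hugging Face": 18235, "Flash Attention 2": 15365,
             "Unsloth Open": 9631, "Unsloth Equal": 8525}


def speedup(baseline, value):
    """How many times faster `value` seconds is than `baseline` seconds."""
    return baseline / value


def percent_saved(baseline, value):
    """Percentage of `baseline` memory saved when only `value` is used."""
    return 100 * (baseline - value) / baseline


base = "Hugging Face"
for name in seconds:
    print(f"{name:18s} {speedup(seconds[base], seconds[name]):.2f}x  "
          f"{percent_saved(memory_mb[base], memory_mb[name]):.2f}% saved")
```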

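Llama-Factory third-party benchmarking

Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.

Method	Bits	TGS	GRAM	Speed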
HF	16	2392	18GB	100%
HF+FA2	16	2954	17GB	123%
Unsloth+FA2	16	4007	16GB	168%
HF	4	2415	9GB	101%
Unsloth+FA2	4	3726	7GB	160%

Performance comparisons between popular models

Click for specific model benchmarking tables (Mistral 7b, CodeLlama 34b etc.)

Mistral 7b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Mistral 7B Slim Orca	1x	1.15x	2.15x	2.53x	4.61x	13.69x
Code	[Code](https://colab.research.google.com/drive/1mePk3KzwTD81hr5mcNcs_AX3Kbg
1 T4 16GB	Hugging Face	Flash Attention	Unsloth Open	Unsloth Pro Equal	Unsloth Pro	Unsloth Max
--------------	-------------	-----------------	-----------------	---------------	---------------	-------------
Alpaca	1x	1.09x	1.69x	1.79x	2.93x	8.3x
Code	▶️ Code	Code	Code	Code
seconds	1599	1468	942	894	545	193
memory MB	7199	7059	6459	5443
% saved		1.94	10.28	24.39

2 Tesla T4s via DDP

2 T4 DDP	Hugging Face	Flash Attention	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	0.99x	4.95x	4.44x	7.28x	20.61x
Code	▶️ Code	Code	Code
seconds	9882	9946	1996	2227	1357	480
memory MB	9176	9128	6904	6782
% saved		0.52	24.76	26.09

Performance comparisons on 1 Tesla T4 GPU:

Click for time taken for 1 epoch

One Tesla T4 on Google Colab: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	23h 15m	56h 28m	8h 38m	391h 41m
Unsloth Open	1 T4	13h 7m (1.8x)	31h 47m (1.8x)	4h 27m (1.9x)	240h 4m (1.6x)
Unsloth Pro	1 T4	3h 6m (7.5x)	5h 17m (10.7x)	1h 7m (7.7x)	59h 53m (6.5x)
Unsloth Max	1 T4	2h 39m (8.8x)	4h 31m (12.5x)	0h 58m (8.9x)	51h 30m (7.6x)
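
The multipliers in parentheses are the Huggingface time divided by each system's time. As a quick check, the sketch below parses the "23h 15m" style durations from the Alpaca column and recomputes two of them; `to_minutes` is an illustrative helper.

```python
import re


def to_minutes(duration):
    """Convert a string such as '23h 15m' to a number of minutes."""
    match = re.fullmatch(r"(\d+)h (\d+)m", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


baseline = to_minutes("23h 15m")  # Huggingface, Alpaca (52K)
for system, duration in [("Unsloth Open", "13h 7m"), ("Unsloth Pro", "3h 6m")]:
    print(f"{system}: {baseline / to_minutes(duration):.1f}x")
```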

Peak Memory Usage

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	7.3GB	5.9GB	14.0GB	13.3GB
Unsloth Open	1 T4	6.8GB	5.7GB	7.8GB	7.7GB
Unsloth Pro	1 T4	6.4GB	6.4GB	6.4GB	6.4GB
Unsloth Max	1 T4	11.4GB	12.4GB	11.9GB	14.4GB

Click for performance comparisons on 2 Tesla T4 GPUs via DDP:

**Time taken for 1 epoch**

Two Tesla T4s on Kaggle: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	84h 47m	163h 48m	30h 51m	1301h 24m *
Unsloth Pro	2 T4	3h 20m (25.4x)	5h 43m (28.7x)	1h 12m (25.7x)	71h 40m (18.1x) *
Unsloth Max	2 T4	3h 4m (27.6x)	5h 14m (31.3x)	1h 6m (28.1x)	54h 20m (23.9x) *
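
The settings line for this run (bsz = 2, ga = 4, on two GPUs) determines how many samples feed each optimizer step. The sketch below works that out under the usual data-parallel convention that every GPU processes its own micro-batches; the helper names are illustrative, and 52,000 is only an approximation of the "Alpaca (52K)" dataset size.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


def steps_per_epoch(num_samples, per_device_batch, grad_accum_steps, num_gpus):
    """Optimizer steps needed to see every sample once (last one may be partial)."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    return math.ceil(num_samples / batch)


# bsz = 2 and ga = 4 come from the settings above; 52_000 approximates Alpaca.
print(effective_batch_size(2, 4, num_gpus=2))     # 16
print(steps_per_epoch(52_000, 2, 4, num_gpus=2))  # 3250
```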

Peak Memory Usage on a Multi GPU System (2 GPUs)

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	8.4GB | 6GB	7.2GB | 5.3GB	14.3GB | 6.6GB	10.9GB | 5.9GB *
Unsloth Pro	2 T4	7.7GB | 4.9GB	7.5GB | 4.9GB	8.5GB | 4.9GB	6.2GB | 4.7GB *
Unsloth Max	2 T4	10.5GB | 5GB	10.6GB | 5GB	10.6GB | 5GB	10.5GB | 5GB *