AutoFP8
An open-source FP8 quantization library for producing compressed checkpoints that run in vLLM; see https://github.com/vllm-project/vllm/pull/4332 for details on the inference implementation. The library focuses on providing quantized weight, activation, and KV cache scales for FP8_E4M3 precision.
Neural Magic's FP8 model collection contains many accurate (<1% accuracy drop) FP8 checkpoints that are ready for inference with vLLM.
NOTE: AutoFP8 is in early beta and subject to change.
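For intuition on what these scales do, here is a minimal sketch of scaled FP8_E4M3 casting in plain PyTorch (assuming torch >= 2.1 for the torch.float8_e4m3fn dtype; this is illustrative only, not AutoFP8's actual quantization kernel):
import torch

# Illustrative per-tensor scaling into the FP8 E4M3 range (max normal value 448),
# the same basic idea behind the weight/activation scales stored in the checkpoint.
x = torch.randn(4, 4, dtype=torch.float32)
scale = x.abs().max() / 448.0                    # hypothetical per-tensor scale
x_fp8 = (x / scale).to(torch.float8_e4m3fn)      # quantize to FP8
x_dequant = x_fp8.to(torch.float32) * scale      # dequantize back to FP32
print(x_fp8.dtype, (x - x_dequant).abs().max())  # torch.float8_e4m3fn, small error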
Installation
Clone this repo and install it from source:
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
A stable release will be published.
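To verify the editable install, a quick import check using the same objects as the Quickstart below:
# Should run without errors after the editable install above.
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
print(AutoFP8ForCausalLM, BaseQuantizeConfig)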
Quickstart
This package introduces the AutoFP8ForCausalLM and BaseQuantizeConfig objects for managing how your model is compressed. Once you load your AutoFP8ForCausalLM, you can tokenize your data and provide it to the model.quantize(tokenized_text) function to calibrate and compress the model. Finally, you can save the quantized model in a compressed checkpoint format compatible with vLLM using model.save_quantized("my_model_fp8").
Here is a full example covering the whole flow:
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
model = AutoFP8ForCausalLM.from_pretrained(
pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Finally, load the quantized model into vLLM for inference! Support starts in v0.4.2 (pip install vllm>=0.4.2). Note that your GPU must have hardware support for FP8 tensor cores (Ada Lovelace, Hopper, and newer architectures).
from vllm import LLM
model = LLM("Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-10 18:02:40 model_runner.py:175] Loading model weights took 8.4595 GB
print(model.generate("Once upon a time"))
# [RequestOutput(request_id=0, prompt='Once upon a time', prompt_token_ids=[128000, 12805, 5304, 264, 892], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' there was a man who fell in love with a woman. The man was so', token_ids=[1070, 574, 264, 893, 889, 11299, 304, 3021, 449, 264, 5333, 13, 578, 893, 574, 779], cumulative_logprob=-21.314169232733548, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715378569.478381, last_token_time=1715378569.478381, first_scheduled_time=1715378569.480648, first_token_time=1715378569.7070432, time_in_queue=0.002267122268676758, finished_time=1715378570.104807), lora_request=None)]
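If you are unsure whether your GPU qualifies, here is a quick sketch of a capability check with PyTorch (Ada Lovelace is compute capability 8.9 and Hopper is 9.0; the threshold below is an assumption based on those architectures):
import torch

# FP8 tensor cores are available starting with compute capability 8.9 (Ada Lovelace).
major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
if (major, minor) >= (8, 9):
    print(f"{name} (sm_{major}{minor}): FP8 tensor cores available")
else:
    print(f"{name} (sm_{major}{minor}): no FP8 tensor core support")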
How to run FP8 quantized models
vLLM has full support for FP8 models quantized with this package. Install vLLM: pip install vllm>=0.4.2. Then simply pass the quantized checkpoint directly to vLLM's entrypoints! It will detect the checkpoint format using the quantization_config in config.json.
from vllm import LLM
model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-06 10:06:23 model_runner.py:172] Loading model weights took 8.4596 GB
outputs = model.generate("Once upon a time,")
print(outputs[0].outputs[0].text)
# ' there was a beautiful princess who lived in a far-off kingdom. She was kind'
Checkpoint structure explanation
Here are the details of the experimental structure of an FP8 checkpoint.
The following is added to config.json:
"quantization_config": {
"quant_method": "fp8",
"activation_scheme": "static" or "dynamic"
},
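To inspect this block on a checkpoint you produced, a small sketch that simply reads config.json directly (using the Quickstart output directory "Meta-Llama-3-8B-Instruct-FP8" as an example path):
import json

# Read the quantization_config written by model.save_quantized(...).
with open("Meta-Llama-3-8B-Instruct-FP8/config.json") as f:
    config = json.load(f)
print(config["quantization_config"])
# e.g. {'quant_method': 'fp8', 'activation_scheme': 'dynamic'}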
Each quantized layer in the state_dict will have:
If the config has "activation_scheme": "static":
model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.input_scale < F32
model.layers.0.mlp.down_proj.weight_scale < F32
If the config has "activation_scheme": "dynamic":
model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.weight_scale < F32
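To confirm these tensor names and dtypes in a saved checkpoint, a sketch using safetensors (this assumes a single-shard model.safetensors file; a sharded checkpoint would need its index file instead):
from safetensors import safe_open

# Print the layer-0 down_proj tensors and their dtypes to check the layout above.
with safe_open("Meta-Llama-3-8B-Instruct-FP8/model.safetensors", framework="pt") as f:
    for name in f.keys():
        if name.startswith("model.layers.0.mlp.down_proj"):
            print(name, f.get_tensor(name).dtype)
# Expected for the dynamic scheme:
#   model.layers.0.mlp.down_proj.weight        torch.float8_e4m3fn
#   model.layers.0.mlp.down_proj.weight_scale  torch.float32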