AutoFP8
An open-source FP8 quantization library for producing compressed checkpoints that run in vLLM; see https://github.com/vllm-project/vllm/pull/4332 for details on the inference implementation. The library focuses on providing quantized weight, activation, and KV cache scales for FP8_E4M3 precision.
Neural Magic's FP8 model collection contains many accurate (<1% accuracy drop) FP8 checkpoints that are ready for inference with vLLM.
NOTE: AutoFP8 is in early beta and subject to change.
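For intuition on what these scales do, here is a minimal sketch of scaled FP8_E4M3 casting in plain PyTorch (assuming torch >= 2.1 for the torch.float8_e4m3fn dtype; this is illustrative only, not AutoFP8's actual quantization kernel):
import torch

# Illustrative per-tensor scaling into the FP8 E4M3 range (max normal value 448),
# the same basic idea behind the weight/activation scales stored in the checkpoint.
x = torch.randn(4, 4, dtype=torch.float32)
scale = x.abs().max() / 448.0                    # hypothetical per-tensor scale
x_fp8 = (x / scale).to(torch.float8_e4m3fn)      # quantize to FP8
x_dequant = x_fp8.to(torch.float32) * scale      # dequantize back to FP32
print(x_fp8.dtype, (x - x_dequant).abs().max())  # torch.float8_e4m3fn, small error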
Installation
Clone this repo and install it from source:
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
A stable release will be published.
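To verify the editable install, a quick import check using the same objects as the Quickstart below:
# Should run without errors after the editable install above.
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
print(AutoFP8ForCausalLM, BaseQuantizeConfig)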
Quickstart
This package introduces the AutoFP8ForCausalLM and BaseQuantizeConfig objects for managing how your model is compressed. Once you load your AutoFP8ForCausalLM, you can tokenize your data and provide it to the model.quantize(tokenized_text) function to calibrate and compress the model. Finally, you can save the quantized model in a compressed checkpoint format compatible with vLLM using model.save_quantized("my_model_fp8").
Here is a full example covering the whole flow:
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
model = AutoFP8ForCausalLM.from_pretrained(
pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Finally, load the quantized model into vLLM for inference! Support starts in v0.4.2 (pip install vllm>=0.4.2). Note that your GPU must have hardware support for FP8 tensor cores (Ada Lovelace, Hopper, and newer architectures).
from vllm import LLM
model = LLM("Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-10 18:02:40 model_runner.py:175] Loading model weights took 8.4595 GB
print(model.generate("Once upon a time"))
# [RequestOutput(request_id=0, prompt='Once upon a time', prompt_token_ids=[128000, 12805, 5304, 264, 892], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' there was a man who fell in love with a woman. The man was so', token_ids=[1070, 574, 264, 893, 889, 11299, 304, 3021, 449, 264, 5333, 13, 578, 893, 574, 779], cumulative_logprob=-21.314169232733548, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715378569.478381, last_token_time=1715378569.478381, first_scheduled_time=1715378569.480648, first_token_time=1715378569.7070432, time_in_queue=0.002267122268676758, finished_time=1715378570.104807), lora_request=None)]
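If you are unsure whether your GPU qualifies, here is a quick sketch of a capability check with PyTorch (Ada Lovelace is compute capability 8.9 and Hopper is 9.0; the threshold below is an assumption based on those architectures):
import torch

# FP8 tensor cores are available starting with compute capability 8.9 (Ada Lovelace).
major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
if (major, minor) >= (8, 9):
    print(f"{name} (sm_{major}{minor}): FP8 tensor cores available")
else:
    print(f"{name} (sm_{major}{minor}): no FP8 tensor core support")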
How to run FP8 quantized models
vLLM has full support for FP8 models quantized with this package. Install vLLM: pip install vllm>=0.4.2. Then simply pass the quantized checkpoint directly to vLLM's entrypoints! It will detect the checkpoint format using the quantization_config in config.json.
from vllm import LLM
model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-06 10:06:23 model_runner.py:172] Loading model weights took 8.4596 GB
outputs = model.generate("Once upon a time,")
print(outputs[0].outputs[0].text)
# ' there was a beautiful princess who lived in a far-off kingdom. She was kind'
Checkpoint structure explanation
Here are the details of the experimental structure of an FP8 checkpoint.
The following is added to config.json:
"quantization_config": {
"quant_method": "fp8",
"activation_scheme": "static" or "dynamic"
},
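To inspect this block on a checkpoint you produced, a small sketch that simply reads config.json directly (using the Quickstart output directory "Meta-Llama-3-8B-Instruct-FP8" as an example path):
import json

# Read the quantization_config written by model.save_quantized(...).
with open("Meta-Llama-3-8B-Instruct-FP8/config.json") as f:
    config = json.load(f)
print(config["quantization_config"])
# e.g. {'quant_method': 'fp8', 'activation_scheme': 'dynamic'}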
Each quantized layer in the state_dict will have:
If the config has "activation_scheme": "static":
model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.input_scale < F32
model.layers.0.mlp.down_proj.weight_scale < F32
If the config has "activation_scheme": "dynamic":
model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.weight_scale < F32
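To confirm these tensor names and dtypes in a saved checkpoint, a sketch using safetensors (this assumes a single-shard model.safetensors file; a sharded checkpoint would need its index file instead):
from safetensors import safe_open

# Print the layer-0 down_proj tensors and their dtypes to check the layout above.
with safe_open("Meta-Llama-3-8B-Instruct-FP8/model.safetensors", framework="pt") as f:
    for name in f.keys():
        if name.startswith("model.layers.0.mlp.down_proj"):
            print(name, f.get_tensor(name).dtype)
# Expected for the dynamic scheme:
#   model.layers.0.mlp.down_proj.weight        torch.float8_e4m3fn
#   model.layers.0.mlp.down_proj.weight_scale  torch.float32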