Mistral-Nemo-Instruct-2407-FP8: FP8 Quantization for Model Optimization and Deployment

Project overview: Mistral-Nemo-Instruct-2407-FP8

Model overview

Mistral-Nemo-Instruct-2407-FP8 is an FP8-quantized language model developed by Neural Magic, intended for commercial and research use in English.

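Model architecture: Mistral-Nemo, with text input and text output.
Model optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Intended use cases: assistant-like chat, similar to Meta-Llama-3-8B-Instruct.
Out of scope: use in any manner that violates applicable laws or regulations, and use in languages other than English.
Release date: July 18, 2024
Version: 1.0
License: Apache 2.0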

This model was obtained by quantizing Mistral-Nemo-Instruct-2407. It achieves an average score of 71.28 on the OpenLLM benchmark, compared with 71.61 for the unquantized model.

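Model optimizations

This model was obtained by quantizing the weights and activations of Mistral-Nemo-Instruct-2407 to the FP8 data type, and it is ready for inference with vLLM >= 0.5.0. The optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by roughly 50%.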
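The 16-to-8-bit arithmetic can be checked with a short calculation. The sketch below is only an illustration: the parameter count is an assumed round figure, and `weight_memory_gib` is a hypothetical helper, not part of any library. Because only the linear layers inside the transformer blocks are quantized, the real saving is somewhat below the ideal 50% computed here.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed round parameter count, used purely for illustration.
NUM_PARAMS = 12_000_000_000

mem_16bit = weight_memory_gib(NUM_PARAMS, 16)
mem_fp8 = weight_memory_gib(NUM_PARAMS, 8)
saving = 1 - mem_fp8 / mem_16bit

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f"FP8 weights:    {mem_fp8:.1f} GiB")
print(f"Ideal saving:   {saving:.0%}")
```

Activations, the KV cache, and any layers left unquantized are not counted, so this is a lower bound on what a deployment actually needs.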

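Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale maps between the original values and the FP8 representation of the quantized weights and activations. Quantization was performed with AutoFP8, using 512 UltraChat sequences for calibration.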
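To make the idea of symmetric per-tensor quantization concrete, here is a minimal pure-Python sketch. It assumes the E4M3 variant of FP8 (largest finite value 448, 3 mantissa bits) and simulates rounding onto that grid in software. All function names are hypothetical, and this is not how AutoFP8 is implemented internally; it only shows the "one scale per tensor, no zero point" scheme.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the assumed E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest point on a simulated E4M3 grid."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    _, exp = math.frexp(abs(x))    # abs(x) = m * 2**exp with 0.5 <= m < 1
    exponent = max(exp - 1, -6)    # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)   # grid spacing with 3 mantissa bits
    return round(x / step) * step


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: a single scale, no zero point."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map simulated FP8 values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -0.75, 0.31, 1.5, -0.004]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

The scale is chosen so that the largest magnitude in the tensor lands on the largest representable FP8 value; every other value shares that same scale, which is what distinguishes per-tensor from per-channel schemes.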

Deployment

Use with vLLM

This model can be deployed efficiently with the vLLM backend, as in the following example:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"

sampling_params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

llm = LLM(model=model_id, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving; see the vLLM documentation for details.
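As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat completion request using only the standard library. It assumes a vLLM OpenAI-compatible server has already been started separately and is listening at `http://localhost:8000/v1`; that address and the helper names `build_chat_request` and `send_chat_request` are assumptions for this example, not something stated in the model card.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"


def build_chat_request(messages, base_url="http://localhost:8000/v1",
                       temperature=0.3, top_p=0.9, max_tokens=256):
    """Build a POST request in the OpenAI chat completions format."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
])
print(request.full_url)
# With a server running at the assumed address:
# print(send_chat_request(request))
```

The sampling parameters mirror the offline example above, and the server applies the chat template itself, so no tokenizer is needed on the client side.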

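Creation

This model was created by applying AutoFP8 with calibration samples from UltraChat, as in the script that follows. Neural Magic is currently transitioning to llm-compressor, which supports more quantization schemes. Note that transformers must be built from source.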

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "mistralai/Mistral-Nemo-Instruct-2407"
quantized_model_dir = "Mistral-Nemo-Instruct-2407-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
tokenizer.pad_token = tokenizer.eos_token

ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)

model.quantize(examples)
model.save_quantized(quantized_model_dir)

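Evaluation

The model was evaluated on the OpenLLM leaderboard tasks with lm-evaluation-harness and the vLLM engine, using the following command. Note that vllm must also be built from source.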

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",dtype=auto,gpu_memory_utilization=0.4,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto

Accuracy

Open LLM Leaderboard evaluation scores

Benchmark	Mistral-Nemo-Instruct-2407	Mistral-Nemo-Instruct-2407-FP8 (this model)	Recovery
MMLU (5-shot)	68.35	68.50	100.2%
ARC Challenge (25-shot)	65.53	64.68	98.70%
GSM-8K (5-shot, strict-match)	74.45	73.01	98.06%
Hellaswag (10-shot)	84.32	84.18	99.83%
Winogrande (5-shot)	82.16	82.32	100.1%
TruthfulQA (0-shot)	54.85	54.96	100.2%
Average	71.61	71.28	99.53%
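
The recovery column is the quantized score expressed as a percentage of the baseline score. A minimal sketch of that calculation, using the scores from the table, is given here; `recovery` is a hypothetical helper name. Recomputed per-task values may differ from the table in the last digit, presumably because of rounding.

```python
# (baseline score, FP8 score) for each benchmark, copied from the table.
scores = {
    "MMLU (5-shot)": (68.35, 68.50),
    "ARC Challenge (25-shot)": (65.53, 64.68),
    "GSM-8K (5-shot, strict-match)": (74.45, 73.01),
    "Hellaswag (10-shot)": (84.32, 84.18),
    "Winogrande (5-shot)": (82.16, 82.32),
    "TruthfulQA (0-shot)": (54.85, 54.96),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for task, (base, fp8) in scores.items():
    print(f"{task}: {recovery(base, fp8):.2f}%")

# Average the raw scores first, then compute recovery on the averages.
avg_base = sum(base for base, _ in scores.values()) / len(scores)
avg_fp8 = sum(fp8 for _, fp8 in scores.values()) / len(scores)
print(f"Average: {avg_base:.2f} vs {avg_fp8:.2f}, "
      f"recovery {recovery(avg_base, avg_fp8):.2f}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that task, which is within the normal run-to-run variation of these benchmarks rather than a sign that quantization improves accuracy.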