DeepSeek-Coder-V2-Lite-Instruct-FP8: FP8 quantization for more efficient large language model deployment

About DeepSeek-Coder-V2-Lite-Instruct-FP8

Model overview

DeepSeek-Coder-V2-Lite-Instruct-FP8 is a text generation model intended for commercial and research use in English. Like Meta-Llama-3-8B-Instruct, it is intended for assistant-like chat.

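Model architecture: DeepSeek-Coder-V2-Lite-Instruct
Input: text
Output: text
Model optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Out of scope: use in any manner that violates applicable laws or regulations; use in languages other than English.
Release date: July 18, 2024
Developer: Neural Magic
Version: 1.0
License: deepseek-license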

This model is a quantized version of DeepSeek-Coder-V2-Lite-Instruct. On the HumanEval+ benchmark it achieves an average score of 79.60, compared with 79.33 for the unquantized model.

Model optimizations

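This model was obtained by quantizing the weights and activations of DeepSeek-Coder-V2-Lite-Instruct to the FP8 data type. This reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%.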
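The arithmetic behind the 50% figure can be sketched in a few lines. The parameter count below is an assumed round number used only for illustration, and `weight_footprint_gib` is a hypothetical helper, not part of any library.

```python
def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: an approximate parameter count, chosen for illustration only.
NUM_PARAMS = 16e9

bf16_gib = weight_footprint_gib(NUM_PARAMS, 16)  # 16 bits per parameter
fp8_gib = weight_footprint_gib(NUM_PARAMS, 8)    # 8 bits per parameter
saving = 1 - fp8_gib / bf16_gib

print(f"16-bit: {bf16_gib:.1f} GiB, FP8: {fp8_gib:.1f} GiB, saving: {saving:.0%}")
```

This idealized estimate treats every parameter as quantized. In practice the saving is slightly below 50%, because only the linear operators inside transformer blocks are quantized and the remaining tensors stay at their original precision.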

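Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied: a single linear scale per tensor maps between the original values and their FP8 representations. Quantization was performed with AutoFP8, using 512 sequences from UltraChat as calibration data.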
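To make symmetric per-tensor quantization concrete, here is a minimal pure-Python sketch. It assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448) and rounds to the nearest representable value without special tie handling. The function names are hypothetical, and this is not AutoFP8's actual implementation, which operates on GPU tensors.

```python
def e4m3_values():
    """Enumerate the non-negative finite values of FP8 E4M3 (exponent bias 7)."""
    vals = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # this bit pattern encodes NaN
            if e == 0:
                vals.append((m / 8) * 2.0 ** -6)  # zero and subnormals
            else:
                vals.append((1 + m / 8) * 2.0 ** (e - 7))  # normal values
    return vals


E4M3 = e4m3_values()
E4M3_MAX = max(E4M3)  # 448.0


def quantize_per_tensor(tensor):
    """Map a list of floats onto the E4M3 grid using one scale for the whole tensor."""
    amax = max(abs(x) for x in tensor)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    quantized = []
    for x in tensor:
        scaled = min(abs(x) / scale, E4M3_MAX)  # clamp to the representable range
        nearest = min(E4M3, key=lambda v: abs(v - scaled))
        quantized.append(nearest if x >= 0 else -nearest)  # symmetric around zero
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate original values by multiplying by the single scale."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.13, 1.2]  # made-up example values
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
```

Because the scheme is symmetric, there is no zero-point: the largest magnitude in the tensor maps to ±448 and everything else is scaled by the same factor.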

Deployment

The model can be deployed efficiently with the vLLM backend, as in the following example:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"

# Sampling settings used for generation.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string with the model's chat template.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# trust_remote_code=True lets the model's custom code from the Hub run.
llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

# One prompt was submitted, so take the first completion of the first result.
generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving; see the vLLM documentation for details.
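As a rough sketch of what a client for the OpenAI-compatible endpoint might look like, the snippet below uses only the Python standard library. It assumes a server has already been started separately (for example with `vllm serve neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8 --trust-remote-code --max-model-len 4096`) and is listening on vLLM's default address, `http://localhost:8000`. The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"
BASE_URL = "http://localhost:8000/v1"  # assumption: default host and port


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Write a function that reverses a string."))` would print the model's reply. The official `openai` Python client can be pointed at the same base URL instead of hand-building requests.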

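Model creation

This model was created by applying AutoFP8 with calibration samples from UltraChat, as shown in the code below. Although AutoFP8 was used for this particular model, Neural Magic is transitioning to llm-compressor, a tool that supports multiple quantization schemes.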

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
quantized_model_dir = "DeepSeek-Coder-V2-Lite-Instruct-FP8"

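tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
# Reuse the end-of-sequence token for padding so the calibration batch can be padded.
tokenizer.pad_token = tokenizer.eos_token

# Take 512 UltraChat conversations as calibration data.
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
# Render each conversation with the chat template, then tokenize the whole batch.
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")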

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",  # activation scales are fixed from the calibration data
    ignore_patterns=["re:.*lm_head"],  # leave the lm_head layer unquantized
)

# Load the unquantized model, calibrate and quantize it, then save the FP8 checkpoint.
model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
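The script above sets `activation_scheme="static"`, meaning activation scales are determined ahead of time from calibration data rather than recomputed per request. One plausible way to picture this is an observer that records the largest absolute activation seen during calibration and derives a single scale from it. The sketch below illustrates that idea only: `StaticScaleObserver` is a hypothetical class, the batch values are made up, and AutoFP8's internals may differ.

```python
E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


class StaticScaleObserver:
    """Tracks the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, activations):
        # Update the running maximum with this batch's largest magnitude.
        self.amax = max(self.amax, max(abs(a) for a in activations))

    def scale(self):
        # One fixed scale that maps the observed range onto the FP8 range.
        return self.amax / E4M3_MAX if self.amax > 0 else 1.0


observer = StaticScaleObserver()
for batch in ([0.5, -2.0, 1.0], [3.5, -0.25], [-7.0, 0.1]):  # made-up activations
    observer.observe(batch)

static_scale = observer.scale()  # reused unchanged for every later input
```

The trade-off is that a fixed scale costs nothing at inference time but depends on the calibration set being representative of real inputs.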

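Evaluation

The model was evaluated on the HumanEval+ benchmark using Neural Magic's fork of the EvalPlus implementation together with the vLLM engine.

HumanEval+ scores

On the HumanEval+ benchmark, the model performs as follows: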

Benchmark	DeepSeek-Coder-V2-Lite-Instruct	DeepSeek-Coder-V2-Lite-Instruct-FP8	Recovery
base pass@1	80.8	79.3	98.14%
base pass@10	83.4	84.6	101.4%
base+extra pass@1	75.8	74.9	98.81%
base+extra pass@10	77.3	79.6	102.9%
Average	79.33	79.60	100.3%
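The Recovery column is the quantized score expressed as a percentage of the unquantized score. The short script below recomputes it from the rounded values in the table above; `recovery` is a hypothetical helper written for this illustration.

```python
# Scores copied from the table above: (benchmark, unquantized, FP8).
scores = [
    ("base pass@1", 80.8, 79.3),
    ("base pass@10", 83.4, 84.6),
    ("base+extra pass@1", 75.8, 74.9),
    ("base+extra pass@10", 77.3, 79.6),
]


def recovery(baseline, quantized):
    """Quantized score as a percentage of the unquantized score."""
    return quantized / baseline * 100


for name, base, quant in scores:
    print(f"{name}: {recovery(base, quant):.2f}%")

# The Average row is the plain mean of the four scores in each column.
avg_base = sum(base for _, base, _ in scores) / len(scores)
avg_quant = sum(quant for _, _, quant in scores) / len(scores)
print(f"Average recovery: {recovery(avg_base, avg_quant):.1f}%")
```

Since the inputs here are already rounded to one decimal place, a recomputed value can differ from the published column in its last digit.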

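In summary, DeepSeek-Coder-V2-Lite-Instruct-FP8 recovers essentially all of the unquantized model's HumanEval+ accuracy (100.3% on average) while roughly halving disk and GPU memory requirements.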