Qwen2-1.5B-Instruct-AWQ - Exploring a New Generation of Multilingual, High-Performance Language Models

Project Overview: Qwen2-1.5B-Instruct-AWQ

Background

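Qwen2-1.5B-Instruct-AWQ is a member of the Qwen2 series of large language models. The series covers base and instruction-tuned language models ranging from 0.5 billion to 72 billion parameters, including one Mixture-of-Experts model. This project provides the instruction-tuned 1.5-billion-parameter Qwen2 model.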

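Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, the Qwen2 series performs strongly on benchmarks covering language understanding, language generation, multilingual capability, coding, mathematics, and reasoning. It surpasses most open-source models and is, to some extent, competitive with proprietary models.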

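Model Details

Qwen2 is a series of decoder language models of several sizes, built on the Transformer architecture with SwiGLU activation, attention QKV bias, and grouped-query attention. For each size, both a base language model and an aligned chat model are released. Qwen2 also has an improved tokenizer that adapts to multiple natural languages and code.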
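
To make two of the terms above more concrete, here is a minimal pure-Python sketch of SwiGLU gating and of how grouped-query attention maps query heads to shared key/value heads. It is an illustration of the general techniques, not Qwen2's actual implementation; the function names and head counts are hypothetical and do not reflect the model's real configuration.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Element-wise SwiGLU gating: silu(gate) * up.

    In a Transformer feed-forward block, `gate` and `up` would be two linear
    projections of the same hidden state; the projections are omitted here.
    """
    if len(gate) != len(up):
        raise ValueError("gate and up must have the same length")
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Grouped-query attention: consecutive query heads share one key/value head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Hypothetical head counts, chosen for illustration only.
mapping = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
print(mapping)
```

With 8 query heads and 2 key/value heads, the first four query heads read from key/value head 0 and the last four from head 1, which is why grouped-query attention needs a smaller key/value cache than standard multi-head attention.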

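Training Details

We pretrained the models on a large amount of data, then post-trained them with supervised fine-tuning and direct preference optimization (DPO).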
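
For readers unfamiliar with direct preference optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates the general objective only: the function name, the example numbers, and the default `beta=0.1` are assumptions, and nothing here describes the actual training recipe or hyperparameters used for Qwen2.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the Qwen2 report
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where pc/pr are policy log-probs and rc/rr are reference-model log-probs.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up log-probabilities: the policy prefers the chosen response
# more strongly than the reference model does, so the loss drops below log(2).
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss decreases as the policy widens its preference for the chosen response relative to the reference.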

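Requirements

The Qwen2 code is integrated into recent versions of the Hugging Face Transformers library. We recommend upgrading transformers to 4.37.0 or later; otherwise you may encounter the following error: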

KeyError: 'qwen2'
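
One way to catch this before loading the model is to compare the installed version against the 4.37.0 minimum stated above. The following is a minimal sketch using only the standard library; `parse_version` and `supports_qwen2` are hypothetical helper names, and the parsing is deliberately simplified (it keeps only the leading numeric components, so a pre-release such as 4.37.0.dev0 is treated like 4.37.0).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Keep the leading numeric components, e.g. '4.37.0.dev0' -> (4, 37, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_qwen2(installed: str, minimum: str = "4.37.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.37' compares equal to '4.37.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen2(installed) else "too old, please upgrade"
        print(f"transformers {installed}: {status}")
```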

Quickstart

The following code snippet shows how to load the tokenizer and the model, and how to generate text:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B-Instruct-AWQ"

# Load the model; device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation as one prompt string that ends with the start of the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the same device as the model.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Pass input_ids and attention_mask together.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# generate() returns prompt + completion; keep only the newly generated tokens.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
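
The slicing step near the end of the snippet is easy to misread, so here is the same idea on plain Python lists. This is only a sketch: `strip_prompt` is a hypothetical helper and the token ids are made up. It relies on the fact that the generated sequences begin with the prompt tokens, so slicing off `len(input_ids)` leaves only the new tokens.

```python
def strip_prompt(input_ids: list[list[int]], output_ids: list[list[int]]) -> list[list[int]]:
    """Remove the echoed prompt from each generated sequence.

    Each output sequence is assumed to begin with the corresponding prompt,
    so everything from index len(prompt) onward is newly generated.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Made-up token ids for a batch of two prompts.
prompts = [[11, 12, 13], [0, 21, 22]]
outputs = [[11, 12, 13, 91, 92], [0, 21, 22, 93, 94]]
completions = strip_prompt(prompts, outputs)
print(completions)
```

Without this step, decoding would return the system and user messages along with the model's reply.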

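Benchmarks and Speed

To compare the generation performance of bfloat16 (bf16) models with quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please refer to our quantized-model benchmark. It shows how different quantization techniques affect model performance.

For those interested in inference speed and memory consumption when deploying these models with transformers or vLLM, we have compiled a detailed speed benchmark.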
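
As rough intuition for why quantization matters, the sketch below estimates the memory taken by the weights alone at different bit widths. It is back-of-the-envelope arithmetic, not a measurement: it uses the nominal 1.5 billion parameter count, assumes AWQ stores weights at 4 bits, and ignores quantization scales and zero points, any layers left unquantized, activations, and the KV cache. The benchmark referenced above is the place to look for measured numbers.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = 1.5e9  # nominal size from the model name, not the exact count

# Bit widths per format; the 4-bit figure for AWQ is an assumption.
for name, bits in [("bf16", 16), ("GPTQ-Int8", 8), ("GPTQ-Int4 / AWQ", 4)]:
    print(f"{name}: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```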

Citation

If you find our work helpful, feel free to cite it:

@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}