llama2.rs: A High-Performance Llama 2 Inference Engine in Pure Rust

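llama2.rs is a Llama 2 inference engine written in Rust that aims to deliver the fastest possible CPU inference. Developed by GitHub users srush and rachtsingh, it is a Rust reimplementation and extension of Andrej Karpathy's llama2.c.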

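Key Features

llama2.rs offers the following notable features:

4-bit GPTQ quantization, which greatly reduces model memory usage
Batched prefill of prompt tokens for more efficient inference
SIMD instructions to accelerate CPU inference
Memory mapping, so that even the 70B-parameter model loads almost instantly
Static size checks for better memory safety
Grouped-query attention (GQA), needed for the larger Llama models
A Python API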
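
To illustrate what static size checks can look like in Rust, here is a minimal sketch using const generics. It is not taken from the llama2.rs source: the `Linear` type and its methods are hypothetical names, and the sketch uses only stable Rust, whereas the project itself builds on the nightly toolchain. The idea is that the matrix dimensions live in the type, so a mismatched input length is rejected by the compiler rather than discovered at run time.

```rust
// Hypothetical sketch; names and layout are assumptions, not the llama2.rs API.

/// A dense weight matrix whose shape (IN columns, OUT rows) is part of its type.
struct Linear<const IN: usize, const OUT: usize> {
    w: Vec<[f32; IN]>, // OUT rows, each exactly IN wide
}

impl<const IN: usize, const OUT: usize> Linear<IN, OUT> {
    /// Build from a flat row-major slice. The length is validated once,
    /// at load time; after that the shape is guaranteed by the type.
    fn from_flat(flat: &[f32]) -> Option<Self> {
        if IN == 0 || flat.len() != IN * OUT {
            return None;
        }
        let w = flat
            .chunks_exact(IN)
            .map(|row| {
                let mut r = [0.0f32; IN];
                r.copy_from_slice(row);
                r
            })
            .collect();
        Some(Self { w })
    }

    /// Compute y = W x. Passing an array whose length is not IN
    /// fails to compile instead of panicking at run time.
    fn matvec(&self, x: &[f32; IN]) -> [f32; OUT] {
        let mut y = [0.0f32; OUT];
        for (yi, row) in y.iter_mut().zip(self.w.iter()) {
            *yi = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum();
        }
        y
    }
}
```

With this shape, `Linear::<3, 2>` accepts only `&[f32; 3]` inputs. Fixing model dimensions at compile time is also consistent with the build selecting the model size through cargo features such as `70B` and `group_64`.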

On the author's Intel i9 desktop, llama2.rs runs the 70B-parameter Llama 2 model at 1 token/s and the 7B-parameter model at 9 tokens/s, a significant improvement over the original llama2.c.

Usage

To use llama2.rs, first install the Rust nightly toolchain:

rustup toolchain install nightly
ulimit -s 10000000 # raise the stack size limit

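Next, download a pretrained model from the Hugging Face Hub. For example, the following commands export a 70B-parameter, 4-bit quantized model with a group size of 64: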

pip install -r requirements.export.txt
python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
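
To make "4-bit" and "group size 64" concrete: at 4 bits per weight, 70 billion weights take about 35 GB before per-group metadata, and every group of 64 consecutive weights shares one set of quantization parameters. The sketch below shows group-wise 4-bit packing and unpacking under simplifying assumptions. It uses plain round-to-nearest quantization rather than the GPTQ algorithm, and the `Group` layout and function names are hypothetical, not the format written by export.py.

```rust
// Hypothetical sketch of group-wise 4-bit quantization (round-to-nearest,
// NOT GPTQ); the storage layout here is an assumption for illustration.

/// One quantization group: packed 4-bit codes plus shared parameters.
struct Group {
    packed: Vec<u8>, // two 4-bit codes per byte, low nibble first
    scale: f32,      // step between adjacent code values
    min: f32,        // value represented by code 0
}

/// Quantize one group of weights to 4-bit codes in the range 0..=15.
fn quantize_group(weights: &[f32]) -> Group {
    let min = weights.iter().copied().fold(f32::INFINITY, f32::min);
    let max = weights.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Avoid a zero scale when all weights are equal (or the group is empty).
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let codes: Vec<u8> = weights
        .iter()
        .map(|&w| ((w - min) / scale).round().clamp(0.0, 15.0) as u8)
        .collect();
    let packed = codes
        .chunks(2)
        .map(|pair| pair[0] | (pair.get(1).copied().unwrap_or(0) << 4))
        .collect();
    Group { packed, scale, min }
}

/// Recover approximate f32 weights; `len` is the original group length.
fn dequantize_group(group: &Group, len: usize) -> Vec<f32> {
    group
        .packed
        .iter()
        .flat_map(|&b| [b & 0x0F, b >> 4])
        .take(len)
        .map(|q| group.min + f32::from(q) * group.scale)
        .collect()
}
```

A group of 64 weights packs into 32 bytes, and with this scheme each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`).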

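Build and run the model with cargo. Note that the file name passed to -c below (llama2-70b-q.bin) differs from the one produced by the export command above (l70b.act64.bin); use the path of the file you actually exported: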

cargo run --release --features 70B,group_64,quantized -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"

This prints the generated text along with the inference speed.

Python Interface

llama2.rs also provides a Python interface. First, build and install the Python package:

cargo build --release --features 7B,group_128,quantized,python
pip install .

It can then be used from Python:

import llama2_rs

model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
random = llama2_rs.Random()
response = llama2_rs.generate(
    model,
    tokenizer,
    "Tell me zero-cost abstractions in Rust ",
    50,
    random,
    0.0
)