llama2.rs 🤗

This is a Rust implementation of Llama2 inference on CPU.

The goal is to be as fast as possible.

It has the following features:

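Support for 4-bit GPT-Q quantization
Batched prefill of prompt tokens
SIMD support for fast CPU inference
Memory mapping, so the 70B model loads instantly
Static size checks for safety
Support for grouped query attention (needed for the large Llama models)
Python calling API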
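
As a rough illustration of the grouped query attention item above, the sketch below shows how several query heads share one key/value head. It is not taken from this repository; the function name `kv_head_for` is hypothetical, and the head counts (64 query heads, 8 key/value heads) are the published Llama 2 70B configuration.

```python
def kv_head_for(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head shared by a given query head."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per shared K/V head
    return query_head // group_size


# Llama 2 70B: 64 query heads share 8 key/value heads (8 query heads each).
mapping = [kv_head_for(h, 64, 8) for h in range(64)]
print(mapping[:10])  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

With `n_kv_heads == n_heads` the mapping is the identity, which is ordinary multi-head attention. Sharing heads shrinks the key/value cache by the group size.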

It runs at about 1 tok/s on 70B Llama2 and 9 tok/s on 7B Llama2 (on my Intel i9 desktop).

To build, you'll need the nightly toolchain, which is used by default:

> rustup toolchain install nightly # get nightly
> ulimit -s 10000000 # increase your stack memory limit

You can load models from the Hugging Face hub. For example, this creates a version of a 70B quantized model with 4-bit quantization and groups of size 64:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
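
To make "4-bit quantization with groups of 64" concrete, here is a minimal sketch of group-wise quantization. It assumes plain round-to-nearest with one scale and one minimum per group. GPTQ itself picks the codes with an error-compensating procedure, and the on-disk format written by `export.py` is not shown here, so the function names and layout below are purely illustrative.

```python
def quantize_group(weights):
    """Map one group of floats to 4-bit codes (0..15) plus a scale and a minimum."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when the group is constant
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


def quantize(weights, group_size=64):
    """Quantize a flat weight list group by group (illustrative layout only)."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


weights = [i * 0.01 - 0.5 for i in range(128)]
groups = quantize(weights, group_size=64)
print(len(groups))  # 2
```

Smaller groups track the local weight range more closely at the cost of storing more scales, which is the trade-off behind the `group_64` and `group_128` variants.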

The library needs to be recompiled to match the model. You can do this with cargo.

To run:

> cargo run --release --features 70B,group_64,quantized -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"
The only thing that I can think of is that the
achieved tok/s: 0.89155835

Honestly, that's not bad for running on my GPU machine, and it is significantly faster than llama.c.

Here's a run of the 13B quantized model:

> cargo run --release --features 13B,group_128,quantized -- -c l13orca.act.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I hope you are having a great day. I am here
achieved tok/s: 5.1588936

Here's a run of the 7B quantized model:

> cargo run --release --features 7B,group_128,quantized -- -c l7.ack.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I am a newbie here and I am looking for some
achieved tok/s: 9.048136
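
The three runs above differ only in the feature list and the arguments after `--`. The sketch below assembles such a command from the model size and group size so the two stay in sync. `cargo_run_command` is a hypothetical helper, not part of this repository; it simply reuses the flags shown above.

```python
import shlex


def cargo_run_command(size, group, checkpoint, prompt, steps, temperature=0.0):
    """Build a `cargo run` command line whose features match the model file."""
    # The feature list must match the exported model: size, group size, quantized.
    features = f"{size},group_{group},quantized"
    args = [
        "cargo", "run", "--release", "--features", features, "--",
        "-c", checkpoint,
        "-t", str(temperature),
        "-s", str(steps),
        "-p", prompt,
    ]
    return shlex.join(args)  # quotes the prompt safely for a POSIX shell


cmd = cargo_run_command("7B", 128, "l7.ack.bin", "Hello to all ", 25)
print(cmd)
```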

Python

To run in Python, you first need to compile from the main directory with the python flag.

> cargo build --release --features 7B,group_128,quantized,python
> pip install .

You can then run the following code.

import llama2_rs

def test_llama2_13b_4_128act_can_generate():
    model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
    tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
    random = llama2_rs.Random()
    response = llama2_rs.generate(
        model,
        tokenizer,
        "Tell me zero-cost abstractions in Rust ",
        50,
        random,
        0.0
    )