GPU-Benchmarks-on-LLM-Inference
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? 🧐
Description
Use llama.cpp to measure LLaMA 3 inference speed on different GPUs: cloud GPUs on RunPod, a 13-inch M1 MacBook Air, a 14-inch M1 Max MacBook Pro, an M2 Ultra Mac Studio, and a 16-inch M3 Max MacBook Pro.
Overview
Average speed (tokens/s) when generating 1024 tokens with LLaMA 3, by GPU. Higher is better.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
3070 8GB | 70.94 | OOM | OOM | OOM |
3080 10GB | 106.40 | OOM | OOM | OOM |
3080 Ti 12GB | 106.71 | OOM | OOM | OOM |
4070 Ti 12GB | 82.21 | OOM | OOM | OOM |
4080 16GB | 106.22 | 40.29 | OOM | OOM |
RTX 4000 Ada 20GB | 58.59 | 20.85 | OOM | OOM |
3090 24GB | 111.74 | 46.51 | OOM | OOM |
4090 24GB | 127.74 | 54.34 | OOM | OOM |
RTX 5000 Ada 32GB | 89.87 | 32.67 | OOM | OOM |
3090 24GB * 2 | 108.07 | 47.15 | 16.29 | OOM |
4090 24GB * 2 | 122.56 | 53.27 | 19.06 | OOM |
RTX A6000 48GB | 102.22 | 40.25 | 14.58 | OOM |
RTX 6000 Ada 48GB | 130.99 | 51.97 | 18.36 | OOM |
A40 48GB | 88.95 | 33.95 | 12.08 | OOM |
L40S 48GB | 113.60 | 43.42 | 15.31 | OOM |
RTX 4000 Ada 20GB * 4 | 56.14 | 20.58 | 7.33 | OOM |
A100 PCIe 80GB | 138.31 | 54.56 | 22.11 | OOM |
A100 SXM 80GB | 133.38 | 53.18 | 24.33 | OOM |
H100 PCIe 80GB | 144.49 | 67.79 | 25.01 | OOM |
3090 24GB * 4 | 104.94 | 46.40 | 16.89 | OOM |
4090 24GB * 4 | 117.61 | 52.69 | 18.83 | OOM |
RTX 5000 Ada 32GB * 4 | 82.73 | 31.94 | 11.45 | OOM |
3090 24GB * 6 | 101.07 | 45.55 | 16.93 | 5.82 |
4090 24GB * 8 | 116.13 | 52.12 | 18.76 | 6.45 |
RTX A6000 48GB * 4 | 93.73 | 38.87 | 14.32 | 4.74 |
RTX 6000 Ada 48GB * 4 | 118.99 | 50.25 | 17.96 | 6.06 |
A40 48GB * 4 | 83.79 | 33.28 | 11.91 | 3.98 |
L40S 48GB * 4 | 105.72 | 42.48 | 14.99 | 5.03 |
A100 PCIe 80GB * 4 | 117.30 | 51.54 | 22.68 | 7.38 |
A100 SXM 80GB * 4 | 97.70 | 45.45 | 19.60 | 6.92 |
H100 PCIe 80GB * 4 | 118.14 | 62.90 | 26.20 | 9.63 |
M1 7-Core GPU 8GB | 9.72 | OOM | OOM | OOM |
M1 Max 32-Core GPU 64GB | 34.49 | 18.43 | 4.09 | OOM |
M2 Ultra 76-Core GPU 192GB | 76.28 | 36.25 | 12.13 | 4.71 |
M3 Max 40-Core GPU 64GB | 50.74 | 22.39 | 7.53 | OOM |
Average prompt evaluation speed (tokens/s) for a 1024-token prompt with LLaMA 3, by GPU. Higher is better.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
3070 8GB | 2283.62 | OOM | OOM | OOM |
3080 10GB | 3557.02 | OOM | OOM | OOM |
3080 Ti 12GB | 3556.67 | OOM | OOM | OOM |
4070 Ti 12GB | 3653.07 | OOM | OOM | OOM |
4080 16GB | 5064.99 | 6758.90 | OOM | OOM |
RTX 4000 Ada 20GB | 2310.53 | 2951.87 | OOM | OOM |
3090 24GB | 3865.39 | 4239.64 | OOM | OOM |
4090 24GB | 6898.71 | 9056.26 | OOM | OOM |
RTX 5000 Ada 32GB | 4467.46 | 5835.41 | OOM | OOM |
3090 24GB * 2 | 4004.14 | 4690.50 | 393.89 | OOM |
4090 24GB * 2 | 8545.00 | 11094.51 | 905.38 | OOM |
RTX A6000 48GB | 3621.81 | 4315.18 | 466.82 | OOM |
RTX 6000 Ada 48GB | 5560.94 | 6205.44 | 547.03 | OOM |
A40 48GB | 3240.95 | 4043.05 | 239.92 | OOM |
L40S 48GB | 5908.52 | 2491.65 | 649.08 | OOM |
RTX 4000 Ada 20GB * 4 | 3369.24 | 4366.64 | 306.44 | OOM |
A100 PCIe 80GB | 5800.48 | 7504.24 | 726.65 | OOM |
A100 SXM 80GB | 5863.92 | 681.47 | 796.81 | OOM |
H100 PCIe 80GB | 7760.16 | 10342.63 | 984.06 | OOM |
3090 24GB * 4 | 4653.93 | 5713.41 | 350.06 | OOM |
4090 24GB * 4 | 9609.29 | 12304.19 | 898.17 | OOM |
RTX 5000 Ada 32GB * 4 | 6530.78 | 2877.66 | 541.54 | OOM |
3090 24GB * 6 | 5153.05 | 5952.55 | 739.40 | 927.23 |
4090 24GB * 8 | 9706.82 | 11818.92 | 1336.26 | 1890.48 |
RTX A6000 48GB * 4 | 5340.10 | 6448.85 | 539.20 | 792.23 |
RTX 6000 Ada 48GB * 4 | 9679.55 | 12637.94 | 714.93 | 1270.39 |
A40 48GB * 4 | 4841.98 | 5931.06 | 263.36 | 900.79 |
L40S 48GB * 4 | 9008.27 | 2541.61 | 634.05 | 1478.83 |
A100 PCIe 80GB * 4 | 8889.35 | 11670.74 | 978.06 | 1733.41 |
A100 SXM 80GB * 4 | 7782.25 | 674.11 | 539.08 | 1834.16 |
H100 PCIe 80GB * 4 | 11560.23 | 15612.81 | 1133.23 | 2420.10 |
M1 7-Core GPU 8GB | 87.26 | OOM | OOM | OOM |
M1 Max 32-Core GPU 64GB | 355.45 | 418.77 | 33.01 | OOM |
M2 Ultra 76-Core GPU 192GB | 1023.89 | 1202.74 | 117.76 | 145.82 |
M3 Max 40-Core GPU 64GB | 678.04 | 751.49 | 62.88 | OOM |
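The generation table above corresponds to llama.cpp's "eval" timing and the prompt table to its "prompt eval" timing. Below is a minimal sketch for pulling both rates out of a run's log, assuming the classic `llama_print_timings` output format (newer llama.cpp builds print timings differently); the sample numbers are illustrative, reconstructed from the single 4090 row above.

```python
import re

# Hedged sketch: extract tokens/s from llama.cpp's llama_print_timings lines.
# Assumes the classic log format; adjust the pattern for newer builds.
log = """\
llama_print_timings: prompt eval time =   148.43 ms /  1024 tokens (0.14 ms per token, 6898.71 tokens per second)
llama_print_timings:        eval time =  8016.28 ms /  1024 runs   (7.83 ms per token,  127.74 tokens per second)
"""

for line in log.splitlines():
    m = re.search(r"(prompt eval|eval) time.*?([\d.]+) tokens per second", line)
    if m:
        print(f"{m.group(1)}: {m.group(2)} tokens/s")
```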
Model
Thanks to shawwn for LLaMA model weights (7B, 13B, 30B, 65B): llama-dl. Access LLaMA 2 from Meta AI. Access LLaMA 3 from Meta Llama 3 on Hugging Face or my Hugging Face repos: Xiongjie Dai.
Usage
Build
- For NVIDIA GPUs, this provides BLAS acceleration using the CUDA cores of your NVIDIA GPU:

!make clean && LLAMA_CUBLAS=1 make -j

- For Apple Silicon, Metal is enabled by default:

!make clean && make -j
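The `LLAMA_CUBLAS=1` flag matches the llama.cpp revision used for these runs. On newer checkouts the Makefile flag has been renamed, so if the build ignores it, something like the following may be needed instead (an assumption about your checkout's vintage; consult the repo's current build docs):

!make clean && GGML_CUDA=1 make -j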
Text Completion
Use the argument `-ngl 0` to run inference on the CPU only, and `-ngl 10000` to ensure all layers are offloaded to the GPU.
!./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 -p """\
First Citizen:\n\n\
Before we proceed any further, hear me speak.\n\n\
\n\n\
All:\n\n\
Speak, speak.\n\n\
\n\n\
First Citizen:\n\n\
You are all resolved rather to die than to famish?\n\n\
\n\n\
All:\n\n\
Resolved. resolved.\n\n\
\n\n\
First Citizen:\n\n\
First, you know Caius Marcius is chief enemy to the people.\n\n\
\n\n\
All:\n\n\
We know't, we know't.\n\n\
\n\n\
First Citizen:\n\n\
Let us kill him, and we'll have corn at our own price. Is't a verdict?\n\n\
\n\n\
All:\n\n\
No more talking on't; let it be done: away, away!\n\n\
\n\n\
Second Citizen:\n\n\
One word, good citizens.\n\n\
\n\n\
First Citizen:\n\n\
We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if they would yield us but the superfluity, \
while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of \
our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, \
ere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.\n\n\
\n\n\
"""
Note: For Apple Silicon, check the `recommendedMaxWorkingSetSize` in the log output to see how much memory can be allocated on the GPU while maintaining its performance. Only about 70% of unified memory can be allocated to the GPU on a 32GB M1 Max right now, and around 78% is expected to be usable by the GPU on machines with more memory. (Source: https://developer.apple.com/videos/play/tech-talks/10580/?time=346) To utilize the whole memory, use `-ngl 0` to run inference on the CPU only. (Thanks to: https://github.com/ggerganov/llama.cpp/pull/1826)
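As a rough planning aid, here is a sketch of that heuristic in Python. The 70%/78% fractions come straight from the note above; the 36GB cutoff is purely an illustrative assumption, so query `recommendedMaxWorkingSetSize` on your own machine for the real limit.

```python
# Hedged sketch of the unified-memory heuristic quoted above. The fractions
# are from the note; the cutoff is an illustrative assumption, not Apple's
# actual formula -- trust recommendedMaxWorkingSetSize instead.
def approx_gpu_budget_gb(total_gb: float) -> float:
    frac = 0.70 if total_gb <= 36 else 0.78
    return total_gb * frac

for total in (8, 32, 64, 192):
    print(f"{total:>3} GB unified memory -> ~{approx_gpu_budget_gb(total):.0f} GB for the GPU")
```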
Chat template for LLaMA 3 🦙🦙🦙
!./main -ngl 10000 -m ./models/8B-v3-instruct/ggml-model-Q4_K_M.gguf --color -c 0 -n -2 -e -s 0 --mirostat 2 -i --no-display-prompt --keep -1 \
-r '<|eot_id|>' -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
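Here `-r '<|eot_id|>'` makes generation stop at LLaMA 3's end-of-turn token and return control to you, while `--in-prefix`/`--in-suffix` wrap each interactive user turn in the template's header tokens.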
Benchmark
!./llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf
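llama-bench runs a prompt-processing test for each `-p` length and a text-generation test for each `-n` length, reporting tokens/s for each.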
Total VRAM Requirements
Model | Quantized size (Q4_K_M) | Original size (F16) |
---|---|---|
8B | 4.58 GB | 14.96 GB |
70B | 39.59 GB | 131.42 GB |
You can estimate the VRAM requirement for a model using this tool: LLM RAM Calculator
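For a quick back-of-the-envelope check without the calculator, weight memory is roughly the parameter count times bits per weight. A sketch, assuming ~4.85 bits/weight on average for Q4_K_M and ignoring the KV cache and compute buffers:

```python
# Hedged estimate of weight memory only (no KV cache, no compute buffers).
# Q4_K_M averages roughly 4.85 bits/weight across tensors; F16 is exactly 16.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "F16": 16.0}

def weight_gib(n_params: float, quant: str) -> float:
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

print(f"8B  Q4_K_M: {weight_gib(8.0e9, 'Q4_K_M'):.2f} GiB")   # ~4.5 GiB
print(f"70B F16:    {weight_gib(70.6e9, 'F16'):.2f} GiB")     # ~131 GiB
```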
Perplexity table on LLaMA 3 70B
Lower perplexity is better. The Delta column is the relative increase over the FP16 baseline, i.e. (PPL_quant - PPL_FP16) / PPL_FP16. (Credit: dranger003)
Quantization | Size (GiB) | Perplexity (wiki.test) | Delta (FP16) |
---|---|---|---|
IQ1_S | 14.29 | 9.8655 +/- 0.0625 | 248.51% |
IQ1_M | 15.60 | 8.5193 +/- 0.0530 | 201.94% |
IQ2_XXS | 17.79 | 6.6705 +/- 0.0405 |