Awesome Model Quantization
This repo collects papers, documents, and code about model quantization for anyone who wants to do research on it. We are continuously improving the project. Pull requests adding works (papers, repositories) that the repo is missing are welcome. Special thanks to Xingyu Zheng, Yifu Ding, Xudong Ma, Yuxuan Wen, and all researchers who have contributed to this project!
Table of Contents
Efficient_AIGC_Repo
We highlight our newly released open-source project "Awesome Efficient AIGC", which focuses on recent methods for compressing and accelerating generative models such as large language models and diffusion models. Feel free to star the repo or open a PR for any work you would like to add!
https://github.com/htqin/awesome-efficient-aigc
Benchmark
BiBench
The paper BiBench: Benchmarking and Analyzing Network Binarization (ICML 2023) presents a rigorously designed benchmark with in-depth analysis for network binarization. For details, please refer to:
BiBench: Benchmarking and Analyzing Network Binarization [Paper] [Project]
Haotong Qin, Mingyuan Zhang, Yifu Ding, Aoyu Li, Zhongang Cai, Ziwei Liu, Fisher Yu, Xianglong Liu.
Bibtex
@inproceedings{qin2023bibench,
title={BiBench: Benchmarking and Analyzing Network Binarization},
author={Qin, Haotong and Zhang, Mingyuan and Ding, Yifu and Li, Aoyu and Cai, Zhongang and Liu, Ziwei and Yu, Fisher and Liu, Xianglong},
booktitle={International Conference on Machine Learning (ICML)},
year={2023}
}
MQBench
The paper MQBench: Towards Reproducible and Deployable Model Quantization Benchmark (NeurIPS 2021) presents a benchmark and framework for evaluating quantization algorithms under real-world hardware deployments. For details, please refer to:
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark [Paper] [Project]
Yuhang Li, Mingzhu Shen, Jian Ma, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, Fengwei Yu, Junjie Yan.
Bibtex
@article{2021MQBench,
title = "MQBench: Towards Reproducible and Deployable Model Quantization Benchmark",
author= "Yuhang Li* and Mingzhu Shen* and Jian Ma* and Yan Ren* and Mingxin Zhao* and Qi Zhang* and Ruihao Gong and Fengwei Yu and Junjie Yan",
journal = "https://openreview.net/forum?id=TUplOmF8DsM",
year = "2021"
}
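To give a sense of how such a hardware-aware quantization benchmark is driven in practice, below is a minimal usage sketch following the prepare → calibrate → quantize flow described in the MQBench project README. The module paths, `BackendType` values, and the pretrained-model call are assumptions to verify against the current MQBench release; this is an illustrative sketch, not the authors' reference code.

```python
import torch
import torchvision.models as models
# Assumed MQBench entry points (check against the installed MQBench version):
from mqbench.prepare_by_platform import prepare_by_platform, BackendType  # insert fake-quant nodes for a target backend
from mqbench.utils.state import enable_calibration, enable_quantization   # toggle calibration / quantized simulation

model = models.resnet18(pretrained=True)   # any traceable FP32 model
model.eval()

# Trace the model and insert backend-specific quantizers (TensorRT is used here as an example backend).
model = prepare_by_platform(model, BackendType.Tensorrt)

# Calibration: observers collect activation statistics (scale / zero-point).
enable_calibration(model)
calib_batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # stand-in calibration data
with torch.no_grad():
    for images in calib_batches:
        model(images)

# Switch to simulated low-bit inference for evaluation under the chosen backend's rules.
enable_quantization(model)
```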
Survey_Papers
Survey_of_Binarization
Our survey paper Binary Neural Networks: A Survey (Pattern Recognition) comprehensively reviews recent progress in binary neural networks. For details, please refer to:
Binary Neural Networks: A Survey [Paper] [Blog]
Haotong Qin, Ruihao Gong, Xianglong Liu*, Xiao Bai, Jingkuan Song, and Nicu Sebe.
Bibtex
@article{Qin:pr20_bnn_survey,
title = "Binary neural networks: A survey",
author = "Haotong Qin and Ruihao Gong and Xianglong Liu and Xiao Bai and Jingkuan Song and Nicu Sebe",
journal = "Pattern Recognition",
volume = "105",
pages = "107281",
year = "2020"
}
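The central recipe the survey covers can be stated compactly: binarize weights with the sign function in the forward pass, back-propagate through it with a straight-through estimator (STE), and optionally rescale by the mean absolute weight (XNOR-Net style). A minimal PyTorch sketch of this idea, illustrative only and not the code of any particular surveyed method:

```python
import torch
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Forward: sign(w). Backward: straight-through estimator, clipped to |w| <= 1."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

class BinaryLinear(torch.nn.Linear):
    """Linear layer with {-1, +1} weights and a per-layer scaling factor."""
    def forward(self, x):
        alpha = self.weight.abs().mean()     # scaling factor to reduce binarization error
        w_bin = SignSTE.apply(self.weight)   # binarized weights
        return F.linear(x, alpha * w_bin, self.bias)

layer = BinaryLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()                         # gradients flow to the latent FP32 weights
print(layer.weight.grad.shape)               # torch.Size([4, 16])
```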
Survey_of_Quantization
The survey paper A Survey of Quantization Methods for Efficient Neural Network Inference (arXiv) comprehensively reviews recent progress in quantization. For details, please refer to:
A Survey of Quantization Methods for Efficient Neural Network Inference [Paper]
Amir Gholami*, Sehoon Kim*, Zhen Dong*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer. (* Equal contribution)
Bibtex
@misc{gholami2021survey,
title={A Survey of Quantization Methods for Efficient Neural Network Inference},
author={Amir Gholami and Sehoon Kim and Zhen Dong and Zhewei Yao and Michael W. Mahoney and Kurt Keutzer},
year={2021},
eprint={2103.13630},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
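As a concrete anchor for the uniform (affine) quantizer that most of the surveyed methods build on, here is a minimal sketch of b-bit asymmetric quantization and dequantization. It illustrates the standard formula q = clamp(round(x/s) + z, 0, 2^b - 1) and is not code from the survey:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8):
    """Uniform asymmetric quantization followed by dequantization ("fake quantization")."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = torch.clamp(x_max - x_min, min=1e-8) / (qmax - qmin)        # step size s
    zero_point = torch.clamp(torch.round(-x_min / scale), qmin, qmax)   # integer offset z
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)    # integer code
    x_hat = (q - zero_point) * scale                                    # dequantized value
    return q, x_hat

x = torch.randn(4, 4)
q, x_hat = fake_quantize(x, num_bits=4)
print((x - x_hat).abs().max())   # quantization error shrinks as num_bits grows
```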
Papers
Keywords: qnn: quantized neural networks | bnn: binarized neural networks | hardware: hardware deployment | snn: spiking neural networks | other
2024
- [arXiv] How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study [code] [HuggingFace]
- [arXiv] Accurate LoRA-Finetuning Quantization of LLMs via Information Retention [code]
- [arXiv] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [code]
- [arXiv] BinaryDM: Towards Accurate Binarization of Diffusion Model [code]
- [arXiv] DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- [arXiv] OHQ: On-chip Hardware-aware Quantization
- [arXiv] Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs [code] [hardware]
- [arXiv] Extreme Compression of Large Language Models via Additive Quantization
- [arXiv] Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
- [arXiv] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arXiv] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
- [arXiv] EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [code]
- [arXiv] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- [arXiv] LQER: Low-Rank Quantization Error Reconstruction for LLMs
- [arXiv] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [code]
- [arXiv] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [code]
- [arXiv] L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
- [arXiv] TP-Aware Dequantization
- [arXiv] ApiQ: Finetuning of 2-Bit Quantized Large Language Model
- [arXiv] BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation [code]
- [arXiv] OneBit: Towards Extremely Low-bit Large Language Models
- [arXiv] WKVQuant: Quantising Weight and Key/Value Cache for Large Language Models Gains More
- [arXiv] GPTVQ: The Blessing of Dimensionality for LLM Quantization [code]
- [DAC] APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
- [DAC] A Comprehensive Evaluation of Quantization Strategies for Large Language Models
- [arXiv] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
- [arXiv] Evaluating Quantized Large Language Models
- [arXiv] FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
- [arXiv] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [arXiv] IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
- [arXiv] On the Compressibility of Quantized Large Language Models
- [arXiv] EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- [arXiv] QAQ: Quality Adaptive Quantization for LLM KV Cache [code]
- [arXiv] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
- [arXiv] What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
- [arXiv] SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression [code]
- [ICLR] AffineQuant: Affine Transformation Quantization for Large Language Models [code]
- [ICLR Practical ML for Low Resource Settings Workshop] Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
- [arXiv] Accurate Block Quantization in LLMs with Outliers
- [arXiv] QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs [code]
- [arXiv] Minimize Quantization Output Error with Bias Compensation [code]