Awesome-Quantization-Papers

This repo contains a comprehensive list of papers on model quantization for efficient deep learning, collected from AI conferences, journals, and arXiv. As a highlight, we categorize the papers by model structure and application scenario, and label the quantization methods with keywords.

This repo is being actively updated, and contributions in any form to make this list more comprehensive are welcome. Special thanks to collaborator Zhikai Li, and all researchers who have contributed to this repo!

If you find this repo useful, please consider ★STARing and feel free to share it with others!

[Update: Jul, 2024] Add new papers from CVPR-24.
[Update: May, 2024] Add new papers from ICLR-24.
[Update: Apr, 2024] Add new papers from AAAI-24.
[Update: Nov, 2023] Add new papers from NeurIPS-23.
[Update: Oct, 2023] Add new papers from ICCV-23.
[Update: Jul, 2023] Add new papers from AAAI-23 and ICML-23.
[Update: Jun, 2023] Add new arXiv papers uploaded in May 2023, especially in the fast-moving field of LLM quantization.
[Update: Jun, 2023] This repo is reborn! New style, better experience!

Overview

Keywords: PTQ: post-training quantization | Non-uniform: non-uniform quantization | MP: mixed-precision quantization | Extreme: binary or ternary quantization

Survey

"A Survey of Quantization Methods for Efficient Neural Network Inference", Book Chapter: Low-Power Computer Vision, 2021. [paper]
"Full Stack Optimization of Transformer Inference: a Survey", arXiv, 2023. [paper]
"A White Paper on Neural Network Quantization", arXiv, 2021. [paper]
"Binary Neural Networks: A Survey", PR, 2020. [paper] [Extreme]

Transformer-based Models

Vision Transformers

"PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024. [paper] [PTQ]
"Instance-Aware Group Quantization for Vision Transformers", CVPR, 2024. [paper] [PTQ]
"Bi-ViT: Pushing the Limit of Vision Transformer Quantization", AAAI, 2024. [paper] [Extreme]
"AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries", AAAI, 2024. [paper]
"LRP-QViT: Mixed-Precision Vision Transformer Quantization via Layer-wise Relevance Propagation", arXiv, 2023. [paper] [PTQ] [MP]
"MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer", arXiv, 2023. [paper] [PTQ] [MP]
"I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023. [paper] [code]
"RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023. [paper] [code] [PTQ]
"QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection", ICCV, 2023. [paper]
"BiViT: Extremely Compressed Binary Vision Transformers", ICCV, 2023. [paper] [Extreme]
"Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023. [paper]
"PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile", NeurIPS, 2023. [paper]
"Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023. [paper] [code]
"PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", TNNLS, 2023. [paper]
"Variation-aware Vision Transformer Quantization", arXiv, 2023. [paper]
"NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023. [paper] [PTQ]
"Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023. [paper]
"Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023. [paper]
"Output Sensitivity-Aware DETR Quantization", 2023. [paper]
"Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023. [paper] [PTQ]
"Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022. [paper] [code]
"Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022. [paper] [code] [PTQ]
"PTQ4ViT: Post-Training Quantization for Vision Transformers with Twin Uniform Quantization", ECCV, 2022. [paper] [code] [PTQ]
"FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer", IJCAI, 2022. [paper] [code] [PTQ]
"Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022. [paper]
"Post-Training Quantization for Vision Transformer", NeurIPS, 2021. [paper] [PTQ]

[Back to Overview]

Language Transformers

"OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models", ICLR, 2024. [paper]
"LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models", ICLR, 2024. [paper]
"SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression", ICLR, 2024. [paper] [PTQ]
"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models", ICLR, 2024. [paper]
"QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models", ICLR, 2024. [paper] [PTQ]
"PB-LLM: Partially Binarized Large Language Models", ICLR, 2024. [paper] [Extreme]
"AffineQuant: Affine Transformation Quantization for Large Language Models", ICLR, 2024. [paper]
"Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models", ICLR, 2024. [paper]
"LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models", ICLR, 2024. [paper]
"OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models", AAAI, 2024. [paper]
"Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models", AAAI, 2024. [paper]
"Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge", AAAI, 2024. [paper]
"Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation", AAAI, 2024. [paper] [PTQ]
"What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation", AAAI, 2024. [paper]
"EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs", arXiv, 2024. [paper]
"IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact", arXiv, 2024. [paper]
"FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization", arXiv, 2024. [paper]
"A Comprehensive Evaluation of Quantization Strategies for Large Language Models", arXiv, 2024. [paper]
"GPTVQ: The Blessing of Dimensionality for LLM Quantization", arXiv, 2024. [paper]
"APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models", arXiv, 2024. [paper]
"EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge", arXiv, 2024. [paper]
"RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization", arXiv, 2024. [paper]
"Accurate LoRA-Finetuning Quantization of LLMs via Information Retention", arXiv, 2024. [paper]
"BiLLM: Pushing the Limit of Post-Training Quantization for LLMs", arXiv, 2024. [paper]
"KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization", arXiv, 2023. [paper]
"Extreme Compression of Large Language Models via Additive Quantization", arXiv, 2023. [paper]
"ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks", arXiv, 2023. [paper] [PTQ]
"CBQ: Cross-Block Quantization for Large Language Models", arXiv, 2023. [paper] [PTQ]
"FP8-BERT: Post-Training Quantization for Transformer", arXiv, 2023. [paper] [PTQ]
"Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge", arXiv, 2023. [paper]
"SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM", arXiv, 2023. [paper] [PTQ]
"A Speed Odyssey for Deployable Quantization of LLMs", arXiv, 2023.