Awesome Efficient AIGC
This repo collects efficient approaches for AI-Generated Content (AIGC) to cope with its huge demand for computing resources, covering efficient Large Language Models (LLMs), Diffusion Models (DMs), and more. We are continuously improving the project. Pull requests for works (papers, repositories) the repo has missed are welcome. Special thanks to Xingyu Zheng, Xudong Ma, Yifu Ding, and all researchers who have contributed to this project!
Table of Contents
Survey
- [Arxiv] Efficient Prompting Methods for Large Language Models: A Survey
- [Arxiv] Efficient Diffusion Models for Vision: A Survey
- [Arxiv] Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward [code]
- [Arxiv] A Survey on Knowledge Distillation of Large Language Models [code]
- [Arxiv] Model Compression and Efficient Inference for Large Language Models: A Survey
- [Arxiv] A Survey on Transformer Compression
- [Arxiv] A Comprehensive Survey of Compression Algorithms for Language Models
- [Arxiv] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding [code] [Blog]
- [Arxiv] Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security [code]
- [Arxiv] A Survey on Hardware Accelerators for Large Language Models
- [Arxiv] A Survey of Resource-efficient LLM and Multimodal Foundation Models [code]
- [Arxiv] Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models [code]
- [Arxiv] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- [Arxiv] Efficient Large Language Models: A Survey [code]
- [Arxiv] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [code]
- [Arxiv] A Survey on Model Compression for Large Language Models
- [Arxiv] A Comprehensive Survey on Knowledge Distillation of Diffusion Models
- [TACL] Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
- [JSA] A Survey of Techniques for Optimizing Transformer Inference
- [Arxiv] Understanding LLMs: A Comprehensive Overview from Training to Inference
Language
2024
Quantization
- [ArXiv] How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study [code] [HuggingFace]
- [ArXiv] Accurate LoRA-Finetuning Quantization of LLMs via Information Retention [code]
- [ArXiv] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [code]
- [ArXiv] DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- [ArXiv] Extreme Compression of Large Language Models via Additive Quantization
- [ArXiv] Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
- [ArXiv] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [ArXiv] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
- [ArXiv] EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [code]
- [ArXiv] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- [ArXiv] LQER: Low-Rank Quantization Error Reconstruction for LLMs
- [ArXiv] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [code]
- [ArXiv] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [code]
- [ArXiv] L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
- [ArXiv] TP-Aware Dequantization
- [ArXiv] ApiQ: Finetuning of 2-Bit Quantized Large Language Model
- [ArXiv] BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation [code]
- [ArXiv] OneBit: Towards Extremely Low-bit Large Language Models
- [ArXiv] WKVQuant: Quantising Weight and Key/Value Cache for Large Language Models Gains More
- [ArXiv] GPTVQ: The Blessing of Dimensionality for LLM Quantization [code]
- [DAC] APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
- [DAC] A Comprehensive Evaluation of Quantization Strategies for Large Language Models
- [ArXiv] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
- [ArXiv] Evaluating Quantized Large Language Models
- [ArXiv] FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
- [ArXiv] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [ArXiv] IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
- [ArXiv] On the Compressibility of Quantized Large Language Models
- [ArXiv] EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- [ArXiv] QAQ: Quality Adaptive Quantization for LLM KV Cache [code]
- [ArXiv] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
- [ArXiv] What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
- [ArXiv] SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression [code]
- [ICLR] AffineQuant: Affine Transformation Quantization for Large Language Models [code]
- [ICLR Practical ML for Low Resource Settings Workshop] Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
- [ArXiv] Accurate Block Quantization in LLMs with Outliers
- [ArXiv] QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs [code]
- [ArXiv] Minimize Quantization Output Error with Bias Compensation [code]
- [ArXiv] Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
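
Most of the post-training quantization entries above start from, and improve on, a simple round-to-nearest (RTN) baseline. For orientation only, here is a minimal PyTorch sketch of symmetric per-channel INT4 weight quantization; it is a generic baseline under our own naming, not the algorithm of any specific paper listed:

```python
import torch

def rtn_quantize_per_channel(weight: torch.Tensor, n_bits: int = 4):
    """Symmetric round-to-nearest (RTN) weight quantization, per output channel.

    weight: (out_features, in_features) FP weight of a linear layer.
    Returns integer codes and per-channel scales needed to dequantize.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for INT4
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                      # avoid division by zero
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Example: quantize a random linear layer and check the reconstruction error.
w = torch.randn(4096, 4096)
q, s = rtn_quantize_per_channel(w, n_bits=4)
w_hat = dequantize(q, s)
print((w - w_hat).abs().mean())   # mean absolute quantization error
```

The papers above differ mainly in what they add on top of this kind of baseline: outlier handling, rotations, codebooks, KV-cache-specific schemes, and so on.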
Fine-tuning
- [ArXiv] BitDelta: Your Fine-Tune May Only Be Worth One Bit [code]
- [AAAI EIW Workshop 2024] QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning
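
The fine-tuning entries above share a common pattern: keep a (quantized) base weight frozen and train only a small low-rank adapter. Below is a minimal, illustrative PyTorch sketch of that pattern; module and variable names are ours and do not come from any of the listed papers:

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen base weight plus a trainable low-rank delta (LoRA-style)."""

    def __init__(self, base_weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        out_f, in_f = base_weight.shape
        # The base weight is frozen; in a real system it would stay in low-bit
        # storage and be dequantized on the fly.
        self.register_buffer("base_weight", base_weight)
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))        # trainable, zero-init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = x @ self.base_weight.t()
        delta = (x @ self.lora_a.t()) @ self.lora_b.t() * self.scaling
        return frozen + delta

# Only the adapter parameters receive gradients.
layer = LowRankAdaptedLinear(torch.randn(1024, 1024))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['lora_a', 'lora_b']
```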
Other
- [ArXiv] FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
- [ArXiv] Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
2023
Quantization
- [ICLR] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [code]
- [NeurIPS] QLORA: Efficient Finetuning of Quantized LLMs [code]
- [NeurIPS] Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
- [ICML] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [code]
- [ICML] FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization