📒Introduction
Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome SD Inference with Distributed/Caching/Sampling, please check 📖Awesome-SD-Inference.
©️Citations
@misc{Awesome-LLM-Inference2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with Codes},
  url={https://github.com/DefTruth/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
  author={DefTruth and liyucheng09 and others},
  year={2024}
}
📙Awesome LLM Inference Papers with Codes
🎉Download PDFs
Awesome LLM Inference for Beginners.pdf: 500 pages, covering FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ, etc.
📖Contents
📖Trending LLM/VLM Topics (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2024.04 | 🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech) | [docs] | [Open-Sora] | ⭐️⭐️ |
2024.04 | 🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aims to reproduce Sora (OpenAI's T2V model)(@PKU) | [report] | [Open-Sora-Plan] | ⭐️⭐️ |
2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️ |
2024.05 | 🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️ |
2024.06 | 🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️ |
2024.07 | 🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
2024.07 | 🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️ |
📖LLM Algorithmic/Eval Survey (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2023.10 | [Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn) | [pdf] | [Awesome-LLMs-Evaluation] | ⭐️ |
2023.11 | 🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn) | [pdf] | ⚠️ | ⭐️⭐️ |
2023.11 | [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg) | [pdf] | ⚠️ | ⭐️ |
2023.12 | [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft) | [pdf] | ⚠️ | ⭐️ |
2023.12 | [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University) | [pdf] | ⚠️ | ⭐️ |
2023.12 | 🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu) | [pdf] | ⚠️ | ⭐️⭐️ |
2023.12 | 🔥[Efficient LLMs] Efficient Large Language Models: A Survey(@Ohio State University etc) | [pdf] | [Efficient-LLMs-Survey] | ⭐️⭐️ |
2023.12 | [Serving Survey] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems(@Carnegie Mellon University) | [pdf] | ⚠️ | ⭐️⭐️ |
2024.01 | [Understanding LLMs] Understanding LLMs: A Comprehensive Overview from Training to Inference(@Shaanxi Normal University etc) | [pdf] | ⚠️ | ⭐️⭐️ |
2024.02 | [LLM-Viewer] LLM Inference Unveiled: Survey and Roofline Model Insights(@Zhihang Yuan etc) | [pdf] | [LLM-Viewer] | ⭐️⭐️ |
2024.07 | [Internal Consistency & Self-Feedback] Internal Consistency and Self-Feedback in Large Language Models: A Survey | [pdf] | [ICSF-Survey] | ⭐️⭐️ |
📖LLM Train/Inference Framework/Design (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2020.05 | 🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) | [pdf] | [FlexGen] | ⭐️ |
2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc) | [pdf] | [FlexFlow] | ⭐️ |
2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc) | [pdf] | ⚠️ | ⭐️ |
2023.09 | 🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) | [pdf] | [vllm] | ⭐️⭐️ |
2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks(@Meta AI etc) | [pdf] | [streaming-llm] | ⭐️ |
2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) | [blog] | [Medusa] | ⭐️ |
2023.10 | 🔥[TensorRT-LLM] NVIDIA TensorRT LLM(@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) | [pdf] | [deepspeed-fastgen] | |
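The vLLM entry above centers on PagedAttention, which manages the KV cache in fixed-size blocks instead of one large contiguous buffer per sequence. A minimal sketch of that block-table idea, with purely illustrative names (this is not vLLM's actual API, and real block sizes are larger, e.g. 16 tokens):

```python
# Sketch of a paged KV cache: each sequence owns a block table mapping its
# logical token positions to physical cache blocks, allocated on demand.
BLOCK_SIZE = 4  # tokens per block; kept small for the demo (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a cache slot for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # current block full (or none yet)
            table.append(self.free_blocks.pop())  # grab a fresh physical block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are handed out one at a time as tokens arrive, no sequence has to pre-reserve space for its maximum length, and internal fragmentation is bounded by less than one block per sequence — the core memory win the vLLM paper reports.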