# LLM Serving Papers

Last Update: 2024-09-11

## Survey

| Paper | Code | Category | Abstract |
| --- | --- | --- | --- |
| [2407.12391] LLM Inference Serving: Survey of Recent Advances and Opportunities (arxiv.org) | | | Focuses on system-level enhancements that do not alter LLM decoding algorithms, covering memory (KV cache management), compute (scheduling), scalability (cloud-LLM), and emerging fields |
| [2407.14645] Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference (arxiv.org) | | | |
| [2404.14294] A Survey on Efficient Inference for Large Language Models (arxiv.org) | | | |
| [2402.16363] LLM Inference Unveiled: Survey and Roofline Model Insights (arxiv.org) | https://github.com/hahnyuan/LLM-Viewer | | Framework based on the Roofline model that profiles different hardware and inference configurations; covers model compression, algorithm, system, and hardware optimizations (see the roofline sketch after this table) |
| The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving | | Long Context | |
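
To make the roofline framing concrete, here is a minimal sketch (not LLM-Viewer's actual API; the peak-FLOPs and bandwidth numbers are illustrative assumptions) that computes a kernel's arithmetic intensity and classifies it as memory-bound or compute-bound:

```python
# Minimal roofline sketch (illustrative; not the LLM-Viewer API).
# Arithmetic intensity = FLOPs / bytes moved; compare it against the
# hardware "ridge point" = peak_flops / peak_bandwidth.

def roofline(flops: float, bytes_moved: float,
             peak_flops: float = 312e12,       # assumed: ~A100 FP16 peak, 312 TFLOP/s
             peak_bandwidth: float = 2.0e12):  # assumed: ~A100 HBM, 2 TB/s
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bandwidth
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, attainable, bound

# Decode-step GEMV for one (d x d) weight matrix at batch size 1, FP16:
# about 2*d*d FLOPs while reading about 2*d*d bytes of weights.
d = 4096
flops = 2 * d * d
bytes_moved = 2 * d * d
print(roofline(flops, bytes_moved))  # low intensity -> memory-bound decode
```

A single-token decode GEMV performs roughly one FLOP per byte moved, far below the ridge point, which is why decode is typically limited by memory bandwidth rather than compute.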

## Offloading

| Paper | Code | Scenario | Challenge | Method | Category |
| --- | --- | --- | --- | --- | --- |
| [2303.06865] FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (arxiv.org) | https://github.com/FMInference/FlexiGen | Latency-insensitive tasks with batch processing | Inefficiency of offloading across GPU, CPU, and disk: what to offload, where to offload in the memory hierarchy, and when to offload during inference | Models inference as a computational graph; builds a cost model by profiling I/O and compute, with tensor placements as variables, and solves it as a linear program (see the placement sketch after this table) | |
| [2408.07092] Post-Training Sparse Attention with Double Sparsity (arxiv.org) | https://github.com/andy-yang-1/DoubleSparse | Post-training sparse attention that retains the full KV cache | Retrieval accuracy (retain the full KV cache), hardware friendliness (contiguous memory access), memory usage (manage offloading with the full KV cache) | Two-level algorithm: static channel sparsity + dynamic token sparsity. Engineering: compact label cache, token prefetching based on embedding similarity (see the sparsity sketch after this table) | |
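
To illustrate FlexGen's approach, here is a minimal sketch of a profiling-based offloading cost model. It uses made-up bandwidth, capacity, and tensor-size numbers and a brute-force search over discrete placements rather than FlexGen's actual linear program:

```python
# Sketch of an offloading cost model in the spirit of FlexGen (not its
# actual linear program): enumerate where each tensor group lives
# (GPU / CPU / disk) and pick the feasible placement with the lowest
# estimated per-token latency. All numbers below are illustrative.
from itertools import product

BANDWIDTH = {"gpu": float("inf"), "cpu": 25e9, "disk": 2e9}  # bytes/s into GPU (assumed)
CAPACITY  = {"gpu": 16e9, "cpu": 200e9, "disk": 1e12}        # bytes (assumed)

def per_token_latency(sizes, placement, compute_time):
    """Estimated decode-step latency: compute time plus transfer time
    for tensors that are not resident on the GPU (no overlap modeled)."""
    transfer = sum(sizes[t] / BANDWIDTH[placement[t]]
                   for t in sizes if placement[t] != "gpu")
    return compute_time + transfer

def feasible(sizes, placement):
    used = {dev: 0.0 for dev in CAPACITY}
    for t, dev in placement.items():
        used[dev] += sizes[t]
    return all(used[dev] <= CAPACITY[dev] for dev in CAPACITY)

def best_placement(sizes, compute_time):
    tensors = list(sizes)
    best = None
    for devs in product(CAPACITY, repeat=len(tensors)):
        placement = dict(zip(tensors, devs))
        if not feasible(sizes, placement):
            continue
        lat = per_token_latency(sizes, placement, compute_time)
        if best is None or lat < best[1]:
            best = (placement, lat)
    return best

# Example: 30B-class FP16 weights (~60 GB) and a large batched KV cache.
sizes = {"weights": 60e9, "kv_cache": 40e9}
print(best_placement(sizes, compute_time=0.05))
```

FlexGen additionally overlaps I/O with compute and optimizes block schedules, so treat this only as the shape of the search problem: profiled costs in, placement decision out.

And a minimal sketch of Double Sparsity's two-level idea (simplified; not the DoubleSparse implementation): a static set of "heavy" channels forms a compact label cache used to approximate attention scores, and only the dynamically selected top-k tokens receive full attention. The channel indices, head dimension, and k below are assumptions:

```python
# Sketch of static channel sparsity + dynamic token sparsity
# (simplified; not the DoubleSparse implementation).
import numpy as np

def decode_step(q, K, V, heavy_channels, k=64):
    # Static channel sparsity: score all cached tokens using only the
    # pre-identified heavy channels (this slice of K is what would be
    # kept resident as the compact label cache).
    approx_scores = q[heavy_channels] @ K[:, heavy_channels].T

    # Dynamic token sparsity: keep the k tokens with the largest
    # approximate scores; only their full K/V rows are touched
    # (or prefetched from CPU when the full KV cache is offloaded).
    topk = np.argpartition(approx_scores, -k)[-k:]

    scores = (q @ K[topk].T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[topk]

# Toy usage: 4096 cached tokens, head dim 128, 16 "heavy" channels (assumed).
rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 128)).astype(np.float32)
V = rng.standard_normal((4096, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
heavy = rng.choice(128, size=16, replace=False)
print(decode_step(q, K, V, heavy).shape)  # (128,)
```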
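The two sketches above correspond to the two table rows in order; both omit batching, multi-head layout, and kernel-level details that the papers address.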
