Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Lin, Bin; Zhang, Chen; Peng, Tao; Zhao, Hanyu; Xiao, Wencong; Sun, Minmin; Liu, Anmin; Zhang, Zhipeng; Li, Lanbo; Qiu, Xiafei; Li, Shen; Ji, Zhigang; Xie, Tao; Li, Yong; Lin, Wei

arXiv:2401.02669 (cs)

[Submitted on 5 Jan 2024 (v1), last revised 4 Jul 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly dynamic behavior of the attention layers, showcasing significant differences in computational characteristics and memory requirements from the non-attention layers. This presents substantial challenges for resource management and performance optimization in service systems. Existing static model parallelism and resource allocation strategies fall short when dealing with this dynamicity. To address the issue, we propose Infinite-LLM, a novel LLM serving system designed to effectively handle dynamic context lengths. Infinite-LLM disaggregates attention layers from an LLM's inference process, facilitating flexible and independent resource scheduling that optimizes computational performance and enhances memory utilization jointly. By leveraging a pooled GPU memory strategy across a cluster, Infinite-LLM not only significantly boosts system throughput but also supports extensive context lengths. Evaluated on a dataset with context lengths ranging from a few to 2000K tokens across a cluster with 32 A100 GPUs, Infinite-LLM demonstrates throughput improvement of 1.35-3.4x compared to state-of-the-art methods, enabling efficient and elastic LLM deployment.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
Cite as:	arXiv:2401.02669 [cs.DC]
	(or arXiv:2401.02669v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2401.02669

Submission history

From: Bin Lin
[v1] Fri, 5 Jan 2024 06:53:00 UTC (20,608 KB)
[v2] Thu, 4 Jul 2024 15:12:54 UTC (13,201 KB)
