Mooncake x vLLM Integration#

Overview#

Mooncake integrates with vLLM to accelerate large language model serving through high-performance KV cache transfer and shared storage. The integration supports two primary scenarios:

  • Disaggregated Prefill-Decode Serving: Seamlessly split prefill and decode across nodes using MooncakeConnector, with RDMA-powered cross-node KV cache transfer achieving up to 142.25 GB/s peak bandwidth (71.1% utilization of 8x RoCE). Transfer overhead is negligible — for 32K-token prompts (4.50 GB of KV data), transfer takes only 31.65 ms, accounting for just 4.2% of total TTFT.

  • KV Cache Storage & Sharing: Extend effective KV cache capacity via MooncakeStore / MooncakeStoreConnector, with hash-based prefix caching that enables multiple vLLM instances to share cached KV blocks. Supports CPU/Disk offloading and dynamic XpYd topologies at runtime.

Scenario

Guide

vLLM Backend

PD Disaggregation (KV transfer)

Disaggregated Prefill-Decode

V1 ✅ / V0 ⚠️

KV Cache Storage & Sharing

KV Cache Storage with MooncakeStore

V1 ✅ / V0 ⚠️

For detailed benchmark coverage across these scenarios, see vLLM Integration Performance Benchmarks.

New to Mooncake + vLLM?

Start with the V1 guides above. Legacy V0 documentation is available for existing deployments only.


Getting Started#

Disaggregated Prefill-Decode#

Direct KV cache transfer between prefill and decode nodes via MooncakeConnector using RDMA.

KV Cache Storage & Sharing#

Distributed KV cache storage via MooncakeStore / MooncakeStoreConnector for offloading, prefix caching, and cross-instance sharing.


Archived Documentation#

The following pages are from earlier versions of the integration and are no longer maintained. All content has been consolidated into the scenario-based guides above.