Mooncake x SGLang Integration

Mooncake x SGLang Integration#

Mooncake integrates with SGLang through two paths — PD Disaggregation for cross-instance KV cache transfer via the Transfer Engine, and HiCache L3 Backend for hierarchical KV cache storage with Mooncake Store.


PD Disaggregation#

SGLang uses Mooncake’s Transfer Engine for direct zero-copy KV cache transfer between prefill and decode instances over RDMA, with support for EP and EPD backends. In benchmarks, PD disaggregation with Mooncake achieves ~30% lower ITL while maintaining comparable throughput.

  +-----------+   Transfer Engine (RDMA)   +------------+
  | SGLang    | ◄━━━━━━━━━━━━━━━━━━━━━━►   | SGLang     |
  | Prefill   |     KV cache blocks        | Decode     |
  +-----------+                            +------------+

Related: Full PD Disaggregation Guide — installation, cross-node/same-node setup, XpYd topology, EP backend for MoE models, and EPD backend for multimodal models.

Benchmark: PD Disaggregation Performance — compares 1P1D disaggregation with regular SGLang instances.


HiCache with Mooncake Store#

HiCache extends SGLang’s RadixAttention with three memory tiers, using Mooncake Store as the distributed L3 backend. When local cache misses, HiCache automatically prefetches KV blocks from remote storage via RDMA.

  +----------------------------------------------+
  | SGLang + HiCache                             |
  |  ┌─────────┐  ┌─────────┐  ┌──────────────┐  |
  |  │ L1(GPU) │  │ L2(CPU) │  │ L3(Mooncake) │  |
  |  └─────────┘  └─────────┘  └──────┬───────┘  |
  +-----------------------------------+----------+
                                      |
                             +--------+---------+
                             | Mooncake Store   |
                             | Distributed Pool |
                             +------------------+

Related: