# Allocator Performance

Mooncake Store uses two allocators at different layers:

- **MmapArena** — backs the mmap buffer path used by SGLang HiCache for host KV cache memory. When explicitly enabled, it pre-allocates a hugepage-backed pool at startup and serves allocations via lock-free atomic bump pointer (~50ns per allocation). Configured via `MC_MMAP_ARENA_POOL_SIZE` (explicit opt-in) and `MC_DISABLE_MMAP_ARENA`.
- **OffsetAllocator** — manages object-level allocation within Mooncake Store segments for KV cache entries. Evaluated below.

These are complementary: MmapArena handles the backing memory, OffsetAllocator manages logical object placement within that memory.

## OffsetAllocator

We evaluated the performance of [OffsetAllocator](https://github.com/sebbbi/OffsetAllocator), the default memory allocator in Mooncake Store. This allocator is responsible for allocating memory from mounted segments to store the KV cache.

In this context, the most important metric is **memory utilization**, defined as the ratio between the amount of memory that can be successfully allocated and the total available memory. A higher utilization means that more KV tensors can be cached, thereby accelerating LLM tasks. However, due to memory fragmentation, allocation may fail even when the allocated memory is well below the total available capacity.

For the same allocator, memory utilization can vary significantly under different workloads. Therefore, we evaluated the allocator's efficiency across a range of workloads.

In particular, in the **LLM inference** scenario, once the model is fixed, the size of each KV vector is also fixed. This means memory utilization under **uniform allocation sizes** becomes especially important. However, the original OffsetAllocator has a limitation: when the allocation size is uniform but does not match any of OffsetAllocator's predefined bin sizes, memory utilization can be suboptimal.

To address this, we introduced targeted optimizations for uniform-size workloads on top of the original OffsetAllocator. As shown in our test results, the optimized version achieves **significant performance improvements** in such scenarios.

## Execution

```bash
./mooncake-store/benchmarks/allocator_bench
```

## Result

- alloc size: The size of each object
- utilization ratio: The total allocated size / total space
- time: time in nanoseconds for each object allocation
- OffsetAllocator optimization: round up the allocated size to a bin size.

### Uniform size, size equals power of 2

**OffsetAllocator (After Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 32         | 1              | 1              | 544       |
| 128        | 1              | 1              | 417       |
| 512        | 1              | 1              | 174       |
| 2048       | 1              | 1              | 406       |
| 8192       | 1              | 1              | 180       |
| 32768      | 1              | 1              | 133       |
| 131072     | 1              | 1              | 109       |
| 524288     | 1              | 1              | 100       |
| 2097152    | 1              | 1              | 99        |
| 8388608    | 1              | 1              | 99        |
| 33554432   | 1              | 1              | 98        |

**OffsetAllocator (Before Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 32         | 1              | 1              | 539       |
| 128        | 1              | 1              | 419       |
| 512        | 1              | 1              | 217       |
| 2048       | 1              | 1              | 408       |
| 8192       | 1              | 1              | 175       |
| 32768      | 1              | 1              | 130       |
| 131072     | 1              | 1              | 107       |
| 524288     | 1              | 1              | 99        |
| 2097152    | 1              | 1              | 100       |
| 8388608    | 1              | 1              | 98        |
| 33554432   | 1              | 1              | 98        |

### Uniform size, size equals power of 2 +/- 17

**OffsetAllocator (After Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 15         | 1              | 1              | 568       |
| 111        | 0.991071       | 0.991071       | 441       |
| 495        | 0.966797       | 0.966797       | 178       |
| 2031       | 0.991699       | 0.991699       | 418       |
| 8175       | 0.997925       | 0.997925       | 170       |
| 32751      | 0.999481       | 0.999481       | 133       |
| 131055     | 0.99987        | 0.99987        | 109       |
| 524271     | 0.999968       | 0.999968       | 100       |
| 2097135    | 0.999992       | 0.999992       | 99        |
| 8388591    | 0.999998       | 0.999998       | 98        |
| 33554415   | 0.999999       | 0.999999       | 99        |
| 49         | 0.942308       | 0.942308       | 508       |
| 145        | 0.906249       | 0.906249       | 372       |
| 529        | 0.918399       | 0.918399       | 172       |
| 2065       | 0.896267       | 0.896267       | 403       |
| 8209       | 0.89073        | 0.89073        | 174       |
| 32785      | 0.889347       | 0.889347       | 131       |
| 131089     | 0.88897        | 0.88897        | 105       |
| 524305     | 0.888701       | 0.888701       | 102       |
| 2097169    | 0.888679       | 0.888679       | 100       |
| 8388625    | 0.886721       | 0.886721       | 100       |
| 33554449   | 0.875          | 0.875          | 100       |

**OffsetAllocator (Before Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 15         | 1              | 1              | 566       |
| 111        | 0.669866       | 0.710845       | 703       |
| 495        | 0.665779       | 0.676874       | 238       |
| 2031       | 0.668333       | 0.705411       | 637       |
| 8175       | 0.666175       | 0.676474       | 242       |
| 32751      | 0.664435       | 0.669078       | 168       |
| 131055     | 0.66062        | 0.667341       | 124       |
| 524271     | 0.653055       | 0.666993       | 118       |
| 2097135    | 0.64062        | 0.666873       | 116       |
| 8388591    | 0.605468       | 0.667812       | 115       |
| 33554415   | 0.5625         | 0.670944       | 116       |
| 49         | 0.692229       | 0.753062       | 1122      |
| 145        | 0.667789       | 0.700907       | 572       |
| 529        | 0.66577        | 0.676238       | 238       |
| 2065       | 0.667926       | 0.704884       | 632       |
| 8209       | 0.665708       | 0.676372       | 239       |
| 32785      | 0.664224       | 0.669058       | 168       |
| 131089     | 0.659631       | 0.667287       | 129       |
| 524305     | 0.652609       | 0.666884       | 122       |
| 2097169    | 0.638677       | 0.666516       | 120       |
| 8388625    | 0.60547        | 0.665131       | 121       |
| 33554449   | 0.546875       | 0.660917       | 120       |

### Uniform size, size equals power of 2 multiply 0.9 or 1.1

**OffsetAllocator (After Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 28         | 1              | 1              | 543       |
| 115        | 0.958333       | 0.958333       | 418       |
| 460        | 0.958332       | 0.958332       | 189       |
| 1843       | 0.959896       | 0.959896       | 418       |
| 7372       | 0.959895       | 0.959895       | 197       |
| 29491      | 0.959993       | 0.959993       | 135       |
| 117964     | 0.959979       | 0.959979       | 111       |
| 471859     | 0.959985       | 0.959985       | 100       |
| 1887436    | 0.959765       | 0.959765       | 99        |
| 7549747    | 0.959766       | 0.959766       | 99        |
| 30198988   | 0.95625        | 0.95625        | 99        |
| 35         | 0.972222       | 0.972222       | 531       |
| 140        | 0.972222       | 0.972222       | 397       |
| 563        | 0.977427       | 0.977427       | 180       |
| 2252       | 0.97743        | 0.97743        | 389       |
| 9011       | 0.977752       | 0.977752       | 183       |
| 36044      | 0.977752       | 0.977752       | 133       |
| 144179     | 0.977739       | 0.977739       | 106       |
| 576716     | 0.977538       | 0.977538       | 103       |
| 2306867    | 0.977539       | 0.977539       | 99        |
| 9227468    | 0.975391       | 0.975391       | 99        |
| 36909875   | 0.9625         | 0.9625         | 100       |

**OffsetAllocator (Before Optimization)**

| Alloc Size | Min Util Ratio | Avg Util Ratio | Time (ns) |
|------------|----------------|----------------|-----------|
| 28         | 1              | 1              | 539       |
| 115        | 0.669299       | 0.709245       | 701       |
| 460        | 0.665825       | 0.677532       | 255       |
| 1843       | 0.669352       | 0.709202       | 691       |
| 7372       | 0.66619        | 0.677401       | 260       |
| 29491      | 0.664311       | 0.669511       | 172       |
| 117964     | 0.661812       | 0.667356       | 133       |
| 471859     | 0.654345       | 0.667048       | 123       |
| 1887436    | 0.640722       | 0.666447       | 121       |
| 7549747    | 0.611719       | 0.666847       | 119       |
| 30198988   | 0.548437       | 0.669799       | 125       |
| 35         | 0.7098         | 0.774162       | 1306      |
| 140        | 0.667934       | 0.702151       | 599       |
| 563        | 0.665599       | 0.675548       | 239       |
| 2252       | 0.667371       | 0.701623       | 601       |
| 9011       | 0.665485       | 0.675528       | 244       |
| 36044      | 0.663248       | 0.668912       | 170       |
| 144179     | 0.660308       | 0.666934       | 127       |
| 576716     | 0.654467       | 0.66679        | 122       |
| 2306867    | 0.633789       | 0.666159       | 121       |
| 9227468    | 0.597266       | 0.666037       | 118       |
| 36909875   | 0.55           | 0.669564       | 121       |

### Random Size

**OffsetAllocator (After Optimization)**

```
util ratio (min / p99 / p90 / p50 / max / avg):
0.544250 / 0.713338 / 0.779739 / 0.847867 / 0.952591 / 0.841576
avg alloc time: 145.575738 ns/op
```

**OffsetAllocator (Before Optimization)**

```
util ratio (min / p99 / p90 / p50 / max / avg):
0.569255 / 0.712076 / 0.781224 / 0.855046 / 0.976057 / 0.848873
avg alloc time: 142.508508 ns/op
```

**OffsetAllocator (Before Optimization)**

```
util ratio (min / p99 / p90 / p50 / max / avg):
0.569255 / 0.712076 / 0.781224 / 0.855046 / 0.976057 / 0.848873
avg alloc time: 142.508508 ns/op
```

### Paired KV/Indexer Allocation Benchmark (DSA)

In the DSA scenario, Mooncake Store stores both KV cache objects and indexer objects.
We evaluated OffsetAllocator under a paired allocation workload using object sizes derived from a GLM-5.1-FP8 DSA serving configuration.

In this configuration, Mooncake Store stores data at the page granularity, where each page contains 64 tokens. Therefore, the object sizes observed by the allocator are the packed page-level sizes rather than the per-token sizes:

- KV cache object size: 3.12 MB
- Indexer object size: 643 KB

Each benchmark round samples `N` uniformly from `[1, 128]`, then issues `N` KV cache allocations followed by `N` indexer allocations. The pool size is 1024 GB. On allocation failure, the benchmark randomly evicts 5% of all live objects and retries. The utilization ratio below is sampled only when eviction is triggered, immediately before evicting objects.

**OffsetAllocator**

```
util ratio (min / p99 / p90 / p50 / max / avg):
0.951814 / 0.951879 / 0.952077 / 0.952382 / 0.952621 / 0.952343
eviction-trigger samples: 243
avg alloc time: 352.207986 ns/op
```
