TENT Slice Spraying#
Overview#
This document describes TENT’s Slice Spraying mechanism, which enables efficient data movement in multi-rail RDMA environments through intelligent device selection and adaptive load balancing.
Background#
In multi-rail RDMA environments, naive round-robin striping leads to suboptimal performance because:
NUMA Effects: Cross-NUMA access incurs additional latency and reduces effective bandwidth
Load Imbalance: Static striping cannot adapt to dynamic load conditions
Heterogeneous Link Quality: Different rails may have different effective bandwidth due to congestion or hardware characteristics
TENT addresses these issues through:
NUMA-aware device selection with configurable penalties
EWMA-based bandwidth estimation for adaptive load balancing
Dynamic multi-path allocation for large transfers
Architecture#
Device Selector#
The DeviceSelector component is responsible for choosing which RDMA device(s) to use for each transfer request. It operates in two modes:
Baseline Mode (Round-Robin)#
When enable_smart_scheduling = false, the selector uses simple round-robin within the highest-priority device tier (typically local NUMA devices):
For each request:
1. Find first non-empty device tier (local NUMA preferred)
2. Select devices round-robin within that tier
3. Ignore lower-priority tiers
Characteristics:
Deterministic behavior
No runtime overhead for tracking
Consistent with original TE behavior
Does not adapt to load conditions
Smart Mode (EWMA-Based Selection)#
When enable_smart_scheduling = true, the selector uses an EWMA-based algorithm:
For each request:
1. Calculate predicted completion time for each device:
predicted_time = (inflight_bytes + slice_bytes) / ewma_bandwidth
2. Apply NUMA penalty based on tier:
score = predicted_time × numa_tier_weights[tier]
3. Select device(s) with minimum score:
- Single slice: best device only
- Multiple slices: weighted distribution across devices
4. Update EWMA bandwidth on completion:
ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
where α = bandwidth_learning_rate
Characteristics:
Adapts to changing load conditions
Prefers local NUMA devices
Spreads load across multiple rails
Higher runtime overhead
NUMA-Aware Selection#
Devices are organized into tiers based on NUMA distance:
Tier |
Description |
Default Penalty |
|---|---|---|
Rank 0 |
Local NUMA |
1.0 (baseline) |
Rank 1 |
Remote NUMA (tier 1) |
5.0 |
Rank 2 |
Remote NUMA (tier 2) |
10.0 |
The penalty is applied as a multiplier to predicted completion time, making remote devices less attractive unless local devices are heavily loaded.
EWMA Bandwidth Estimation#
Each device maintains an EWMA (Exponentially Weighted Moving Average) of its effective bandwidth:
initial_value = theoretical_bandwidth
on_transfer_complete:
observed_bandwidth = transfer_size / transfer_time
ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
ewma_bandwidth = clamp(ewma_bandwidth,
0.1 × theoretical,
10.0 × theoretical)
where α = bandwidth_learning_rate.
Note on terminology: The EWMA formula uses α as the coefficient for the old value. Therefore:
Lower α (closer to 0) → more weight on new observations → faster adaptation
Higher α (closer to 1) → more weight on old value → slower adaptation
Examples:
α = 0:
ewma_bandwidth = observed_bandwidth(full adaptation, always use new value)α = 1:
ewma_bandwidth = ewma_bandwidth(no learning, never update)α = 0.01:
ewma_bandwidth = 0.01 × old + 0.99 × new(default, gradual adaptation)
The EWMA provides:
Memory: Recent observations have more influence than old ones
Stability: Smooths out transient fluctuations
Adaptability: Tracks gradual changes in link quality
Multi-Path Allocation#
For large transfers, TENT distributes slices across multiple devices:
Single Path (small requests):
All slices go to the single best device
Minimizes coordination overhead
Multi Path (large requests):
Normal mode (99% of calls): Slices distributed proportionally to device capacity
Each device gets:
(device_weight / total_weight) × num_slicesRemaining slices assigned to best device
Probe mode (1% of calls, every 100th call): Slices distributed round-robin
Purpose: Ensure all devices are continuously sampled for EWMA updates
Prevents EWMA starvation for less-used devices
Request Flow#
┌──────────────┐
│ Application │
└──────┬───────┘
│ submitTransfer()
▼
┌──────────────────────────────────────┐
│ RdmaTransport::submitTransferTasks │
│ - Split large requests into slices │
│ - Call DeviceSelector for allocation │
│ - Only if num_slices >= max_slice_count/2 │
└──────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ DeviceSelector::allocate │
│ ┌────────────────────────────────┐ │
│ │ smart_selection_enabled? │ │
│ └────┬──────────────────────┬────┘ │
│ │ Yes │ No │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Smart │ │ Baseline│ │
│ │ Mode │ │ Mode │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Return slice_dev_ids │ │
│ └────────────────────────────────┘ │
└──────────────────────────────────────┘
Configuration#
All slice spraying parameters are configurable via the configuration file:
Core Scheduling#
{
"transports": {
"rdma": {
"enable_smart_scheduling": true
}
}
}
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
bool |
|
Enable EWMA-based selection (false = round-robin) |
NUMA Penalties#
{
"transports": {
"rdma": {
"numa_penalties": [1.0, 5.0, 10.0]
}
}
}
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
array[float] |
|
Penalty multipliers for each NUMA tier |
Guidelines:
Higher values = stronger preference for local devices
Set all to
1.0to disable NUMA awarenessIncrease remote penalties if cross-NUMA latency is high
Bandwidth Estimation#
{
"transports": {
"rdma": {
"bandwidth_learning_rate": 0.01,
"ewma_min_bandwidth_multiplier": 0.1,
"ewma_max_bandwidth_multiplier": 10.0
}
}
}
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
float |
|
EWMA learning rate (0.0 = full adaptation, 1.0 = no learning) |
|
float |
|
Minimum bandwidth as fraction of theoretical |
|
float |
|
Maximum bandwidth as fraction of theoretical |
Guidelines:
Lower α (e.g., 0.001) → faster adaptation, more volatile → responds quickly to changes
Higher α (e.g., 0.1) → slower adaptation, more stable → smooths out transient fluctuations
Default α = 0.01 provides balanced adaptation
Multipliers constrain EWMA to reasonable range [0.1×, 10.0×] of theoretical bandwidth
Device Selection Scoring#
{
"transports": {
"rdma": {
"score_jitter_range": 1e-9,
"score_epsilon": 1e-12
}
}
}
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
float |
|
Random jitter range to avoid deterministic selection |
|
float |
|
Small value to prevent division by zero |
Bandwidth Constants#
{
"transports": {
"rdma": {
"default_bandwidth_gbps": 400.0,
"min_bandwidth_gbps": 10.0,
"max_bandwidth_gbps": 800.0
}
}
}
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
float |
|
Default NIC bandwidth when topology info unavailable |
|
float |
|
Minimum valid NIC bandwidth (Gbps) |
|
float |
|
Maximum valid NIC bandwidth (Gbps) |
Notes:
These constants define the valid range and default for device bandwidth
Used in EWMA calculations and theoretical bandwidth estimation
If a device’s reported bandwidth is outside [min, max], default_bandwidth is used
Usage Examples#
Example 1: Latency-Sensitive Workload#
For latency-sensitive queries where local NUMA access is critical:
{
"transports": {
"rdma": {
"enable_smart_scheduling": true,
"numa_penalties": [1.0, 100.0, 1000.0],
"bandwidth_learning_rate": 0.001
}
}
}
Effect: Strongly prefers local devices, slow adaptation for stability.
Example 2: Bulk Data Transfer#
For bulk transfers where throughput is more important than latency:
{
"transports": {
"rdma": {
"enable_smart_scheduling": true,
"numa_penalties": [1.0, 2.0, 3.0],
"bandwidth_learning_rate": 0.1
}
}
}
Effect: Allows cross-NUMA transfers, fast adaptation to load.
Example 3: Baseline Mode#
For deterministic performance matching original TE:
{
"transports": {
"rdma": {
"enable_smart_scheduling": false
}
}
}
Effect: Round-robin within local NUMA tier, no adaptation, minimal overhead.
Performance Considerations#
Overhead Comparison#
Mode |
CPU Overhead |
Adaptability |
NUMA Awareness |
|---|---|---|---|
Baseline |
Minimal |
None |
Tier-based (static) |
Smart |
Moderate |
EWMA-based |
Dynamic + penalty |
When to Use Each Mode#
Use Baseline Mode when:
Workload is uniform and predictable
Deterministic performance is required
CPU overhead must be minimized
All devices are in same NUMA node
Use Smart Mode when:
Workload is heterogeneous
Link quality varies over time
NUMA effects are significant
Maximum throughput is desired
Tuning Guidelines#
Start with baseline mode to establish performance baseline
Enable smart mode with conservative parameters:
numa_penalties = [1.0, 2.0, 5.0]bandwidth_learning_rate = 0.01
Monitor performance and adjust based on observations:
If cross-NUMA transfers are too frequent: increase remote penalties
If adaptation is too slow (EWMA not keeping up with load changes): decrease α
If performance is unstable (too much fluctuation): increase α
Troubleshooting#
Problem: All requests go to cross-NUMA devices#
Symptoms: Poor performance, high latency
Diagnosis:
device_selector_->printTrafficStats();
Solution: Check numa_penalties configuration. Ensure local devices have lowest penalty (1.0).
Problem: Performance worse than baseline#
Symptoms: Smart mode slower than baseline mode
Possible causes:
Learning rate too high (volatile decisions)
NUMA penalties too low (not preferring local)
Score jitter too large (too much randomness)
Solution: Use more conservative:
{
"bandwidth_learning_rate": 0.001,
"numa_penalties": [1.0, 10.0, 100.0],
"score_jitter_range": 1e-12
}