TENT Slice Spraying#

Overview#

This document describes TENT’s Slice Spraying mechanism, which enables efficient data movement in multi-rail RDMA environments through intelligent device selection and adaptive load balancing.

Background#

In multi-rail RDMA environments, naive round-robin striping leads to suboptimal performance because:

NUMA Effects: Cross-NUMA access incurs additional latency and reduces effective bandwidth
Load Imbalance: Static striping cannot adapt to dynamic load conditions
Heterogeneous Link Quality: Different rails may have different effective bandwidth due to congestion or hardware characteristics

TENT addresses these issues through:

NUMA-aware device selection with configurable penalties
EWMA-based bandwidth estimation for adaptive load balancing
Dynamic multi-path allocation for large transfers

Architecture#

Device Selector#

The DeviceSelector component is responsible for choosing which RDMA device(s) to use for each transfer request. It operates in two modes:

Baseline Mode (Round-Robin)#

When enable_smart_scheduling = false, the selector uses simple round-robin within the highest-priority device tier (typically local NUMA devices):

For each request:
  1. Find first non-empty device tier (local NUMA preferred)
  2. Select devices round-robin within that tier
  3. Ignore lower-priority tiers

Characteristics:

Deterministic behavior
No runtime overhead for tracking
Consistent with the original Transfer Engine (TE) behavior
Does not adapt to load conditions
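
The baseline steps above can be sketched in a few lines of Python. This is a minimal illustration, not the TENT implementation: the class name `BaselineSelector`, the device names, and the choice to keep one rotating cursor across requests are all assumptions made for the example.

```python
class BaselineSelector:
    """Round-robin within the first non-empty tier (hypothetical sketch)."""

    def __init__(self, tiers):
        # tiers[0] holds local-NUMA device ids, later entries are more remote.
        self.tiers = tiers
        self._next = 0  # rotating cursor, assumed to persist across requests

    def allocate(self, num_slices):
        # Step 1: find the first non-empty tier; lower-priority tiers are ignored.
        tier = next((t for t in self.tiers if t), None)
        if tier is None:
            raise RuntimeError("no RDMA devices available")
        # Step 2: hand out devices round-robin within that tier.
        picks = []
        for _ in range(num_slices):
            picks.append(tier[self._next % len(tier)])
            self._next += 1
        return picks


selector = BaselineSelector([["mlx5_0", "mlx5_1"], ["mlx5_2"]])
first = selector.allocate(3)   # ['mlx5_0', 'mlx5_1', 'mlx5_0']
second = selector.allocate(2)  # ['mlx5_1', 'mlx5_0']
```

The remote device `mlx5_2` is never chosen while the local tier has any device, which is the static behavior the characteristics list describes.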

Smart Mode (EWMA-Based Selection)#

When enable_smart_scheduling = true, the selector uses an EWMA-based algorithm:

For each request:
  1. Calculate predicted completion time for each device:
     predicted_time = (inflight_bytes + slice_bytes) / ewma_bandwidth

  2. Apply NUMA penalty based on tier:
     score = predicted_time × numa_penalties[tier]

  3. Select device(s) with minimum score:
     - Single slice: best device only
     - Multiple slices: weighted distribution across devices

  4. Update EWMA bandwidth on completion:
     ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
     where α = bandwidth_learning_rate

Characteristics:

Adapts to changing load conditions
Prefers local NUMA devices
Spreads load across multiple rails
Higher runtime overhead
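
Steps 1 to 3 for a single slice can be illustrated as follows. This is a sketch under stated assumptions: the `Device` dataclass, the function names, and the bandwidth figures are hypothetical, and bandwidth is expressed in bytes per second so that the score is a time in seconds.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    tier: int               # 0 = local NUMA, higher = more remote
    ewma_bandwidth: float   # estimated bytes per second
    inflight_bytes: int = 0


def score(dev, slice_bytes, numa_penalties):
    # Step 1: predicted completion time if this slice is queued on the device.
    predicted_time = (dev.inflight_bytes + slice_bytes) / dev.ewma_bandwidth
    # Step 2: scale by the NUMA penalty of the device's tier.
    return predicted_time * numa_penalties[dev.tier]


def pick_best(devices, slice_bytes, numa_penalties=(1.0, 5.0, 10.0)):
    # Step 3 (single slice): the device with the minimum score wins.
    return min(devices, key=lambda d: score(d, slice_bytes, numa_penalties))


local = Device("local", tier=0, ewma_bandwidth=50e9)
remote = Device("remote", tier=1, ewma_bandwidth=50e9)
slice_bytes = 64 * 1024

idle_choice = pick_best([local, remote], slice_bytes).name   # 'local'
local.inflight_bytes = 1_000_000                              # local rail gets busy
busy_choice = pick_best([local, remote], slice_bytes).name   # 'remote'
```

With both devices idle the local rail wins because of the 5.0 penalty on the remote one; once enough bytes are in flight locally, the remote rail's penalized score becomes the smaller of the two.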

NUMA-Aware Selection#

Devices are organized into tiers based on NUMA distance:

| Tier | Description | Default Penalty |
|------|-------------|-----------------|
| Rank 0 | Local NUMA | 1.0 (baseline) |
| Rank 1 | Remote NUMA (tier 1) | 5.0 |
| Rank 2 | Remote NUMA (tier 2) | 10.0 |

The penalty is applied as a multiplier to predicted completion time, making remote devices less attractive unless local devices are heavily loaded.
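
As a worked illustration of "heavily loaded", consider one local device and one tier-1 remote device. The symbols are assumptions for this example only: both devices have the same estimated bandwidth B, the slice is s bytes, and I_local and I_remote are the bytes currently in flight on each. With the default penalties, the local device keeps winning while its score is not larger than the remote one:

```latex
\frac{I_{\text{local}} + s}{B} \times 1.0 \;\le\; \frac{I_{\text{remote}} + s}{B} \times 5.0
\quad\Longleftrightarrow\quad
I_{\text{local}} \;\le\; 5\,I_{\text{remote}} + 4s
```

For an idle remote device and a 64 KiB slice, the local device is preferred until its backlog exceeds 4 × 64 KiB = 256 KiB.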

EWMA Bandwidth Estimation#

Each device maintains an EWMA (Exponentially Weighted Moving Average) of its effective bandwidth:

initial_value = theoretical_bandwidth

on_transfer_complete:
  observed_bandwidth = transfer_size / transfer_time
  ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
  ewma_bandwidth = clamp(ewma_bandwidth,
                        0.1 × theoretical,
                        10.0 × theoretical)

where α = bandwidth_learning_rate.

Note on terminology: The EWMA formula uses α as the coefficient for the old value. Therefore:

Lower α (closer to 0) → more weight on new observations → faster adaptation
Higher α (closer to 1) → more weight on old value → slower adaptation

Examples:

α = 0: ewma_bandwidth = observed_bandwidth (full adaptation, always use new value)
α = 1: ewma_bandwidth = ewma_bandwidth (no learning, never update)
α = 0.01: ewma_bandwidth = 0.01 × old + 0.99 × new (default; each new observation carries 99% of the weight, so the estimate follows it closely)

The EWMA provides:

Memory: Recent observations have more influence than old ones
Stability: Smooths out transient fluctuations
Adaptability: Tracks gradual changes in link quality
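
The update rule and clamp above translate directly into a small function. This is a sketch rather than the TENT code: the function name and argument names are chosen for the example, and the defaults mirror the values quoted in this section.

```python
def update_ewma(ewma_bandwidth, transfer_size, transfer_time, theoretical,
                alpha=0.01, min_mult=0.1, max_mult=10.0):
    """Return the new bandwidth estimate after one completed transfer."""
    observed = transfer_size / transfer_time
    # alpha weights the OLD estimate, (1 - alpha) weights the new observation.
    ewma = alpha * ewma_bandwidth + (1 - alpha) * observed
    # Keep the estimate within [min_mult, max_mult] x theoretical bandwidth.
    return min(max(ewma, min_mult * theoretical), max_mult * theoretical)


theoretical = 50e9  # bytes per second, illustrative value
# A 1 GB transfer that took 40 ms is observed at 25e9 bytes per second.
full = update_ewma(theoretical, 1e9, 0.04, theoretical, alpha=0.0)     # ~25e9
frozen = update_ewma(theoretical, 1e9, 0.04, theoretical, alpha=1.0)   # ~50e9
default = update_ewma(theoretical, 1e9, 0.04, theoretical)             # ~25.25e9
# A very slow observation is clamped to 0.1 x theoretical.
floor = update_ewma(theoretical, 1e6, 1.0, theoretical, alpha=0.0)     # ~5e9
```

The three α cases reproduce the examples listed above: α = 0 adopts the observation, α = 1 never moves, and α = 0.01 lands within 1% of the gap from the observation.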

Multi-Path Allocation#

For large transfers, TENT distributes slices across multiple devices:

Single Path (small requests):

All slices go to the single best device
Minimizes coordination overhead

Multi Path (large requests):

Normal mode (99% of calls): Slices distributed proportionally to device capacity
- Each device gets: (device_weight / total_weight) × num_slices
- Remaining slices assigned to best device
Probe mode (1% of calls, every 100th call): Slices distributed round-robin
- Purpose: Ensure all devices are continuously sampled for EWMA updates
- Prevents EWMA starvation for less-used devices
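
The two multi-path modes can be sketched as below. Several details are assumptions, since the text does not define them: how `device_weight` is derived (weights are simply passed in here), that calls are counted from 1, and that the probe fires when the call index is a multiple of 100.

```python
PROBE_INTERVAL = 100  # every 100th call is a probe, per the description above


def allocate_slices(weights, num_slices, call_index):
    """weights: dict mapping device name to a positive weight (derivation assumed)."""
    devices = list(weights)
    if call_index % PROBE_INTERVAL == 0:
        # Probe mode: plain round-robin so every device gets sampled.
        return [devices[i % len(devices)] for i in range(num_slices)]
    # Normal mode: each device gets floor((weight / total) * num_slices) slices.
    total = sum(weights.values())
    counts = {d: int(weights[d] / total * num_slices) for d in devices}
    # Slices lost to rounding go to the best (highest-weight) device.
    best = max(devices, key=weights.get)
    counts[best] += num_slices - sum(counts.values())
    return [d for d in devices for _ in range(counts[d])]


weights = {"a": 3.0, "b": 1.0}
normal = allocate_slices(weights, 10, call_index=1)    # 8 slices on 'a', 2 on 'b'
probe = allocate_slices(weights, 10, call_index=100)   # 5 slices on each device
```

In the normal call, the proportional shares round down to 7 and 2, and the one remaining slice goes to the best device, giving 8 and 2.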

Request Flow#

┌──────────────┐
│ Application  │
└──────┬───────┘
       │ submitTransfer()
       ▼
┌─────────────────────────────────────────────┐
│  RdmaTransport::submitTransferTasks         │
│  - Split large requests into slices         │
│  - Call DeviceSelector for allocation       │
│  - Only if num_slices >= max_slice_count/2  │
└──────┬──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│     DeviceSelector::allocate         │
│  ┌────────────────────────────────┐  │
│  │ smart_selection_enabled?       │  │
│  └────┬──────────────────────┬────┘  │
│       │ Yes                  │ No    │
│       ▼                      ▼       │
│  ┌─────────┐            ┌─────────┐  │
│  │  Smart  │            │ Baseline│  │
│  │  Mode   │            │   Mode  │  │
│  └────┬────┘            └────┬────┘  │
│       │                      │       │
│       └──────────┬───────────┘       │
│                  ▼                   │
│  ┌────────────────────────────────┐  │
│  │  Return slice_dev_ids          │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

Configuration#

All slice spraying parameters are configurable via the configuration file:

Core Scheduling#

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true
    }
  }
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_smart_scheduling` | bool | `true` | Enable EWMA-based selection (false = round-robin) |

NUMA Penalties#

{
  "transports": {
    "rdma": {
      "numa_penalties": [1.0, 5.0, 10.0]
    }
  }
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `numa_penalties` | array[float] | `[1.0, 5.0, 10.0]` | Penalty multipliers for each NUMA tier |

Guidelines:

Higher values = stronger preference for local devices
Set all to 1.0 to disable NUMA awareness
Increase remote penalties if cross-NUMA latency is high

Bandwidth Estimation#

{
  "transports": {
    "rdma": {
      "bandwidth_learning_rate": 0.01,
      "ewma_min_bandwidth_multiplier": 0.1,
      "ewma_max_bandwidth_multiplier": 10.0
    }
  }
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bandwidth_learning_rate` | float | `0.01` | EWMA coefficient α on the previous estimate (0.0 = full adaptation, 1.0 = no learning) |
| `ewma_min_bandwidth_multiplier` | float | `0.1` | Lower bound on the EWMA, as a multiple of theoretical bandwidth |
| `ewma_max_bandwidth_multiplier` | float | `10.0` | Upper bound on the EWMA, as a multiple of theoretical bandwidth |

Guidelines:

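Lower α (e.g., 0.001) → faster adaptation, more volatile → responds quickly to changes
Higher α → slower adaptation, more stable → smooths out transient fluctuations; because α weights the old value, even α = 0.1 still gives each new observation 90% of the weight, so strong smoothing requires α close to 1
Default α = 0.01 gives each new observation 99% of the weight, so the estimate tracks recent transfers closely
Multipliers constrain the EWMA to the range [0.1×, 10.0×] of theoretical bandwidth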
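
To make "faster" and "slower" quantitative, consider an idealized case (an assumption for illustration, ignoring the clamp): the estimate starts at e_0 and every subsequent transfer is observed at the same bandwidth b. Unrolling the update rule n times gives:

```latex
e_n = \alpha^{n} e_0 + \left(1 - \alpha^{n}\right) b
\qquad\Longrightarrow\qquad
\lvert e_n - b \rvert = \alpha^{n}\,\lvert e_0 - b \rvert
```

The remaining gap shrinks by a factor of α per completed transfer. With α = 0.01 a single completion closes 99% of the gap; with α = 0.1 one completion closes 90% and two close 99%. For comparison, an illustrative α = 0.9 (not a documented setting) would need n ≥ ln 0.1 / ln 0.9 ≈ 21.9, that is 22 completions, to close 90% of the gap.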

Device Selection Scoring#

{
  "transports": {
    "rdma": {
      "score_jitter_range": 1e-9,
      "score_epsilon": 1e-12
    }
  }
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `score_jitter_range` | float | `1e-9` | Random jitter range to avoid deterministic selection |
| `score_epsilon` | float | `1e-12` | Small value to prevent division by zero |
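
The text does not say exactly where these two values enter the score, so the following is only one plausible reading, offered as a sketch: epsilon is added to the bandwidth denominator, and a uniform random offset in [0, score_jitter_range] is added to the final score. The function name and argument names are hypothetical.

```python
import random


def jittered_score(inflight_bytes, slice_bytes, ewma_bandwidth, penalty, rng,
                   score_jitter_range=1e-9, score_epsilon=1e-12):
    # Assumed use of epsilon: keep the division defined even if the
    # bandwidth estimate is zero.
    predicted_time = (inflight_bytes + slice_bytes) / (ewma_bandwidth + score_epsilon)
    # Assumed use of jitter: a tiny random offset so that devices with
    # identical scores are not always picked in the same order.
    return predicted_time * penalty + rng.uniform(0.0, score_jitter_range)


rng = random.Random(0)  # seeded only to make the example repeatable
a = jittered_score(0, 65536, 50e9, 1.0, rng)
b = jittered_score(0, 65536, 50e9, 1.0, rng)   # same inputs, different jitter
stalled = jittered_score(0, 65536, 0.0, 1.0, rng)  # zero estimate, no exception
```

Under this reading the default jitter (1e-9) is about three orders of magnitude smaller than the predicted time of a 64 KiB slice on a 50 GB/s link (roughly 1.3e-6 s), so it only matters when scores are nearly tied.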

Bandwidth Constants#

{
  "transports": {
    "rdma": {
      "default_bandwidth_gbps": 400.0,
      "min_bandwidth_gbps": 10.0,
      "max_bandwidth_gbps": 800.0
    }
  }
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `default_bandwidth_gbps` | float | `400.0` | Default NIC bandwidth when topology info unavailable |
| `min_bandwidth_gbps` | float | `10.0` | Minimum valid NIC bandwidth (Gbps) |
| `max_bandwidth_gbps` | float | `800.0` | Maximum valid NIC bandwidth (Gbps) |

Notes:

These constants define the valid range and default for device bandwidth
Used in EWMA calculations and theoretical bandwidth estimation
If a device’s reported bandwidth is outside [min_bandwidth_gbps, max_bandwidth_gbps], default_bandwidth_gbps is used
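
The fallback rule in these notes can be sketched as a small helper. The function name is hypothetical, and two details are assumptions: the bounds are treated as inclusive, and a missing report (topology info unavailable) is represented as `None`.

```python
def effective_bandwidth_gbps(reported, default=400.0, minimum=10.0, maximum=800.0):
    """Return the bandwidth to use as the device's theoretical value, in Gbps."""
    # No topology information: fall back to the default.
    if reported is None:
        return default
    # Reported value outside the valid range: also fall back to the default.
    if not (minimum <= reported <= maximum):
        return default
    return reported


effective_bandwidth_gbps(200.0)    # 200.0, within range
effective_bandwidth_gbps(None)     # 400.0, nothing reported
effective_bandwidth_gbps(1600.0)   # 400.0, above max_bandwidth_gbps
```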

Usage Examples#

Example 1: Latency-Sensitive Workload#

For latency-sensitive queries where local NUMA access is critical:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true,
      "numa_penalties": [1.0, 100.0, 1000.0],
      "bandwidth_learning_rate": 0.001
    }
  }
}

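Effect: Strongly prefers local devices. With α = 0.001 the estimate gives 99.9% of the weight to the newest observation, so it adapts almost immediately; per the EWMA formula, a value closer to 1 is needed if stability matters more than responsiveness.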

Example 2: Bulk Data Transfer#

For bulk transfers where throughput is more important than latency:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true,
      "numa_penalties": [1.0, 2.0, 3.0],
      "bandwidth_learning_rate": 0.1
    }
  }
}

Effect: Allows cross-NUMA transfers, fast adaptation to load.

Example 3: Baseline Mode#

For deterministic performance matching original TE:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": false
    }
  }
}

Effect: Round-robin within local NUMA tier, no adaptation, minimal overhead.
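
Each example above sets only a few keys. The sketch below shows one way such a file could be resolved into a full parameter set, assuming that omitted keys fall back to the defaults listed in the Configuration tables. The function name `load_rdma_settings` is hypothetical, and the real loader may behave differently.

```python
import json

# Defaults copied from the Configuration tables.
DEFAULTS = {
    "enable_smart_scheduling": True,
    "numa_penalties": [1.0, 5.0, 10.0],
    "bandwidth_learning_rate": 0.01,
    "ewma_min_bandwidth_multiplier": 0.1,
    "ewma_max_bandwidth_multiplier": 10.0,
    "score_jitter_range": 1e-9,
    "score_epsilon": 1e-12,
    "default_bandwidth_gbps": 400.0,
    "min_bandwidth_gbps": 10.0,
    "max_bandwidth_gbps": 800.0,
}


def load_rdma_settings(config_text):
    """Merge the transports.rdma section of a JSON config over the defaults."""
    config = json.loads(config_text)
    overrides = config.get("transports", {}).get("rdma", {})
    return {**DEFAULTS, **overrides}


bulk = load_rdma_settings("""
{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true,
      "numa_penalties": [1.0, 2.0, 3.0],
      "bandwidth_learning_rate": 0.1
    }
  }
}
""")
```

Here `bulk["numa_penalties"]` is `[1.0, 2.0, 3.0]` from the file, while `bulk["score_epsilon"]` keeps its default of `1e-12`.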

Performance Considerations#

Overhead Comparison#

| Mode | CPU Overhead | Adaptability | NUMA Awareness |
|------|--------------|--------------|----------------|
| Baseline | Minimal | None | Tier-based (static) |
| Smart | Moderate | EWMA-based | Dynamic + penalty |

When to Use Each Mode#

Use Baseline Mode when:

Workload is uniform and predictable
Deterministic performance is required
CPU overhead must be minimized
All devices are in same NUMA node

Use Smart Mode when:

Workload is heterogeneous
Link quality varies over time
NUMA effects are significant
Maximum throughput is desired

Tuning Guidelines#

1. Start with baseline mode to establish a performance baseline
2. Enable smart mode with conservative parameters:
   - numa_penalties = [1.0, 2.0, 5.0]
   - bandwidth_learning_rate = 0.01
3. Monitor performance and adjust based on observations:
   - If cross-NUMA transfers are too frequent: increase remote penalties
   - If adaptation is too slow (EWMA not keeping up with load changes): decrease α
   - If performance is unstable (too much fluctuation): increase α

Troubleshooting#

Problem: All requests go to cross-NUMA devices#

Symptoms: Poor performance, high latency

Diagnosis:

device_selector_->printTrafficStats();

Solution: Check numa_penalties configuration. Ensure local devices have lowest penalty (1.0).

Problem: Performance worse than baseline#

Symptoms: Smart mode slower than baseline mode

Possible causes:

Learning rate too low (α near 0 lets every observation overwrite the estimate, making decisions volatile)
NUMA penalties too low (not preferring local)
Score jitter too large (too much randomness)

Solution: Use more conservative settings:

{
  "transports": {
    "rdma": {
      "bandwidth_learning_rate": 0.1,
      "numa_penalties": [1.0, 10.0, 100.0],
      "score_jitter_range": 1e-12
    }
  }
}