TENT Slice Spraying#

Overview#

This document describes TENT’s Slice Spraying mechanism, which enables efficient data movement in multi-rail RDMA environments through intelligent device selection and adaptive load balancing.

Background#

In multi-rail RDMA environments, naive round-robin striping leads to suboptimal performance because:

  1. NUMA Effects: Cross-NUMA access incurs additional latency and reduces effective bandwidth

  2. Load Imbalance: Static striping cannot adapt to dynamic load conditions

  3. Heterogeneous Link Quality: Different rails may have different effective bandwidth due to congestion or hardware characteristics

TENT addresses these issues through:

  • NUMA-aware device selection with configurable penalties

  • EWMA-based bandwidth estimation for adaptive load balancing

  • Dynamic multi-path allocation for large transfers

Architecture#

Device Selector#

The DeviceSelector component is responsible for choosing which RDMA device(s) to use for each transfer request. It operates in two modes:

Baseline Mode (Round-Robin)#

When enable_smart_scheduling = false, the selector uses simple round-robin within the highest-priority device tier (typically local NUMA devices):

For each request:
  1. Find first non-empty device tier (local NUMA preferred)
  2. Select devices round-robin within that tier
  3. Ignore lower-priority tiers
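
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than TENT's implementation: the `BaselineSelector` class, its `allocate` method, and the device names are hypothetical, and the tier list is assumed to be ordered from local NUMA outward.

```python
class BaselineSelector:
    """Round-robin over the first non-empty tier; lower tiers are ignored."""

    def __init__(self, tiers):
        # tiers[0] holds local-NUMA devices, tiers[1] the next-closest, and so on.
        self.tiers = tiers
        self._next = 0  # running position within the active tier

    def allocate(self, num_slices):
        # Step 1: find the first non-empty tier.
        active = next((tier for tier in self.tiers if tier), None)
        if active is None:
            raise RuntimeError("no RDMA devices available")
        # Steps 2 and 3: rotate within that tier only.
        picks = []
        for _ in range(num_slices):
            picks.append(active[self._next % len(active)])
            self._next += 1
        return picks


# Hypothetical device names: two local NICs and one remote NIC.
selector = BaselineSelector([["mlx5_0", "mlx5_1"], ["mlx5_2"]])
first = selector.allocate(3)
second = selector.allocate(2)
```

The rotation position carries over between calls, so consecutive requests continue where the previous one stopped, and the remote device in the second tier is never chosen while the first tier is non-empty.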

Characteristics:

  • Deterministic behavior

  • No runtime overhead for tracking

  • Consistent with the original Transfer Engine (TE) behavior

  • Does not adapt to load conditions

Smart Mode (EWMA-Based Selection)#

When enable_smart_scheduling = true, the selector uses an EWMA-based algorithm:

For each request:
  1. Calculate predicted completion time for each device:
     predicted_time = (inflight_bytes + slice_bytes) / ewma_bandwidth

  2. Apply NUMA penalty based on tier:
     score = predicted_time × numa_penalties[tier]

  3. Select device(s) with minimum score:
     - Single slice: best device only
     - Multiple slices: weighted distribution across devices

  4. Update EWMA bandwidth on completion:
     ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
     where α = bandwidth_learning_rate

Characteristics:

  • Adapts to changing load conditions

  • Prefers local NUMA devices

  • Spreads load across multiple rails

  • Higher runtime overhead

NUMA-Aware Selection#

Devices are organized into tiers based on NUMA distance:

| Tier | Description | Default Penalty |
| --- | --- | --- |
| Rank 0 | Local NUMA | 1.0 (baseline) |
| Rank 1 | Remote NUMA (tier 1) | 5.0 |
| Rank 2 | Remote NUMA (tier 2) | 10.0 |

The penalty is applied as a multiplier to predicted completion time, making remote devices less attractive unless local devices are heavily loaded.
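
As a worked illustration of the multiplier, assume a simplified case that is not taken from the implementation: a local and a remote device have the same estimated bandwidth B, the remote device is idle, the local device has Q bytes in flight, the slice size is s, and the remote tier's penalty is p.

```latex
\begin{aligned}
\text{score}_{\text{local}} &= \frac{Q + s}{B} \times 1.0, \qquad
\text{score}_{\text{remote}} = \frac{s}{B} \times p \\
\text{score}_{\text{remote}} < \text{score}_{\text{local}}
&\iff p\,s < Q + s \iff Q > (p - 1)\,s
\end{aligned}
```

Under these assumptions, the default penalty of 5.0 sends a slice to a tier-1 remote device only once the local device has more than four slices' worth of bytes in flight, and the default penalty of 10.0 requires more than nine.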

EWMA Bandwidth Estimation#

Each device maintains an EWMA (Exponentially Weighted Moving Average) of its effective bandwidth:

initial_value = theoretical_bandwidth

on_transfer_complete:
  observed_bandwidth = transfer_size / transfer_time
  ewma_bandwidth = α × ewma_bandwidth + (1 - α) × observed_bandwidth
  ewma_bandwidth = clamp(ewma_bandwidth,
                        0.1 × theoretical,
                        10.0 × theoretical)

where α = bandwidth_learning_rate.

Note on terminology: The EWMA formula uses α as the coefficient for the old value. Therefore:

  • Lower α (closer to 0) → more weight on new observations → faster adaptation

  • Higher α (closer to 1) → more weight on old value → slower adaptation

Examples:

  • α = 0: ewma_bandwidth = observed_bandwidth (full adaptation, always use new value)

  • α = 1: ewma_bandwidth = ewma_bandwidth (no learning, never update)

  • α = 0.01: ewma_bandwidth = 0.01 × old + 0.99 × new (default; new observations dominate, with light smoothing)

The EWMA provides:

  • Memory: Recent observations have more influence than old ones

  • Stability: Smooths out transient fluctuations

  • Adaptability: Tracks gradual changes in link quality
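
The update rule and clamp from this section can be written as a small function. This is a sketch under stated assumptions: the `update_ewma` name and its signature are hypothetical, the defaults mirror the documented configuration defaults, and bandwidth is expressed in GB/s purely for readability.

```python
def update_ewma(ewma, observed, theoretical,
                alpha=0.01, min_mult=0.1, max_mult=10.0):
    """Blend the old estimate with a new observation, then clamp the result."""
    # alpha is the weight of the OLD value, as in the formula above.
    blended = alpha * ewma + (1.0 - alpha) * observed
    low, high = min_mult * theoretical, max_mult * theoretical
    return min(max(blended, low), high)


theoretical = 50.0  # GB/s, the line rate of a 400 Gbps NIC

full = update_ewma(50.0, 30.0, theoretical, alpha=0.0)     # takes the new value
frozen = update_ewma(50.0, 30.0, theoretical, alpha=1.0)   # keeps the old value
default = update_ewma(50.0, 30.0, theoretical)             # 0.01 * old + 0.99 * new
floored = update_ewma(50.0, 1.0, theoretical, alpha=0.0)   # clamped to 0.1 * theoretical
```

The last call shows the clamp at work: a single very slow transfer cannot push the estimate below one tenth of the theoretical bandwidth.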

Multi-Path Allocation#

For large transfers, TENT distributes slices across multiple devices:

Single Path (small requests):

  • All slices go to the single best device

  • Minimizes coordination overhead

Multi Path (large requests):

  • Normal mode (99% of calls): Slices distributed proportionally to device capacity

    • Each device gets: (device_weight / total_weight) × num_slices

    • Remaining slices assigned to best device

  • Probe mode (1% of calls, every 100th call): Slices distributed round-robin

    • Purpose: Ensure all devices are continuously sampled for EWMA updates

    • Prevents EWMA starvation for less-used devices
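
The two multi-path modes can be sketched as follows. The text does not define how `device_weight` is computed, so this sketch takes the weights as an input; the `distribute` function, the `MultiPathAllocator` class, and the device names are hypothetical.

```python
def distribute(weights, num_slices):
    """Normal mode: proportional shares, leftover slices to the best device."""
    total = sum(weights.values())
    # Each device gets floor((device_weight / total_weight) * num_slices).
    counts = {dev: int(w / total * num_slices) for dev, w in weights.items()}
    best = max(weights, key=weights.get)
    counts[best] += num_slices - sum(counts.values())
    return counts


class MultiPathAllocator:
    PROBE_INTERVAL = 100  # every 100th call is a probe

    def __init__(self):
        self.calls = 0

    def allocate(self, weights, num_slices):
        self.calls += 1
        if self.calls % self.PROBE_INTERVAL == 0:
            # Probe mode: plain round-robin so every device gets sampled.
            devices = list(weights)
            counts = {dev: 0 for dev in devices}
            for i in range(num_slices):
                counts[devices[i % len(devices)]] += 1
            return counts
        return distribute(weights, num_slices)


weights = {"nic0": 3.0, "nic1": 2.0, "nic2": 1.0}
normal = distribute(weights, 16)
```

With these weights, 16 slices split into proportional shares of 8, 5, and 2, and the one leftover slice goes to `nic0`, the device with the largest weight.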

Request Flow#

┌──────────────┐
│ Application  │
└──────┬───────┘
       │ submitTransfer()
       ▼
┌──────────────────────────────────────┐
│  RdmaTransport::submitTransferTasks  │
│  - Split large requests into slices  │
│  - Call DeviceSelector for allocation│
│  - Only if num_slices >=             │
│      max_slice_count/2               │
└──────┬───────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│     DeviceSelector::allocate         │
│  ┌────────────────────────────────┐  │
│  │ smart_selection_enabled?       │  │
│  └────┬──────────────────────┬────┘  │
│       │ Yes                  │ No     │
│       ▼                     ▼        │
│  ┌─────────┐          ┌─────────┐   │
│  │  Smart  │          │ Baseline│   │
│  │  Mode   │          │   Mode  │   │
│  └────┬────┘          └────┬────┘   │
│       │                    │         │
│       └────────┬───────────┘         │
│                ▼                     │
│  ┌────────────────────────────────┐  │
│  │  Return slice_dev_ids          │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

Configuration#

All slice spraying parameters are configurable via the configuration file:

Core Scheduling#

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true
    }
  }
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_smart_scheduling | bool | true | Enable EWMA-based selection (false = round-robin) |

NUMA Penalties#

{
  "transports": {
    "rdma": {
      "numa_penalties": [1.0, 5.0, 10.0]
    }
  }
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| numa_penalties | array[float] | [1.0, 5.0, 10.0] | Penalty multipliers for each NUMA tier |

Guidelines:

  • Higher values = stronger preference for local devices

  • Set all to 1.0 to disable NUMA awareness

  • Increase remote penalties if cross-NUMA latency is high

Bandwidth Estimation#

{
  "transports": {
    "rdma": {
      "bandwidth_learning_rate": 0.01,
      "ewma_min_bandwidth_multiplier": 0.1,
      "ewma_max_bandwidth_multiplier": 10.0
    }
  }
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bandwidth_learning_rate | float | 0.01 | EWMA learning rate α (0.0 = full adaptation, 1.0 = no learning) |
| ewma_min_bandwidth_multiplier | float | 0.1 | Minimum EWMA bandwidth as a multiple of theoretical bandwidth |
| ewma_max_bandwidth_multiplier | float | 10.0 | Maximum EWMA bandwidth as a multiple of theoretical bandwidth |

Guidelines:

  • Lower α (e.g., 0.001) → faster adaptation, more volatile → responds quickly to changes

  • Higher α (e.g., 0.1) → slower adaptation, more stable → smooths out transient fluctuations

  • Default α = 0.01 gives each new observation 99% of the weight, with light smoothing from the previous estimate

  • Multipliers constrain EWMA to reasonable range [0.1×, 10.0×] of theoretical bandwidth
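
To make the effect of α concrete, the sketch below counts how many completed transfers the update rule needs before the estimate settles after a device's bandwidth suddenly halves. The `updates_to_converge` helper, the 0.5% tolerance, and the idealized assumption that every transfer observes exactly the new bandwidth are illustrative choices, not part of TENT.

```python
def updates_to_converge(alpha, start, target, tolerance=0.005):
    """Count EWMA updates until the estimate is within tolerance of target."""
    ewma, steps = start, 0
    while abs(ewma - target) > tolerance * target:
        # Same rule as in the EWMA section: alpha weights the old value.
        ewma = alpha * ewma + (1.0 - alpha) * target
        steps += 1
    return steps


# Bandwidth drops from 50 GB/s to 25 GB/s; compare several values of alpha.
steps = {
    alpha: updates_to_converge(alpha, start=50.0, target=25.0)
    for alpha in (0.001, 0.01, 0.1, 0.9)
}
```

Under these assumptions the three values mentioned in the guidelines settle within one, two, and three updates respectively. The value 0.9 is included only for contrast and takes roughly fifty updates.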

Device Selection Scoring#

{
  "transports": {
    "rdma": {
      "score_jitter_range": 1e-9,
      "score_epsilon": 1e-12
    }
  }
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_jitter_range | float | 1e-9 | Random jitter range to avoid deterministic selection |
| score_epsilon | float | 1e-12 | Small value to prevent division by zero |
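
The sketch below shows one plausible way these two parameters could enter the score. How TENT actually applies them is not described here, so both choices are assumptions: epsilon is added to the bandwidth denominator, and jitter is drawn uniformly from [0, score_jitter_range] and added to the score. The `jittered_score` and `count_wins` helpers are hypothetical.

```python
import random


def jittered_score(inflight_bytes, slice_bytes, ewma_bandwidth, penalty,
                   rng, jitter_range=1e-9, epsilon=1e-12):
    # Assumed use of epsilon: keep the denominator strictly positive.
    predicted_time = (inflight_bytes + slice_bytes) / (ewma_bandwidth + epsilon)
    # Assumed use of jitter: a small random offset that breaks exact ties.
    return predicted_time * penalty + rng.uniform(0.0, jitter_range)


def count_wins(trials, seed):
    """Two identical idle devices compete; count how often each one wins."""
    rng = random.Random(seed)
    wins = [0, 0]
    for _ in range(trials):
        scores = [jittered_score(0, 64 * 1024, 50e9, 1.0, rng) for _ in range(2)]
        wins[scores.index(min(scores))] += 1
    return wins


wins = count_wins(trials=1000, seed=0)
```

Without jitter, two devices with identical scores would always resolve to the same one. In this sketch the default range of 1e-9 is small next to the roughly 1.3e-6 second predicted time of a 64 KiB slice at 50 GB/s, so it splits ties without reordering devices whose scores differ meaningfully.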

Bandwidth Constants#

{
  "transports": {
    "rdma": {
      "default_bandwidth_gbps": 400.0,
      "min_bandwidth_gbps": 10.0,
      "max_bandwidth_gbps": 800.0
    }
  }
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| default_bandwidth_gbps | float | 400.0 | Default NIC bandwidth (Gbps) when topology info is unavailable |
| min_bandwidth_gbps | float | 10.0 | Minimum valid NIC bandwidth (Gbps) |
| max_bandwidth_gbps | float | 800.0 | Maximum valid NIC bandwidth (Gbps) |

Notes:

  • These constants define the valid range and default for device bandwidth

  • Used in EWMA calculations and theoretical bandwidth estimation

  • If a device's reported bandwidth is outside [min_bandwidth_gbps, max_bandwidth_gbps], default_bandwidth_gbps is used
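
The fallback rule in these notes amounts to a range check. A minimal sketch follows, assuming a hypothetical `effective_bandwidth_gbps` helper and using `None` to represent missing topology information:

```python
def effective_bandwidth_gbps(reported, default=400.0, minimum=10.0, maximum=800.0):
    """Return the reported NIC bandwidth if it is plausible, else the default."""
    # None stands in for "topology info unavailable" in this sketch.
    if reported is None or not (minimum <= reported <= maximum):
        return default
    return reported


accepted = effective_bandwidth_gbps(200.0)
too_low = effective_bandwidth_gbps(0.0)
too_high = effective_bandwidth_gbps(1600.0)
missing = effective_bandwidth_gbps(None)
```

A reported 200 Gbps is kept as-is, while 0, 1600, and a missing value all fall back to the 400 Gbps default.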

Usage Examples#

Example 1: Latency-Sensitive Workload#

For latency-sensitive queries where local NUMA access is critical:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true,
      "numa_penalties": [1.0, 100.0, 1000.0],
      "bandwidth_learning_rate": 0.1
    }
  }
}

Effect: Strongly prefers local devices; the higher α keeps more of the previous estimate, so adaptation is slower and more stable.

Example 2: Bulk Data Transfer#

For bulk transfers where throughput is more important than latency:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": true,
      "numa_penalties": [1.0, 2.0, 3.0],
      "bandwidth_learning_rate": 0.001
    }
  }
}

Effect: Allows cross-NUMA transfers; the lower α weights new observations almost fully, so adaptation to load is fast.

Example 3: Baseline Mode#

For deterministic performance matching original TE:

{
  "transports": {
    "rdma": {
      "enable_smart_scheduling": false
    }
  }
}

Effect: Round-robin within local NUMA tier, no adaptation, minimal overhead.
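
Each example above sets only a few keys and relies on the documented defaults for the rest. The sketch below shows how such a partial configuration could be resolved against those defaults. The `resolve_rdma_settings` helper and its validation rules are hypothetical and do not describe TENT's actual configuration loader.

```python
import json

# Defaults as listed in the Configuration section.
DEFAULTS = {
    "enable_smart_scheduling": True,
    "numa_penalties": [1.0, 5.0, 10.0],
    "bandwidth_learning_rate": 0.01,
    "ewma_min_bandwidth_multiplier": 0.1,
    "ewma_max_bandwidth_multiplier": 10.0,
    "score_jitter_range": 1e-9,
    "score_epsilon": 1e-12,
    "default_bandwidth_gbps": 400.0,
    "min_bandwidth_gbps": 10.0,
    "max_bandwidth_gbps": 800.0,
}


def resolve_rdma_settings(config_text):
    """Merge the transports.rdma section of a JSON config over the defaults."""
    user = json.loads(config_text).get("transports", {}).get("rdma", {})
    settings = {**DEFAULTS, **user}
    # Illustrative sanity checks only.
    if not 0.0 <= settings["bandwidth_learning_rate"] <= 1.0:
        raise ValueError("bandwidth_learning_rate must be within [0, 1]")
    if any(p <= 0 for p in settings["numa_penalties"]):
        raise ValueError("numa_penalties must be positive")
    return settings


baseline = resolve_rdma_settings(
    '{"transports": {"rdma": {"enable_smart_scheduling": false}}}'
)
```

Here `baseline` has smart scheduling disabled and every other key at its default value.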

Performance Considerations#

Overhead Comparison#

| Mode | CPU Overhead | Adaptability | NUMA Awareness |
| --- | --- | --- | --- |
| Baseline | Minimal | None | Tier-based (static) |
| Smart | Moderate | EWMA-based | Dynamic + penalty |

When to Use Each Mode#

Use Baseline Mode when:

  • Workload is uniform and predictable

  • Deterministic performance is required

  • CPU overhead must be minimized

  • All devices are in same NUMA node

Use Smart Mode when:

  • Workload is heterogeneous

  • Link quality varies over time

  • NUMA effects are significant

  • Maximum throughput is desired

Tuning Guidelines#

  1. Start with baseline mode to establish a performance baseline

  2. Enable smart mode with conservative parameters:

    • numa_penalties = [1.0, 2.0, 5.0]

    • bandwidth_learning_rate = 0.01

  3. Monitor performance and adjust based on observations:

    • If cross-NUMA transfers are too frequent: increase remote penalties

    • If adaptation is too slow (EWMA not keeping up with load changes): decrease α

    • If performance is unstable (too much fluctuation): increase α

Troubleshooting#

Problem: All requests go to cross-NUMA devices#

Symptoms: Poor performance, high latency

Diagnosis:

device_selector_->printTrafficStats();

Solution: Check numa_penalties configuration. Ensure local devices have lowest penalty (1.0).

Problem: Performance worse than baseline#

Symptoms: Smart mode slower than baseline mode

Possible causes:

  1. bandwidth_learning_rate (α) too low (estimates follow every observation, so decisions are volatile)

  2. NUMA penalties too low (not preferring local)

  3. Score jitter too large (too much randomness)

Solution: Use more conservative settings:

{
  "transports": {
    "rdma": {
      "bandwidth_learning_rate": 0.1,
      "numa_penalties": [1.0, 10.0, 100.0],
      "score_jitter_range": 1e-12
    }
  }
}
