HOPS

Heterogeneous Optimized Pipeline Simulator

A Python-based discrete event simulator for modeling pipeline parallel training across configurable hardware topologies, communication latencies, failure modes, and scheduling strategies.

Jimmy Dai, Joshua Hsueh, Olaf Dsouza, Rishith Seelam
University of Michigan

Capabilities

What HOPS Does

Simulate and analyze pipeline-parallel training with fine-grained control over hardware, scheduling, and failure scenarios.

Discrete Event Simulation

Priority-queue-driven event engine processes timestamped events for accurate and deterministic simulation of pipeline parallelism.
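The core idea can be sketched in a few lines. This is a minimal illustration of a priority-queue event loop, not HOPS's actual `event_engine.py` API; the class and method names here are invented for the example:

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float
    seq: int                          # insertion counter breaks ties deterministically
    name: str = field(compare=False)  # payload, excluded from ordering

class EventEngine:
    """Minimal priority-queue event loop (illustrative, not the HOPS API)."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()
        self.now = 0.0

    def schedule(self, delay: float, name: str) -> None:
        """Enqueue an event `delay` time units in the future."""
        heapq.heappush(self._queue, Event(self.now + delay, next(self._counter), name))

    def run(self) -> list[tuple[float, str]]:
        """Pop events in timestamp order, advancing the clock as we go."""
        order = []
        while self._queue:
            ev = heapq.heappop(self._queue)
            self.now = ev.time
            order.append((ev.time, ev.name))
        return order

engine = EventEngine()
engine.schedule(5.0, "forward_mb0")
engine.schedule(1.0, "send_activation")
trace = engine.run()  # events come out in timestamp order
```

The tie-breaking counter is what makes the ordering deterministic when two events share a timestamp.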

Pluggable Schedulers

Ships with GPipe and 1F1B scheduling policies. Register custom schedulers via a simple plugin API to explore new strategies.

Hardware Topology Modeling

Define GPU/CPU devices, inter-device links with bandwidth constraints, and per-link jitter to model realistic cluster topologies.
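As an illustration of how bandwidth and base latency combine, transfer time over a link can be modeled as serialization delay plus a fixed latency. The function below is a hypothetical sketch using the units from the YAML examples, not the actual `network.py` implementation:

```python
def transfer_time_us(size_mb: float, bandwidth_gbps: float, base_latency_us: float) -> float:
    """Time to push `size_mb` across a link, in microseconds:
    payload size / bandwidth, plus a fixed per-message base latency."""
    bits = size_mb * 8e6                     # MB -> bits (decimal megabytes)
    seconds = bits / (bandwidth_gbps * 1e9)  # bandwidth is bits per second
    return seconds * 1e6 + base_latency_us

# e.g. a 50 MB activation over a 900 Gbps link with 1 us base latency
t = transfer_time_us(50, 900, 1.0)
```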

Stochastic Latency Models

Choose from constant, normal, heavy-tailed (Pareto), and Poisson distributions for compute and communication latencies.

Failure Injection

Chaos Monkey-style fault injection for devices and links with configurable failure probabilities and automatic recovery.
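One simple way to realize such periodic checks (a sketch under assumed semantics, not the `failure/engine.py` internals): at each interval, every healthy device fails independently with the configured probability, then recovers after a fixed delay.

```python
import numpy as np

def run_failure_checks(rng: np.random.Generator, num_devices: int,
                       fail_prob: float, recovery_time: float,
                       check_interval: float, horizon: float):
    """Illustrative Chaos-Monkey loop: at every check interval, each
    *healthy* device fails with `fail_prob` and is down for `recovery_time`."""
    down_until = [0.0] * num_devices  # time at which each device is healthy again
    failures = []                     # recorded (time, device) fault events
    t = check_interval
    while t <= horizon:
        for d in range(num_devices):
            if t >= down_until[d] and rng.random() < fail_prob:
                down_until[d] = t + recovery_time
                failures.append((t, d))
        t += check_interval
    return failures

rng = np.random.default_rng(0)
# with fail_prob=1.0 every healthy device fails at each check
events = run_failure_checks(rng, 2, 1.0, 15.0, 10.0, 30.0)
```

Note that a device still recovering (its `down_until` is in the future) is skipped, so failures cannot stack on an already-down device.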

Rich Metrics & Visualization

Gantt timeline charts, utilization dashboards, throughput, latency percentiles, bubble ratios, and communication overhead reports.

Design

Architecture

A modular design where all components plug into a central discrete event loop.

Hardware: device.py / network.py
Event Engine: event_engine.py
Pipeline: pipeline.py
Scheduler: scheduler.py
Metrics: collector.py / reporter.py
Viz: timeline.py / dashboard.py

Deterministic by Design

An explicit np.random.Generator seeded from the YAML config is threaded through all stochastic components, ensuring reproducible results.
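The pattern looks roughly like this (the `make_rng` helper and config shape are illustrative; HOPS's actual factory may differ):

```python
import numpy as np

def make_rng(config: dict) -> np.random.Generator:
    """Build the single Generator handed to every stochastic component
    (hypothetical helper; config shape mirrors the YAML examples)."""
    return np.random.default_rng(config["simulation"]["seed"])

cfg = {"simulation": {"seed": 42}}

# two independent runs from the same seed draw identical latency samples
run_a = make_rng(cfg).normal(5.0, 0.5, size=4)
run_b = make_rng(cfg).normal(5.0, 0.5, size=4)
```

Passing one `Generator` instance around, rather than seeding the global NumPy state, keeps components reproducible even when they sample in different orders across refactors.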

Plugin Registry

Schedulers are registered via register_scheduler(), making it trivial to add and benchmark new scheduling policies without modifying core code.
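A registry like this is commonly just a name-to-class mapping; the sketch below shows one plausible shape (the `make_scheduler` lookup helper is hypothetical, not part of the documented API):

```python
_SCHEDULERS: dict[str, type] = {}

def register_scheduler(name: str, cls: type) -> None:
    """Map a policy name (as referenced by `scheduler.policy` in YAML)
    to a scheduler class."""
    _SCHEDULERS[name] = cls

def make_scheduler(name: str):
    """Instantiate the scheduler registered under `name` (hypothetical helper)."""
    try:
        return _SCHEDULERS[name]()
    except KeyError:
        raise ValueError(f"unknown scheduler policy: {name!r}") from None

class Gpipe:          # stand-in for a real Scheduler subclass
    pass

register_scheduler("gpipe", Gpipe)
sched = make_scheduler("gpipe")
```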

Project Structure

src/hops/
  core/
    event_engine.py # Discrete event simulation loop
    pipeline.py # Pipeline stages & dataflow
    scheduler.py # Scheduling policies & plugin registry
    timing.py # Failure-aware timing
    types.py # Core enums & dataclasses
  hardware/
    device.py # GPU/CPU abstractions
    network.py # Communication links
    topology.py # Device graph
  latency/
    distributions.py # Configurable distributions
    compute_model.py # Per-stage compute time
  failure/
    engine.py # Chaos Monkey failure injection
  metrics/
    collector.py # Runtime statistics
    reporter.py # Summary reports
  viz/
    timeline.py # Gantt-style timeline
    dashboard.py # 4-panel summary dashboard

Configuration

YAML-Driven Experiments

Define every aspect of your simulation through declarative YAML configs.

pipeline & simulation
# Simulation parameters
simulation:
  num_microbatches: 8
  num_batches: 4
  seed: 42

# Pipeline stages
pipeline:
  stages:
    - id: 0
      device: node0_gpu0
      compute_latency:
        type: normal
        mean: 5.0
        std: 0.5
  backward_factor: 2.0

# Scheduler policy
scheduler:
  policy: 1f1b

hardware & failure
# Hardware topology
hardware:
  devices:
    - id: node0_gpu0
      kind: gpu
      memory_mb: 81920
  links:
    - src: node0_gpu0
      dst: node0_gpu1
      bandwidth_gbps: 900
      base_latency_us: 1.0
  activation_size_mb: 50

# Failure injection
failure:
  enabled: false
  check_interval: 10.0
  device_fail_prob: 0.001
  link_fail_prob: 0.0005
  recovery_time: 5.0

Latency Distributions

Distribution   Description                         Config Example
constant       Deterministic fixed value           { type: constant, value: 5.0 }
normal         Gaussian, clamped to non-negative   { type: normal, mean: 5.0, std: 0.5 }
heavy_tailed   Pareto, for modeling stragglers     { type: heavy_tailed, base: 5.0, alpha: 2.5 }
poisson        Discrete count distribution         { type: poisson, lam: 5.0 }
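A sampler over these config shapes can be sketched as a small dispatch function. This is an illustration of the semantics in the table, not the actual `distributions.py` code:

```python
import numpy as np

def sample_latency(rng: np.random.Generator, cfg: dict) -> float:
    """Draw one latency from a config dict shaped like the table above
    (a sketch; the real distributions.py may differ in detail)."""
    kind = cfg["type"]
    if kind == "constant":
        return cfg["value"]
    if kind == "normal":
        # Gaussian, clamped to non-negative
        return max(0.0, rng.normal(cfg["mean"], cfg["std"]))
    if kind == "heavy_tailed":
        # classical Pareto with scale `base` and shape `alpha`
        # (numpy's pareto() is Lomax, hence the 1 + shift)
        return cfg["base"] * (1.0 + rng.pareto(cfg["alpha"]))
    if kind == "poisson":
        return float(rng.poisson(cfg["lam"]))
    raise ValueError(f"unknown distribution type: {kind!r}")

rng = np.random.default_rng(1)
x = sample_latency(rng, {"type": "heavy_tailed", "base": 5.0, "alpha": 2.5})
```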

Analysis

Metrics & Visualization

HOPS provides comprehensive performance analysis with built-in visualizations.

Throughput

Micro-batches per ms

End-to-End Latency

p50, p99, mean per micro-batch

Bubble Ratio

Idle device-time fraction

Per-Stage Utilization

Compute / total time per stage

Communication Overhead

Transfer time vs compute

Optimizer Step

All-reduce & weight update time

Failure Impact

Fault count & total downtime
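Of the metrics above, the bubble ratio is the simplest to state: the fraction of total device-time spent idle. A minimal sketch (function name and inputs are illustrative, not the `collector.py` API):

```python
def bubble_ratio(busy_time_per_device: list[float], makespan: float) -> float:
    """Idle fraction of total device-time:
    1 - sum(busy) / (num_devices * makespan)."""
    total = makespan * len(busy_time_per_device)
    return 1.0 - sum(busy_time_per_device) / total

# two devices busy 8 and 6 time units out of a 10-unit run
ratio = bubble_ratio([8.0, 6.0], 10.0)
```

With 14 busy units out of 20 device-time units, 30% of device-time is bubble.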

Timeline Visualization

Gantt chart (output/timeline.png) showing per-device compute tasks for forward and backward passes, annotated with failure markers.

Summary Dashboard

4-panel dashboard (output/dashboard.png) with per-stage utilization, latency histogram, bubble ratio, and compute vs. communication breakdown.

Extensibility

Custom Scheduling Policies

HOPS ships with GPipe and 1F1B. Add your own with the plugin API.

custom_scheduler.py
from hops import register_scheduler
from hops.core.scheduler import Scheduler, PipelineState
from hops.core.types import StageTask, Phase

class MyScheduler(Scheduler):
    def next_tasks(self, state: PipelineState) -> list[StageTask]:
        """Return the tasks to launch next, given the current pipeline state."""
        # Useful PipelineState queries:
        # state.is_task_ready(stage, mb, phase)
        # state.completed_count(stage, phase)
        # state.stage_is_busy(stage)
        # state.stage_in_flight_count(stage)
        # state.all_forwards_completed()
        ...

register_scheduler("my_policy", MyScheduler)

Then set policy: my_policy in your YAML config.

Getting Started

Setup & Usage

Get up and running in under a minute.

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone & install

Requires Python 3.13+

git clone <repo-url> && cd HOPS
uv sync

3. Run a simulation

# Default config
uv run python main.py

# Custom config
uv run python main.py --config my_config.yaml

# Skip visualization
uv run python main.py --no-viz

4. Run tests

uv run pytest              # all 48 tests
uv run pytest tests/test_pipeline.py  # specific file
uv run pytest -v           # verbose output