# Heterogeneous Optimized Pipeline Simulator (HOPS)

A Python-based discrete event simulator for modeling pipeline-parallel training across configurable hardware topologies, communication latencies, failure modes, and scheduling strategies.
## Capabilities

Simulate and analyze pipeline-parallel training with fine-grained control over hardware, scheduling, and failure scenarios.

- **Event-driven core:** a priority-queue-driven event engine processes timestamped events for accurate, deterministic simulation of pipeline parallelism.
- **Pluggable schedulers:** ships with GPipe and 1F1B scheduling policies; register custom schedulers via a simple plugin API to explore new strategies.
- **Hardware topology modeling:** define GPU/CPU devices, inter-device links with bandwidth constraints, and per-link jitter to model realistic cluster topologies.
- **Configurable latency distributions:** choose from constant, normal, heavy-tailed (Pareto), and Poisson distributions for compute and communication latencies.
- **Fault injection:** Chaos Monkey-style failures for devices and links with configurable failure probabilities and automatic recovery.
- **Rich reporting:** Gantt timeline charts, utilization dashboards, throughput, latency percentiles, bubble ratios, and communication overhead reports.
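The link model can be sketched as a simple cost function: serialization delay (size over bandwidth) plus a fixed base latency, with jitter sampled on top. This is an illustrative model, not necessarily the exact formula HOPS uses:

```python
def transfer_time_ms(size_mb: float, bandwidth_gbps: float,
                     base_latency_us: float) -> float:
    """Time to push size_mb across a link, in milliseconds.

    Illustrative cost model: serialization delay (size / bandwidth)
    plus a fixed base latency; per-link jitter would be added on top.
    """
    # size_mb * 8 -> megabits; 1 Gbit/s == 1 Mbit/ms, so dividing
    # megabits by Gbps yields milliseconds directly.
    serialization_ms = size_mb * 8.0 / bandwidth_gbps
    return serialization_ms + base_latency_us / 1000.0

# 50 MB of activations over a 900 Gbps link with 1 us base latency:
cost = transfer_time_ms(50, 900, 1.0)   # ~0.445 ms
```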
## Design

HOPS uses a modular design in which every component plugs into a central discrete event loop.

- An explicit `np.random.Generator`, seeded from the YAML config, is threaded through all stochastic components, ensuring reproducible results.
- Schedulers are registered via `register_scheduler()`, making it trivial to add and benchmark new scheduling policies without modifying core code.
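The core loop can be sketched with Python's `heapq`; the names here are illustrative, not the actual HOPS internals. A monotonically increasing sequence number breaks timestamp ties, which is what makes event ordering deterministic:

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float
    seq: int                                  # tie-breaker: deterministic order
    action: object = field(compare=False)     # callable, excluded from ordering

class EventLoop:
    """Minimal priority-queue discrete event loop (illustrative sketch)."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()
        self.now = 0.0

    def schedule(self, delay, action):
        heapq.heappush(self._queue,
                       Event(self.now + delay, next(self._seq), action))

    def run(self):
        while self._queue:
            ev = heapq.heappop(self._queue)
            self.now = ev.time                # advance simulated clock
            ev.action(self)

# Usage: two scheduled events fire in timestamp order, not insertion order.
log = []
loop = EventLoop()
loop.schedule(5.0, lambda lp: log.append(("fwd_done", lp.now)))
loop.schedule(1.0, lambda lp: log.append(("start", lp.now)))
loop.run()
# log == [("start", 1.0), ("fwd_done", 5.0)]
```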
## Configuration

Define every aspect of your simulation through declarative YAML configs.
```yaml
# Simulation parameters
simulation:
  num_microbatches: 8
  num_batches: 4
  seed: 42

# Pipeline stages
pipeline:
  stages:
    - id: 0
      device: node0_gpu0
      compute_latency:
        type: normal
        mean: 5.0
        std: 0.5
      backward_factor: 2.0

# Scheduler policy
scheduler:
  policy: 1f1b
```
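The `seed: 42` above is enough to replay a run exactly. A minimal sketch of how a config-seeded generator behaves (`make_rng` is a hypothetical helper, not part of the HOPS API):

```python
import numpy as np

def make_rng(config: dict) -> np.random.Generator:
    # One explicit Generator per run, seeded from the YAML config,
    # is handed to every stochastic component.
    return np.random.default_rng(config["simulation"]["seed"])

cfg = {"simulation": {"seed": 42}}
draw_a = make_rng(cfg).normal(5.0, 0.5)
draw_b = make_rng(cfg).normal(5.0, 0.5)   # same seed, identical draw
```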
```yaml
# Hardware topology
hardware:
  devices:
    - id: node0_gpu0
      kind: gpu
      memory_mb: 81920
  links:
    - src: node0_gpu0
      dst: node0_gpu1
      bandwidth_gbps: 900
      base_latency_us: 1.0
      activation_size_mb: 50

# Failure injection
failure:
  enabled: false
  check_interval: 10.0
  device_fail_prob: 0.001
  link_fail_prob: 0.0005
  recovery_time: 5.0
```
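One plausible reading of the failure parameters is a Bernoulli trial per device every `check_interval`, with a recovery event scheduled `recovery_time` later. A hedged sketch (the real HOPS failure model may differ; `check_failures` is hypothetical):

```python
import numpy as np

def check_failures(rng, devices, fail_prob, recovery_time, now):
    """One failure-check pass: flip a biased coin per healthy device."""
    events = []
    for dev in devices:
        if dev["up"] and rng.random() < fail_prob:
            dev["up"] = False                                 # mark device down
            events.append((dev["id"], "fail", now))
            # Schedule automatic recovery for the event loop to process.
            events.append((dev["id"], "recover", now + recovery_time))
    return events

# With fail_prob=1.0 the device always fails on the first check:
rng = np.random.default_rng(42)
devices = [{"id": "node0_gpu0", "up": True}]
events = check_failures(rng, devices, fail_prob=1.0, recovery_time=5.0, now=10.0)
```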
| Distribution | Description | Config Example |
|---|---|---|
| `constant` | Deterministic fixed value | `{ type: constant, value: 5.0 }` |
| `normal` | Gaussian, clamped to non-negative | `{ type: normal, mean: 5.0, std: 0.5 }` |
| `heavy_tailed` | Pareto, for modeling stragglers | `{ type: heavy_tailed, base: 5.0, alpha: 2.5 }` |
| `poisson` | Discrete count distribution | `{ type: poisson, lam: 5.0 }` |
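One way these four distributions might be sampled against an explicit generator; a sketch where `sample_latency` is hypothetical, with parameter names mirroring the config examples above:

```python
import numpy as np

def sample_latency(rng: np.random.Generator, cfg: dict) -> float:
    kind = cfg["type"]
    if kind == "constant":
        return cfg["value"]
    if kind == "normal":
        return max(0.0, rng.normal(cfg["mean"], cfg["std"]))  # clamp at zero
    if kind == "heavy_tailed":
        # NumPy's pareto() draws from a Lomax (support [0, inf)), so
        # base * (1 + pareto) yields a classical Pareto with minimum `base`,
        # a common way to model stragglers.
        return cfg["base"] * (1.0 + rng.pareto(cfg["alpha"]))
    if kind == "poisson":
        return float(rng.poisson(cfg["lam"]))
    raise ValueError(f"unknown distribution: {kind}")

rng = np.random.default_rng(42)
draw = sample_latency(rng, {"type": "heavy_tailed", "base": 5.0, "alpha": 2.5})
```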
## Analysis

HOPS provides comprehensive performance analysis with built-in visualizations.
| Metric | Description |
|---|---|
| Throughput | Micro-batches per ms |
| Latency | p50, p99, mean per micro-batch |
| Bubble ratio | Idle device-time fraction |
| Utilization | Compute / total time per stage |
| Communication overhead | Transfer time vs. compute |
| Synchronization overhead | All-reduce & weight update time |
| Failure impact | Fault count & total downtime |
- **Gantt chart** (`output/timeline.png`): per-device compute tasks for forward and backward passes, annotated with failure markers.
- **Dashboard** (`output/dashboard.png`): a 4-panel view with per-stage utilization, latency histogram, bubble ratio, and compute vs. communication breakdown.
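The bubble ratio reported above can be defined as the idle fraction of total device-time over the makespan. A toy sketch of that definition (not the HOPS implementation):

```python
def bubble_ratio(busy: dict[str, list[tuple[float, float]]],
                 makespan: float) -> float:
    # Fraction of total device-time (num_devices * makespan) that is idle.
    total = len(busy) * makespan
    busy_time = sum(end - start
                    for spans in busy.values()
                    for start, end in spans)
    return 1.0 - busy_time / total

# Two devices, each busy 5 ms of a 10 ms makespan -> half the time is bubble:
ratio = bubble_ratio({"gpu0": [(0.0, 5.0)], "gpu1": [(2.5, 7.5)]}, 10.0)
```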
## Extensibility

HOPS ships with GPipe and 1F1B schedulers. Add your own with the plugin API.
```python
from hops import register_scheduler
from hops.core.scheduler import Scheduler, PipelineState
from hops.core.types import StageTask, Phase

class MyScheduler(Scheduler):
    def next_tasks(self, state: PipelineState) -> list[StageTask]:
        # state.is_task_ready(stage, mb, phase)
        # state.completed_count(stage, phase)
        # state.stage_is_busy(stage)
        # state.stage_in_flight_count(stage)
        # state.all_forwards_completed()
        ...

register_scheduler("my_policy", MyScheduler)
```
Then set `policy: my_policy` in your YAML config.
## Getting Started

Get up and running in under a minute. Requires Python 3.13+.

Install uv:

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Clone and install dependencies:

```sh
git clone <repo-url> && cd HOPS
uv sync
```
Run a simulation:

```sh
# Default config
uv run python main.py

# Custom config
uv run python main.py --config my_config.yaml

# Skip visualization
uv run python main.py --no-viz
```
Run the test suite:

```sh
uv run pytest                         # all 48 tests
uv run pytest tests/test_pipeline.py  # specific file
uv run pytest -v                      # verbose output
```