Ticket 088: Performance Benchmarking and Regression Gates

Status

Planned.

Goal

Add a structured performance benchmarking suite with CI regression gates so that throughput regressions are caught before they reach main. This is a prerequisite for the hardware-in-the-loop work (Ticket 087) and any future REST API (Ticket 050), both of which require known latency budgets.

Why This Comes Before HITL and Live Flight

HITL validation and a future REST API require real-time or near-real-time execution. Without baseline benchmarks and regression gates:

A refactor that doubles estimator latency will not be caught by correctness tests alone.
There is no documented throughput contract to target for HITL integration (where the onboard companion computer may re-run estimates mid-flight).
Batch users have no stated throughput guarantee when scaling to hundreds of missions.

Current Baseline (measured 2026-05-28, pipeline_demo_001 mission)

| Workload | Throughput | Per-call latency |
| --- | --- | --- |
| Single deterministic estimate | ~6 800 estimates/s | ~0.15 ms |
| Monte Carlo (200 samples, wind uncertainty) | ~5 700 samples/s | ~0.18 ms/sample |

These numbers should be preserved or improved. A regression of more than 20% on either workload should fail CI; the gate itself is evaluated on mean latency.
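The two columns of the baseline table are reciprocals of each other, so the 20% gate can be restated as an absolute latency limit. The sketch below works this out from the rounded figures above; the helper names are illustrative and not part of the codebase.

```python
def latency_ms(throughput_per_s: float) -> float:
    """Per-call latency in milliseconds implied by a throughput in calls/s."""
    return 1000.0 / throughput_per_s


def gate_limit_ms(baseline_ms: float, max_regression: float = 0.20) -> float:
    """Largest mean latency that still passes a relative regression gate."""
    return baseline_ms * (1.0 + max_regression)


# Rounded throughput figures taken from the baseline table.
for label, per_second in [("deterministic", 6800.0), ("monte_carlo", 5700.0)]:
    base = latency_ms(per_second)
    print(f"{label}: baseline {base:.3f} ms, gate at {gate_limit_ms(base):.3f} ms")
```

With these inputs the gate trips at roughly 0.18 ms per deterministic estimate and roughly 0.21 ms per Monte Carlo sample. The real limits depend on the unrounded CI measurements.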

Scope

Benchmark harness

Add a tests/perf/ directory containing a standalone benchmark script and a pytest-benchmark suite.
Do not require pytest-benchmark in the main dev extras group; add it to an optional perf extras group so CI can opt in explicitly:
```
[project.optional-dependencies]
perf = ["pytest-benchmark>=4.0"]
```

Benchmark targets

Deterministic estimate — try_estimate_mission_distance_time with the pipeline_demo_001 mission and quadplane_v1 vehicle. No assets (no terrain, no wind grid, no geofences). Measures core execution path.
Deterministic estimate with assets — Same mission with terrain and geofences loaded. Measures asset-loading overhead vs. pure computation.
Monte Carlo (N=200, wind uncertainty) — run_monte_carlo with pipeline_demo_001_wind_uncertainty.yaml. Measures sampler throughput.
Monte Carlo (N=1000, wind uncertainty) — Scaling check; should scale approximately linearly with sample count.
Stochastic propagation (N=50 particles) — run_stochastic_propagation with pipeline_demo_001_stochastic.yaml. Measures particle propagation loop overhead.
Batch estimate (10 runs) — run_batch_manifest with a synthetic 10-run manifest. Measures per-run overhead including file I/O and schema validation.
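The N=200 versus N=1000 scaling check can be expressed as a ratio test: five times the samples should take about five times as long. The sketch below assumes a 25% tolerance, which is an illustrative choice and is not specified by this ticket. The function name is also hypothetical.

```python
def is_roughly_linear(
    mean_small_s: float,
    mean_large_s: float,
    n_small: int,
    n_large: int,
    tolerance: float = 0.25,  # assumed tolerance, not from the ticket
) -> bool:
    """Return True if runtime grew in proportion to sample count."""
    expected_ratio = n_large / n_small
    observed_ratio = mean_large_s / mean_small_s
    return abs(observed_ratio - expected_ratio) <= tolerance * expected_ratio


# At ~5 700 samples/s, 200 samples take ~0.035 s and 1000 take ~0.175 s.
print(is_roughly_linear(0.035, 0.175, n_small=200, n_large=1000))
# A run that takes twice as long as expected fails the check.
print(is_roughly_linear(0.035, 0.350, n_small=200, n_large=1000))
```

A ratio well above the expected value hints at superlinear cost in the sampler. A ratio well below it suggests fixed setup overhead dominates the smaller run.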

Regression gates

Add a make perf target, or an equivalent documented command such as uv run pytest tests/perf/ --benchmark-compare.
Store the baseline JSON in tests/perf/baseline.json, generated by:
```
uv run --extra perf pytest tests/perf/ --benchmark-json=tests/perf/baseline.json
```
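CI gate: if mean latency for any benchmark regresses by more than 20% versus the stored baseline, the job fails. In pytest-benchmark this threshold is expressed as --benchmark-compare-fail=mean:20%; --benchmark-compare alone only reports the difference.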
Baseline is updated deliberately (not automatically on every merge); a PR that intentionally changes performance includes an updated baseline.json.
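pytest-benchmark can enforce the threshold itself through its `--benchmark-compare-fail` option. To illustrate what the gate checks, the sketch below applies the same rule to two reports in pytest-benchmark's JSON layout, where each entry under `benchmarks` carries a `name` and a `stats.mean`. The function names are hypothetical and the reports are toy data, not measurements.

```python
def mean_by_name(report: dict) -> dict[str, float]:
    """Map benchmark name -> mean latency (seconds) from a JSON report."""
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def find_regressions(
    baseline: dict, current: dict, max_regression: float = 0.20
) -> list[str]:
    """List benchmarks whose mean latency grew by more than max_regression."""
    base = mean_by_name(baseline)
    cur = mean_by_name(current)
    failures = []
    for name, base_mean in base.items():
        if name not in cur:
            continue  # benchmark removed or renamed; nothing to compare
        change = (cur[name] - base_mean) / base_mean
        if change > max_regression:
            failures.append(f"{name}: mean +{change:.0%} vs baseline")
    return failures


# Toy reports for illustration only.
baseline = {"benchmarks": [{"name": "deterministic", "stats": {"mean": 0.00015}}]}
current = {"benchmarks": [{"name": "deterministic", "stats": {"mean": 0.00020}}]}
print(find_regressions(baseline, current))
```

The comparison is relative to the baseline mean, so a deliberate performance change needs a refreshed baseline.json in the same PR, as described above.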

Profiling helper

Add tests/perf/profile_estimate.py — a standalone script (not a pytest test) that runs cProfile on the deterministic estimator and prints the top-20 hotspots. Used for manual investigation, not CI.
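A minimal sketch of such a script, using only the standard-library `cProfile` and `pstats` modules. `estimate_stand_in` is a placeholder for the real estimator call, whose arguments are not shown in this ticket, and `profile_top` is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def estimate_stand_in(n: int = 20_000) -> float:
    """Placeholder workload; swap in the deterministic estimator call."""
    return sum((i * 1.5) ** 0.5 for i in range(n))


def profile_top(func, top_n: int = 20) -> str:
    """Profile one call of func and return the top_n entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so expensive call trees surface first.
    stats.sort_stats("cumulative").print_stats(top_n)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_top(estimate_stand_in))
```

Sorting by cumulative time highlights which high-level calls dominate. Sorting by `"tottime"` instead would rank functions by time spent in their own bodies.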

Documentation

Add docs/PERFORMANCE.md documenting the baseline numbers, how to run benchmarks, and how to interpret a regression failure.
Add the perf extras group to README.md install instructions.

Composition

Uses existing try_estimate_mission_distance_time, run_monte_carlo, and run_stochastic_propagation public APIs directly — no new execution paths.
Example input files in examples/missions/, examples/vehicles/, and examples/uncertainty/ are reused; no new fixtures required.
All 865 existing tests continue to pass; benchmark tests are collected only when the perf extras are installed.
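One way to keep benchmark tests out of collection when the perf extras are missing is a guard in a tests/perf/conftest.py file. This is a sketch of one possible approach, not a decision recorded in this ticket. It relies on pytest's `collect_ignore_glob` conftest variable and assumes benchmark modules are named `test_*.py`.

```python
# tests/perf/conftest.py (hypothetical file)
import importlib.util


def perf_extra_installed() -> bool:
    """True when pytest-benchmark (the perf extra) is importable."""
    return importlib.util.find_spec("pytest_benchmark") is not None


# pytest reads this module-level name from conftest.py and skips collection
# of matching files in this directory. An empty list ignores nothing.
collect_ignore_glob = [] if perf_extra_installed() else ["test_*.py"]
```

An alternative is calling `pytest.importorskip("pytest_benchmark")` at the top of each benchmark module, which reports the modules as skipped rather than leaving them uncollected.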

Acceptance Criteria

uv run --extra perf pytest tests/perf/ -v runs all benchmark tests and reports mean/min/max latency for each target.
uv run --extra perf pytest tests/perf/ --benchmark-compare=tests/perf/baseline.json --benchmark-compare-fail=mean:20% exits non-zero if the mean latency of any benchmark regresses by more than 20%.
tests/perf/baseline.json is committed and reflects measurements from the CI environment (not a developer laptop).
docs/PERFORMANCE.md exists and documents the baseline numbers and the regression gate threshold.
No production code changes are required; all changes are in tests/perf/, pyproject.toml, docs/, README.md, and the Makefile if a make perf target is added.