Ticket 088: Performance Benchmarking and Regression Gates

Status

Planned.

Goal

Add a structured performance benchmarking suite with CI regression gates so that throughput regressions are caught before they reach main. This is a prerequisite for the hardware-in-the-loop work (Ticket 087) and any future REST API (Ticket 050), both of which require known latency budgets.

Why This Comes Before HITL and Live Flight

HITL validation and a future REST API require real-time or near-real-time execution. Without baseline benchmarks and regression gates:

  • A refactor that doubles estimator latency will not be caught by correctness tests alone.
  • There is no documented throughput contract to target for HITL integration (where the onboard companion computer may re-run estimates mid-flight).
  • Batch users have no stated throughput guarantee when scaling to hundreds of missions.

Current Baseline (measured 2026-05-28, pipeline_demo_001 mission)

| Workload | Throughput | Per-call latency |
|---|---|---|
| Single deterministic estimate | ~6 800 estimates/s | ~0.15 ms |
| Monte Carlo (200 samples, wind uncertainty) | ~5 700 samples/s | ~0.18 ms/sample |

These numbers should be preserved or improved. A regression of more than 20% in mean latency on either workload should fail CI.
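
To make the gate concrete, the arithmetic behind the table can be sketched as below. This is an illustration only: the dictionary keys and function names are hypothetical, and it assumes the gate is applied to mean latency, as the Regression gates section states.

```python
# Baseline throughputs copied from the table above (calls per second).
# The workload keys are hypothetical labels, not names used by the project.
BASELINE_THROUGHPUT = {
    "single_deterministic_estimate": 6800.0,
    "monte_carlo_200_samples": 5700.0,
}
REGRESSION_TOLERANCE = 0.20  # the 20% threshold stated above


def latency_ms(throughput_per_s: float) -> float:
    """Per-call latency in milliseconds implied by a throughput."""
    return 1000.0 / throughput_per_s


def max_allowed_latency_ms(throughput_per_s: float) -> float:
    """Largest mean latency that still passes a 20% latency gate."""
    return latency_ms(throughput_per_s) * (1.0 + REGRESSION_TOLERANCE)


if __name__ == "__main__":
    for name, throughput in BASELINE_THROUGHPUT.items():
        print(
            f"{name}: baseline {latency_ms(throughput):.3f} ms, "
            f"gate at {max_allowed_latency_ms(throughput):.3f} ms"
        )
```

Under that assumption the gate sits at roughly 0.176 ms for the single estimate and 0.211 ms per Monte Carlo sample. A 20% latency increase corresponds to a throughput drop of about 16.7% (1/1.2), so the threshold should be quoted against one metric consistently.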

Scope

Benchmark harness

  • Add tests/perf/ directory with a standalone benchmark script and a pytest-benchmark suite.
  • Do not require pytest-benchmark in the main dev extras group; add it to an optional perf extras group so CI can opt in explicitly:
    [project.optional-dependencies]
    perf = ["pytest-benchmark>=4.0"]
    

Benchmark targets

  1. Deterministic estimate: try_estimate_mission_distance_time with the pipeline_demo_001 mission and quadplane_v1 vehicle. No assets (no terrain, no wind grid, no geofences). Measures the core execution path.
  2. Deterministic estimate with assets: the same mission with terrain and geofences loaded. Measures asset-loading overhead vs. pure computation.
  3. Monte Carlo (N=200, wind uncertainty): run_monte_carlo with pipeline_demo_001_wind_uncertainty.yaml. Measures sampler throughput.
  4. Monte Carlo (N=1000, wind uncertainty): scaling check; total latency should scale approximately linearly with sample count (about 5x the N=200 case).
  5. Stochastic propagation (N=50 particles): run_stochastic_propagation with pipeline_demo_001_stochastic.yaml. Measures particle propagation loop overhead.
  6. Batch estimate (10 runs): run_batch_manifest with a synthetic 10-run manifest. Measures per-run overhead including file I/O and schema validation.
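
A minimal sketch of how the standalone benchmark script could time these targets, using only the standard library. The signatures of the real APIs are not given in this ticket, so stand_in_monte_carlo is a hypothetical placeholder for run_monte_carlo, and time_calls and scaling_ratio are illustrative names.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], rounds: int = 50, warmup: int = 5) -> dict:
    """Time fn() over several rounds; return latency statistics in seconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy imports before measuring
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"mean": statistics.fmean(samples), "min": min(samples), "max": max(samples)}


def scaling_ratio(small: dict, large: dict) -> float:
    """Ratio of mean latencies; close to n_large / n_small when cost is linear."""
    return large["mean"] / small["mean"]


def stand_in_monte_carlo(n_samples: int) -> float:
    """Hypothetical placeholder for run_monte_carlo: O(n_samples) arithmetic."""
    return sum((i * 0.5) % 7.0 for i in range(n_samples))


if __name__ == "__main__":
    n200 = time_calls(lambda: stand_in_monte_carlo(200))
    n1000 = time_calls(lambda: stand_in_monte_carlo(1000))
    print(f"N=200  mean {n200['mean'] * 1e3:.4f} ms")
    print(f"N=1000 mean {n1000['mean'] * 1e3:.4f} ms")
    # Target 4 expects a ratio near 5 for linear scaling.
    print(f"scaling ratio {scaling_ratio(n200, n1000):.2f}")
```

In the real script the lambdas would call the project's public APIs with the example mission files. The pytest-benchmark suite would cover the same targets through its benchmark fixture instead of this hand-rolled timer.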

Regression gates

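  • Add a make perf target that runs uv run --extra perf pytest tests/perf/ --benchmark-compare=tests/perf/baseline.json --benchmark-compare-fail=mean:20%.
  • Store baseline JSON in tests/perf/baseline.json, generated by:
    uv run --extra perf pytest tests/perf/ --benchmark-json=tests/perf/baseline.json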
    
  • CI gate: if mean latency for any benchmark regresses by more than 20% versus the stored baseline, the job fails.
  • Baseline is updated deliberately (not automatically on every merge); a PR that intentionally changes performance includes an updated baseline.json.
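
The check the gate performs can be sketched as follows. pytest-benchmark can enforce this itself via its --benchmark-compare-fail option, so this script is only an illustration of the logic. It assumes the report layout pytest-benchmark writes (a top-level "benchmarks" list whose entries carry "name" and "stats" with a "mean" in seconds); the function names are hypothetical.

```python
import json
from pathlib import Path

THRESHOLD = 0.20  # fail when mean latency grows by more than 20%


def load_report(path: str) -> dict:
    """Read a benchmark JSON report from disk."""
    return json.loads(Path(path).read_text())


def mean_by_name(report: dict) -> dict:
    """Map benchmark name to mean latency in seconds."""
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def find_regressions(baseline: dict, current: dict, threshold: float = THRESHOLD) -> list:
    """Return (name, relative_change) for benchmarks slower than the threshold."""
    base, cur = mean_by_name(baseline), mean_by_name(current)
    regressions = []
    for name, base_mean in base.items():
        if name not in cur:
            continue  # removed or renamed benchmark; would need separate handling
        change = cur[name] / base_mean - 1.0
        if change > threshold:
            regressions.append((name, change))
    return regressions


if __name__ == "__main__":
    # In-memory reports with made-up numbers, purely to exercise the logic.
    baseline = {"benchmarks": [
        {"name": "test_estimate", "stats": {"mean": 1.0}},
        {"name": "test_monte_carlo", "stats": {"mean": 1.0}},
    ]}
    current = {"benchmarks": [
        {"name": "test_estimate", "stats": {"mean": 1.1}},
        {"name": "test_monte_carlo", "stats": {"mean": 1.5}},
    ]}
    for name, change in find_regressions(baseline, current):
        print(f"REGRESSION {name}: +{change:.0%}")
```

A CI job built on this would exit non-zero whenever the returned list is non-empty.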

Profiling helper

  • Add tests/perf/profile_estimate.py — a standalone script (not a pytest test) that runs cProfile on the deterministic estimator and prints the top-20 hotspots. Used for manual investigation, not CI.
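
A sketch of what profile_estimate.py might look like, using cProfile and pstats from the standard library. stand_in_estimate is a hypothetical placeholder for the call to try_estimate_mission_distance_time, whose arguments are not specified in this ticket, and the repeat count is an arbitrary choice.

```python
import cProfile
import io
import pstats


def stand_in_estimate() -> float:
    """Hypothetical placeholder for the deterministic estimator call."""
    total = 0.0
    for i in range(1, 2_000):
        total += (i % 13) ** 0.5
    return total


def profile_top(fn, repeats: int = 100, top_n: int = 20) -> str:
    """Profile repeated calls to fn and return the top_n entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        fn()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top_n)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_top(stand_in_estimate))
```

Repeating the call many times keeps the profile from being dominated by one-off import and setup cost, which matters when a single estimate takes well under a millisecond.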

Documentation

  • Add docs/PERFORMANCE.md documenting the baseline numbers, how to run benchmarks, and how to interpret a regression failure.
  • Add the perf extras group to README.md install instructions.

Composition

  • Uses existing try_estimate_mission_distance_time, run_monte_carlo, and run_stochastic_propagation public APIs directly — no new execution paths.
  • Example input files in examples/missions/, examples/vehicles/, and examples/uncertainty/ are reused; no new fixtures required.
  • All 865 existing tests continue to pass; benchmark tests are collected only when the perf extras are installed.
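
One possible way to keep benchmark tests out of collection when the perf extras are missing is a conftest.py in tests/perf/ that sets pytest's collect_ignore_glob. This is a sketch, not a settled design; it assumes the benchmark modules follow a test_*.py naming pattern, which this ticket does not specify.

```python
# tests/perf/conftest.py (sketch)
import importlib.util

# pytest reads collect_ignore_glob from conftest.py and skips matching
# files in this directory during collection.
collect_ignore_glob = []

if importlib.util.find_spec("pytest_benchmark") is None:
    # perf extras are not installed: do not collect any benchmark module.
    # The "test_*.py" pattern is an assumed naming convention.
    collect_ignore_glob.append("test_*.py")
```

An alternative with the same effect is calling pytest.importorskip("pytest_benchmark") at the top of each benchmark module, which reports the modules as skipped instead of hiding them.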

Acceptance Criteria

  • uv run --extra perf pytest tests/perf/ -v runs all benchmark tests and reports mean/min/max latency for each target.
  • uv run --extra perf pytest tests/perf/ --benchmark-compare=tests/perf/baseline.json --benchmark-compare-fail=mean:20% exits non-zero if any benchmark regresses by more than 20%.
  • tests/perf/baseline.json is committed and reflects measurements from the CI environment (not a developer laptop).
  • docs/PERFORMANCE.md exists and documents the baseline numbers and the regression gate threshold.
  • No production code changes are required; all changes are in tests/perf/, pyproject.toml, and docs/.