Benchmark Workflows and Guardrails

Automated quality checks for benchmark suite runs.

Use benchmark workflows to validate throughput/latency and enforce regression guardrails across default and feature-specific configs.


When to use

  • You need fast smoke checks before wider experimentation.
  • You want standardized performance baselines by preset/suite.
  • You need CI gating with warn/fail regression thresholds.

Baseline workflows

Quick smoke and broader standard runs:

dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu
dagzoo benchmark --suite standard --preset cpu --out-dir benchmarks/results/standard_cpu

Diagnostics-enabled benchmark:

dagzoo benchmark \
  --suite smoke \
  --preset cpu \
  --diagnostics \
  --out-dir benchmarks/results/smoke_cpu_diag

Device override note:

  • --device only applies when a benchmark run selects exactly one --preset.
  • Multi-preset runs must encode the device choice in each preset/config; the CLI rejects a shared --device override as ambiguous.
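
The device override rule can be pictured as a small validation step. The following is a minimal sketch only: check_device_override is a hypothetical helper that mirrors the rule described above, not dagzoo's actual implementation, and the error text is invented for illustration.

```python
from typing import Optional, Sequence


def check_device_override(presets: Sequence[str], device: Optional[str]) -> Optional[str]:
    """Return the device override to apply, or None when no override was given.

    Hypothetical helper that models the documented rule; not dagzoo code.
    """
    if device is None:
        # No --device flag: each preset/config keeps its own device choice.
        return None
    if len(presets) != 1:
        # A shared override across zero or several presets is ambiguous.
        raise ValueError(
            "--device is only valid with exactly one --preset; "
            "set the device in each preset/config instead"
        )
    return device


# Single preset: the override applies.
print(check_device_override(["cpu"], "cpu"))

# Two presets plus a shared override: rejected.
try:
    check_device_override(["cpu", "custom"], "cpu")
except ValueError as exc:
    print(f"rejected: {exc}")
```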

Feature-specific guardrail runs

Each run pairs a feature-specific config with --preset custom on the smoke suite:

dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter

dagzoo benchmark \
  --config configs/preset_missingness_mar.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_missing_mar

dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_shift_guardrails

dagzoo benchmark \
  --config configs/preset_noise_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_noise_guardrails
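
If you prefer to drive these four runs from a script, the commands can be assembled programmatically. This is a sketch under assumptions: guardrail_command and RUNS are hypothetical names, and only flags, configs, and output directories already shown above are used. The sketch prints the commands rather than executing them.

```python
import shlex
from typing import List, Sequence


def guardrail_command(config: str, out_dir: str, extra: Sequence[str] = ()) -> List[str]:
    """Assemble argv for one feature-specific smoke run (hypothetical helper)."""
    return [
        "dagzoo", "benchmark",
        "--config", config,
        "--preset", "custom",
        "--suite", "smoke",
        *extra,
        "--no-memory",
        "--out-dir", out_dir,
    ]


# (config, out_dir, extra flags) copied from the commands listed above.
RUNS = [
    ("configs/preset_filter_benchmark_smoke.yaml",
     "benchmarks/results/smoke_filter", ("--hardware-policy", "none")),
    ("configs/preset_missingness_mar.yaml",
     "benchmarks/results/smoke_missing_mar", ()),
    ("configs/preset_shift_benchmark_smoke.yaml",
     "benchmarks/results/smoke_shift_guardrails", ()),
    ("configs/preset_noise_benchmark_smoke.yaml",
     "benchmarks/results/smoke_noise_guardrails", ()),
]

for config, out_dir, extra in RUNS:
    argv = guardrail_command(config, out_dir, extra)
    # Printed for review; pass argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```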

Filter-enabled benchmark workflow

Use the filter smoke preset when you want one canonical CPU benchmark run that surfaces filter-stage throughput, accepted-corpus throughput, and acceptance yield together:

dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter

Inspect these summary.json preset-result fields first:

  • filter_datasets_per_minute
  • filter_accepted_datasets_per_minute
  • filter_accepted_datasets_measured
  • filter_rejected_datasets_measured
  • filter_acceptance_rate_dataset_level
  • filter_rejection_rate_dataset_level
  • filter_rejection_rate_attempt_level
  • filter_retry_dataset_rate

The CLI preset line prints the same headline values under the labels filter/min, filter_accepted/min, filter_accept_dataset_pct, and filter_reject_dataset_pct.
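
To pull the listed fields out of a preset result in one pass, something like the following may help. It is a sketch, not part of dagzoo: filter_headline is a hypothetical helper, and because the location of preset results inside summary.json is not described here, the function takes a single preset-result object that you have already extracted (for example after json.load). The example record uses placeholder numbers, not real benchmark output.

```python
from typing import Any, Dict, List, Tuple

# Field names copied from the list above.
FILTER_FIELDS = (
    "filter_datasets_per_minute",
    "filter_accepted_datasets_per_minute",
    "filter_accepted_datasets_measured",
    "filter_rejected_datasets_measured",
    "filter_acceptance_rate_dataset_level",
    "filter_rejection_rate_dataset_level",
    "filter_rejection_rate_attempt_level",
    "filter_retry_dataset_rate",
)


def filter_headline(preset_result: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Pair each documented filter field with its value (None when absent)."""
    return [(name, preset_result.get(name)) for name in FILTER_FIELDS]


# Placeholder record for illustration only; real values come from summary.json.
example_result = {
    "filter_datasets_per_minute": 120.0,
    "filter_accepted_datasets_per_minute": 90.0,
}

for name, value in filter_headline(example_result):
    print(f"{name:45s} {value}")
```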


Diversity audit workflow

Use diversity-audit when you need a baseline-vs-variant comparison of the accepted corpus, not just benchmark throughput:

dagzoo diversity-audit \
  --baseline-config configs/default.yaml \
  --variant-config configs/preset_shift_benchmark_smoke.yaml \
  --suite smoke \
  --num-datasets 10 \
  --warmup 0 \
  --device cpu \
  --out-dir benchmarks/results/diversity_audit_shift

Inspect these summary.json fields first:

  • comparisons[*].diversity_status
  • comparisons[*].diversity_composite_shift_pct
  • comparisons[*].datasets_per_minute_delta_pct
  • comparisons[*].filter_accepted_datasets_per_minute_delta_pct

The rewritten audit persists summary.json and summary.md as its canonical outputs for both equivalence/local-overlap and cross-run diversity results.
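
The comparisons[*] paths above suggest that summary.json holds a top-level comparisons list. Assuming that layout, a short sketch for tabulating the four fields could look like this. comparison_rows is a hypothetical helper, and the example summary uses placeholder values, including a made-up status string.

```python
from typing import Any, Dict, List

# Field names copied from the list above.
COMPARISON_FIELDS = (
    "diversity_status",
    "diversity_composite_shift_pct",
    "datasets_per_minute_delta_pct",
    "filter_accepted_datasets_per_minute_delta_pct",
)


def comparison_rows(summary: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract the documented fields from each entry of summary['comparisons']."""
    return [
        {field: comparison.get(field) for field in COMPARISON_FIELDS}
        for comparison in summary.get("comparisons", [])
    ]


# In practice: summary = json.loads(Path(out_dir, "summary.json").read_text())
# Placeholder data for illustration only.
example_summary = {
    "comparisons": [
        {"diversity_status": "example-status", "diversity_composite_shift_pct": 1.5},
    ]
}

for row in comparison_rows(example_summary):
    for field, value in row.items():
        print(f"{field}: {value}")
```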


Filter calibration workflow

Use filter-calibration when you want to sweep filter thresholds on one filter-enabled config and rank accepted-corpus throughput against diversity shift:

dagzoo filter-calibration \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --suite smoke \
  --device cpu \
  --out-dir benchmarks/results/filter_calibration_smoke

Inspect these summary.json fields first:

  • summary.best_overall_threshold_requested
  • summary.best_passing_threshold_requested
  • summary.best_overall_diversity_status
  • candidates[*].filter_accepted_datasets_per_minute
  • candidates[*].filter_acceptance_rate_dataset_level
  • candidates[*].diversity_status
  • candidates[*].diversity_composite_shift_pct

Like the rewritten diversity audit, filter calibration persists only summary.json and summary.md.
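
For a quick manual view of the sweep, you can sort the candidates by accepted-corpus throughput and read the diversity shift alongside. This is a sketch assuming candidates is a top-level list, as the candidates[*] paths suggest. rank_candidates is a hypothetical helper and does not reproduce how dagzoo selects summary.best_overall_threshold_requested. The example values are placeholders.

```python
from typing import Any, Dict, List


def rank_candidates(summary: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Sort candidates by accepted-corpus throughput, highest first.

    Candidates missing the throughput field are skipped.
    """
    key = "filter_accepted_datasets_per_minute"
    usable = [c for c in summary.get("candidates", []) if c.get(key) is not None]
    return sorted(usable, key=lambda c: c[key], reverse=True)


# In practice: summary = json.loads(Path(out_dir, "summary.json").read_text())
# Placeholder data for illustration only.
example_summary = {
    "candidates": [
        {"filter_accepted_datasets_per_minute": 40.0, "diversity_composite_shift_pct": 0.8},
        {"filter_accepted_datasets_per_minute": 75.0, "diversity_composite_shift_pct": 3.2},
        {"diversity_composite_shift_pct": 1.0},  # no throughput recorded: skipped
    ]
}

for rank, candidate in enumerate(rank_candidates(example_summary), start=1):
    print(
        rank,
        candidate["filter_accepted_datasets_per_minute"],
        candidate.get("diversity_composite_shift_pct"),
    )
```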


Regression gating

For CI-like checks, set warn and fail thresholds and pass --fail-on-regression:

dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --warn-threshold-pct 10 \
  --fail-threshold-pct 20 \
  --fail-on-regression \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/ci_smoke_shift_local
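
As a mental model for the two thresholds, a regression percentage can be bucketed into pass, warn, or fail. The sketch below is an assumption-laden illustration, not dagzoo's logic: classify_regression and exit_code are hypothetical, the inclusive comparisons are a guess, and how the regression percentage is computed against a baseline is not covered here.

```python
def classify_regression(
    regression_pct: float,
    warn_threshold_pct: float = 10.0,
    fail_threshold_pct: float = 20.0,
) -> str:
    """Bucket a regression percentage (hypothetical semantics, inclusive bounds)."""
    if regression_pct >= fail_threshold_pct:
        return "fail"
    if regression_pct >= warn_threshold_pct:
        return "warn"
    return "pass"


def exit_code(status: str, fail_on_regression: bool) -> int:
    """Assumed behavior: non-zero exit only for 'fail' when gating is enabled."""
    return 1 if (fail_on_regression and status == "fail") else 0


for pct in (5.0, 12.0, 25.0):
    status = classify_regression(pct)
    print(pct, status, exit_code(status, fail_on_regression=True))
```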

What to inspect

When present in a run summary, inspect:

  • missingness_guardrails
  • lineage_guardrails
  • shift_guardrails
  • noise_guardrails

Also review the throughput and latency aggregates to spot trends across presets and suites.
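
A short check for which guardrail sections a given run summary contains might look like the following. It is a sketch assuming the guardrail keys sit at the top level of the loaded summary. present_guardrails is a hypothetical helper, and the example summary is a placeholder with empty sections.

```python
from typing import Any, Dict

# Key names copied from the list above.
GUARDRAIL_KEYS = (
    "missingness_guardrails",
    "lineage_guardrails",
    "shift_guardrails",
    "noise_guardrails",
)


def present_guardrails(summary: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the guardrail sections that exist in this summary."""
    return {key: summary[key] for key in GUARDRAIL_KEYS if key in summary}


# In practice: summary = json.loads(Path(out_dir, "summary.json").read_text())
# Placeholder data for illustration only.
example_summary = {"shift_guardrails": {}, "noise_guardrails": {}}

found = present_guardrails(example_summary)
for key in GUARDRAIL_KEYS:
    print(f"{key}: {'present' if key in found else 'absent'}")
```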