Benchmark Workflows and Guardrails

Automated quality checks for benchmark suite runs.

Use benchmark workflows to validate throughput/latency and enforce regression guardrails across default and feature-specific configs.

When to use

You need fast smoke checks before wider experimentation.
You want standardized performance baselines by preset/suite.
You need CI gating with warn/fail regression thresholds.

Baseline workflows

Quick smoke and broader standard runs:

dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu
dagzoo benchmark --suite standard --preset cpu --out-dir benchmarks/results/standard_cpu

Diagnostics-enabled benchmark:

dagzoo benchmark \
  --suite smoke \
  --preset cpu \
  --diagnostics \
  --out-dir benchmarks/results/smoke_cpu_diag

Device override note:

--device only applies when a benchmark run selects exactly one --preset.
Multi-preset runs must encode device choice in each preset/config; the CLI now rejects ambiguous shared --device overrides.

Feature-specific guardrail runs

dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter

dagzoo benchmark \
  --config configs/preset_missingness_mar.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_missing_mar

dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_shift_guardrails

dagzoo benchmark \
  --config configs/preset_noise_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_noise_guardrails

Filter-enabled benchmark workflow

Use the filter smoke preset when you want one canonical CPU benchmark run that surfaces filter-stage throughput, accepted-corpus throughput, and acceptance yield together:

dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter

Inspect these summary.json preset-result fields first:

filter_datasets_per_minute
filter_accepted_datasets_per_minute
filter_accepted_datasets_measured
filter_rejected_datasets_measured
filter_acceptance_rate_dataset_level
filter_rejection_rate_dataset_level
filter_rejection_rate_attempt_level
filter_retry_dataset_rate

The CLI preset line prints the same headline values as filter/min, filter_accepted/min, filter_accept_dataset_pct, and filter_reject_dataset_pct.

Diversity audit workflow

Use diversity-audit when you need a baseline-vs-variant comparison of the accepted corpus, not just benchmark throughput:

dagzoo diversity-audit \
  --baseline-config configs/default.yaml \
  --variant-config configs/preset_shift_benchmark_smoke.yaml \
  --suite smoke \
  --num-datasets 10 \
  --warmup 0 \
  --device cpu \
  --out-dir benchmarks/results/diversity_audit_shift

Inspect these summary.json fields first:

comparisons[*].diversity_status
comparisons[*].diversity_composite_shift_pct
comparisons[*].datasets_per_minute_delta_pct
comparisons[*].filter_accepted_datasets_per_minute_delta_pct

The rewritten audit persists summary.json and summary.md as the canonical equivalence/local-overlap and cross-run diversity outputs.

Filter calibration workflow

Use filter-calibration when you want to sweep filter thresholds on one filter-enabled config and rank accepted-corpus throughput against diversity shift:

dagzoo filter-calibration \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --suite smoke \
  --device cpu \
  --out-dir benchmarks/results/filter_calibration_smoke

Inspect these summary.json fields first:

summary.best_overall_threshold_requested
summary.best_passing_threshold_requested
summary.best_overall_diversity_status
candidates[*].filter_accepted_datasets_per_minute
candidates[*].filter_acceptance_rate_dataset_level
candidates[*].diversity_status
candidates[*].diversity_composite_shift_pct

Like the rewritten diversity audit, filter calibration persists only summary.json and summary.md.

Regression gating

For CI-like checks:

dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --warn-threshold-pct 10 \
  --fail-threshold-pct 20 \
  --fail-on-regression \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/ci_smoke_shift_local

What to inspect

When present in a run summary, inspect:

missingness_guardrails
lineage_guardrails
shift_guardrails
noise_guardrails

Also review throughput/latency aggregates for preset/suite trends.

Workflow hub: usage-guide.md
Output contract: output-format.md
Noise workflows: noise.md