# Benchmark Workflows and Guardrails
Use benchmark workflows to validate throughput/latency and enforce regression guardrails across default, feature-specific, and stress-profile configs.
## When to use
- You need fast smoke checks before wider experimentation.
- You want standardized performance baselines by preset/suite.
- You need CI gating with warn/fail regression thresholds.
## Baseline workflows

Quick smoke and broader standard runs:

```bash
dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu
dagzoo benchmark --suite standard --preset cpu --out-dir benchmarks/results/standard_cpu
```
Diagnostics-enabled benchmark:

```bash
dagzoo benchmark \
  --suite smoke \
  --preset cpu \
  --diagnostics \
  --out-dir benchmarks/results/smoke_cpu_diag
```
Device override notes:

- `--device` only applies when a benchmark run selects exactly one `--preset`.
- Multi-preset runs must encode device choice in each preset/config; the CLI now
  rejects ambiguous shared `--device` overrides.
## Feature-specific guardrail runs

```bash
dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter
```

```bash
dagzoo benchmark \
  --config configs/preset_missingness_mar.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_missing_mar
```

```bash
dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_shift
```

```bash
dagzoo benchmark \
  --config configs/preset_noise_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_noise
```

```bash
dagzoo benchmark \
  --config configs/preset_steering_anti_memorization_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --diagnostics \
  --no-memory \
  --out-dir benchmarks/results/smoke_steering
```

```bash
dagzoo benchmark \
  --config configs/preset_stress_classification_slice_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_stress_classification_slice
```

```bash
dagzoo benchmark \
  --config configs/preset_stress_graph_breadth_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_stress_graph_breadth
```

```bash
dagzoo benchmark \
  --config configs/preset_stress_compositional_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_stress_compositional
```
## Filter-enabled benchmark workflow
Use the filter smoke preset when you want one canonical CPU benchmark run that surfaces filter-stage throughput, accepted-corpus throughput, and acceptance yield together:
```bash
dagzoo benchmark \
  --config configs/preset_filter_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/smoke_filter
```
This preset measures the structural replay path directly, so the smoke run’s filter throughput and yield reflect lineage-based acceptance instead of a separate learned-filter stage.
Inspect these `summary.json` preset-result fields first:

- `filter_datasets_per_minute`
- `filter_accepted_datasets_per_minute`
- `filter_accepted_datasets_measured`
- `filter_rejected_datasets_measured`
- `filter_acceptance_rate_dataset_level`
- `filter_rejection_rate_dataset_level`
- `filter_rejection_rate_attempt_level`
- `filter_retry_dataset_rate`
The CLI preset line prints the same headline values as `filter/min`,
`filter_accepted/min`, `filter_accept_dataset_pct`, and
`filter_reject_dataset_pct`.
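A quick way to sanity-check those fields is to derive the acceptance rate from the measured counts and compare it against the headline values the CLI prints. This is a minimal sketch: the field names come from the list above, but the nesting and the sample numbers are hypothetical stand-ins for what a real run writes into `summary.json`.

```python
# Hypothetical preset-result fragment; real values come from summary.json in
# the --out-dir of the filter smoke run. Field names follow the list above.
preset_result = {
    "filter_datasets_per_minute": 120.0,
    "filter_accepted_datasets_per_minute": 96.0,
    "filter_accepted_datasets_measured": 48,
    "filter_rejected_datasets_measured": 12,
}

accepted = preset_result["filter_accepted_datasets_measured"]
rejected = preset_result["filter_rejected_datasets_measured"]
acceptance_rate = accepted / (accepted + rejected)  # dataset-level acceptance

print(f"filter/min: {preset_result['filter_datasets_per_minute']:.1f}")
print(f"filter_accepted/min: {preset_result['filter_accepted_datasets_per_minute']:.1f}")
print(f"filter_accept_dataset_pct: {acceptance_rate:.0%}")
```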
## Diversity audit workflow

Use `diversity-audit` when you need a baseline-vs-variant comparison of the
accepted corpus, not just benchmark throughput:
```bash
dagzoo diversity-audit \
  --baseline-config configs/default.yaml \
  --variant-config configs/preset_shift_benchmark_smoke.yaml \
  --suite smoke \
  --num-datasets 10 \
  --warmup 0 \
  --device cpu \
  --out-dir benchmarks/results/diversity_audit_shift
```
Inspect these `summary.json` fields first:

- `comparisons[*].diversity_status`
- `comparisons[*].diversity_composite_shift_pct`
- `comparisons[*].datasets_per_minute_delta_pct`
- `comparisons[*].filter_accepted_datasets_per_minute_delta_pct`
The rewritten audit persists `summary.json` and `summary.md` as the canonical
equivalence/local-overlap and cross-run diversity outputs.
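Reading the comparison records off the audit summary can be sketched as below. The layout follows the `comparisons[*]` field paths listed above, but the sample record, its numbers, and the status value are hypothetical, not the CLI's exact vocabulary.

```python
# Hypothetical audit summary fragment; a real run writes summary.json into the
# --out-dir. Field paths follow the comparisons[*] list above; values are made up.
summary = {
    "comparisons": [
        {
            "diversity_status": "ok",
            "diversity_composite_shift_pct": 14.2,
            "datasets_per_minute_delta_pct": -3.1,
            "filter_accepted_datasets_per_minute_delta_pct": -2.4,
        }
    ]
}

for cmp in summary["comparisons"]:
    # One line per baseline-vs-variant comparison: status, composite diversity
    # shift, and the throughput cost paid for that shift.
    print(
        cmp["diversity_status"],
        f"diversity_shift={cmp['diversity_composite_shift_pct']:+.1f}%",
        f"throughput_delta={cmp['datasets_per_minute_delta_pct']:+.1f}%",
    )
```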
For robustness stress profiles, use `configs/default.yaml` as the baseline and
swap one stress benchmark preset in as the variant:
```bash
dagzoo diversity-audit \
  --baseline-config configs/default.yaml \
  --variant-config configs/preset_stress_graph_breadth_benchmark_smoke.yaml \
  --suite smoke \
  --num-datasets 10 \
  --warmup 0 \
  --device cpu \
  --out-dir benchmarks/results/diversity_audit_stress_graph_breadth
```
Use the same pattern with:

- `configs/preset_stress_classification_slice_benchmark_smoke.yaml`
- `configs/preset_stress_graph_breadth_benchmark_smoke.yaml`
- `configs/preset_stress_compositional_benchmark_smoke.yaml`
## Regression gating
For CI-like checks:
```bash
dagzoo benchmark \
  --config configs/preset_shift_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --warn-threshold-pct 10 \
  --fail-threshold-pct 20 \
  --fail-on-regression \
  --hardware-policy none \
  --no-memory \
  --out-dir benchmarks/results/ci_smoke_shift_local
```
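The warn/fail semantics implied by `--warn-threshold-pct 10` and `--fail-threshold-pct 20` can be sketched as a throughput-drop classifier. This is an illustrative model of the gating behavior, not `dagzoo`'s actual comparison code; the exact metric and direction the CLI compares may differ.

```python
# Illustrative sketch of warn/fail gating semantics: classify a throughput drop
# relative to a stored baseline against the two percentage thresholds.
# Not dagzoo's actual implementation.
def classify_regression(baseline_dpm: float, current_dpm: float,
                        warn_pct: float = 10.0, fail_pct: float = 20.0) -> str:
    drop_pct = (baseline_dpm - current_dpm) / baseline_dpm * 100.0
    if drop_pct >= fail_pct:
        return "fail"   # with --fail-on-regression, CI would reject the run
    if drop_pct >= warn_pct:
        return "warn"   # surfaced but non-fatal
    return "ok"

print(classify_regression(100.0, 95.0))   # 5% drop  -> ok
print(classify_regression(100.0, 85.0))   # 15% drop -> warn
print(classify_regression(100.0, 70.0))   # 30% drop -> fail
```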
## What to inspect

When present in a run summary, inspect:

- `preset_results[*].scenarios.filtering`
- `preset_results[*].scenarios.missingness`
- `preset_results[*].scenarios.shift`
- `preset_results[*].scenarios.noise`
- `preset_results[*].scenarios.throughput`
Also review throughput/latency aggregates for preset/suite trends.
For diagnostics-enabled runs, inspect `preset_results[*].diagnostics_artifacts`
first, then open the `coverage_summary.json` / `coverage_summary.md` files it
points to. Steering stays on the diagnostics artifact contract; benchmark
summaries do not emit a separate `steering_guardrails` field.
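Following those artifact pointers can be sketched as below. The summary layout and the `coverage_pct` field are hypothetical stand-ins; a real run writes `summary.json` into `--out-dir` and the artifact paths point at the actual coverage files.

```python
import json
import pathlib
import tempfile

# Set up a stand-in artifact on disk so the sketch is self-contained; a real
# diagnostics run produces coverage_summary.json itself.
out_dir = pathlib.Path(tempfile.mkdtemp())
coverage_path = out_dir / "coverage_summary.json"
coverage_path.write_text(json.dumps({"coverage_pct": 92.5}))  # hypothetical field

# Hypothetical run-summary fragment mirroring preset_results[*].diagnostics_artifacts.
summary = {
    "preset_results": [
        {"preset": "custom", "diagnostics_artifacts": [str(coverage_path)]}
    ]
}

for result in summary["preset_results"]:
    for artifact in result.get("diagnostics_artifacts", []):
        if artifact.endswith("coverage_summary.json"):
            coverage = json.loads(pathlib.Path(artifact).read_text())
            print(result["preset"], coverage["coverage_pct"])
```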
## Related docs
- Workflow hub: usage-guide.md
- Output contract: output-format.md
- Noise workflows: noise.md
- Stress profiles: stress-profiles.md
- Steering workflows: steering.md