# Agent Program
Use this contract when you are running or reviewing a selected anchor-only
system-delta sweep in tab-foundry.
Use docs/workflows.md for command syntax and artifact expectations. Use this
contract for the objective, the locked comparison surface, the queue
discipline, and the interpretation policy.
Treat this as a research-execution contract, not the architecture roadmap. A
selected sweep may intentionally hold a PFN-adjacent, hybrid diagnostic, or
sandwich-local surface fixed while isolating one question. The long-term
direction for the public model surface still comes from
docs/development/roadmap.md and docs/development/model-architecture.md.
## Overview

Read this contract when you need to know:
- what the selected anchor is
- what is allowed to change in one sweep row
- which artifacts a row must produce
- how to interpret a result without over-claiming
This is not the right page if you only want a general repo overview. Start with
README.md, docs/workflows.md, or
docs/development/model-architecture.md for that broader context.
## Objective
Optimize for attributable evidence against the selected sweep anchor, not for rapid base promotion.
The primary score is `final_bpc` when the selected sweep surface resolves a
sandwich `cell_bpc` lane; otherwise it falls back to the task-family
classification metric such as `final_log_loss`.
When the benchmark family changes, switch the sweep target with it:
- sandwich `cell_bpc` lane: `final_bpc`
- classification fallback: `final_log_loss`
Supporting metrics are:
- sandwich `cell_bpc` lane: `final_bpf`, plus classification diagnostics when
  those are also reported
- binary classification fallback: `final_brier_score`, `final_roc_auc`,
  `best_roc_auc`, `final_minus_best`
- multiclass classification: `final_brier_score`, with ROC AUC retained only
  as a diagnostic when it is reported
- training-time deltas versus the anchor
- manifest and preprocessing surface deltas recorded in
  `training_surface_record.json`
- loss or gradient instability evidence from `train_history.jsonl`,
  `gradient_history.jsonl`, and `telemetry.json`
`best_roc_auc` remains a tie-breaker and diagnostic for classification sweeps,
not the main score, and `final_log_loss` becomes a fallback rather than the
primary benchmark score when `final_bpc` is available.
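The metric-selection policy above can be sketched in Python. This is a minimal illustration, not a tab-foundry API: the function names (`primary_metric`, `tiebreakers`) and the task-family labels are assumptions; only the metric names come from this contract.

```python
# Illustrative sketch of the primary-score and tie-breaker policy described
# above. Nothing here is real tab-foundry code.

def primary_metric(reported_metrics: set[str]) -> str:
    """Pick the sweep's primary score for one benchmarked row."""
    # final_bpc is primary whenever the sandwich cell_bpc lane resolves it.
    if "final_bpc" in reported_metrics:
        return "final_bpc"
    # Otherwise fall back to the task-family classification metric.
    return "final_log_loss"

def tiebreakers(task_family: str) -> list[str]:
    """Supporting diagnostics per task family (never the main score)."""
    if task_family == "binary_classification":
        return ["final_brier_score", "final_roc_auc",
                "best_roc_auc", "final_minus_best"]
    if task_family == "multiclass_classification":
        # ROC AUC is retained only as a diagnostic when reported.
        return ["final_brier_score"]
    # Sandwich cell_bpc lane.
    return ["final_bpf"]
```

The key point the sketch encodes is precedence: `final_log_loss` is only the primary score when `final_bpc` is absent, and ROC-AUC metrics never outrank either.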
## Locked Anchor Surface
Hold this surface fixed unless the queue row explicitly declares a different dimension family:
- selected sweep metadata: `reference/system_delta_sweeps/<sweep_id>/sweep.yaml`
- selected canonical queue: `reference/system_delta_sweeps/<sweep_id>/queue.yaml`
- selected canonical matrix: `reference/system_delta_sweeps/<sweep_id>/matrix.md`
- selected anchor run id: `anchor_run_id` from the chosen `sweep.yaml`
- canonical benchmark bundle: `benchmark_bundle_path` from the chosen `sweep.yaml`
- canonical control baseline id: `control_baseline_id` from the chosen `sweep.yaml`
- canonical benchmark registry: `src/tab_foundry/bench/benchmark_run_registry_v1.json`
- delta catalog: `reference/system_delta_catalog.yaml`
- sweep index: `reference/system_delta_sweeps/index.yaml`
- research template: `reference/system_delta_campaign_template.md`
- research sources: `reference/stage_research_sources.yaml`
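As a hedged sketch, the locked surface for one sweep can be resolved like this. The `anchor_surface` helper is hypothetical; the repository paths and the `sweep.yaml` keys (`anchor_run_id`, `benchmark_bundle_path`, `control_baseline_id`) are the ones named in this contract, and the sweep metadata is assumed to arrive already parsed into a dict.

```python
# Hypothetical resolver for the locked anchor surface of one sweep id.
# Only the paths and key names are documented; the function is illustrative.
from pathlib import Path

def anchor_surface(sweep_id: str, sweep_meta: dict) -> dict:
    """Collect the locked comparison surface for one selected sweep."""
    root = Path("reference/system_delta_sweeps") / sweep_id
    return {
        "sweep": root / "sweep.yaml",
        "queue": root / "queue.yaml",
        "matrix": root / "matrix.md",
        # These three come from the chosen sweep.yaml and stay invariant.
        "anchor_run_id": sweep_meta["anchor_run_id"],
        "benchmark_bundle_path": sweep_meta["benchmark_bundle_path"],
        "control_baseline_id": sweep_meta["control_baseline_id"],
        "registry": Path("src/tab_foundry/bench/benchmark_run_registry_v1.json"),
    }
```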
Keep these invariant by default:
- benchmark bundle path
- control baseline id
- the queue-declared `training_experiment`, `training_config_profile`, and
  `surface_role`
- PFN control lane
- hybrid diagnostic lane
- canonical architecture-screen surface
- history, checkpoint, benchmark, and `training_surface_record.json` artifact
  contracts
The benchmark registry is the historical system of record. Registry-resolved
`outputs/staged_ladder/...` artifact paths are convenience runtime references
for local workspaces; they may be absent in a fresh clone or CI checkout.
Resolve canonical identity through
`src/tab_foundry/bench/benchmark_run_registry_v1.json`.
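A lookup against the registry might look like the sketch below. The registry schema shown here (a top-level `runs` list with `run_id` and `artifact_path` keys) is an assumption for illustration; only the registry file path and the tolerance for absent `outputs/staged_ladder/...` artifacts come from this contract.

```python
# Sketch only: the registry schema ("runs", "run_id", "artifact_path") is
# assumed, not documented. Identity comes from the registry; the runtime
# artifact path is a convenience that may not exist in a fresh checkout.
import json
from pathlib import Path

REGISTRY_PATH = Path("src/tab_foundry/bench/benchmark_run_registry_v1.json")

def resolve_run(run_id: str, registry: dict) -> dict:
    """Resolve canonical run identity, tolerating missing local artifacts."""
    for entry in registry.get("runs", []):
        if entry.get("run_id") == run_id:
            entry = dict(entry)  # do not mutate the registry in place
            artifact = entry.get("artifact_path")
            # Absent artifacts are expected in fresh clones or CI checkouts.
            entry["artifact_present"] = bool(artifact) and Path(artifact).exists()
            return entry
    raise KeyError(f"run {run_id!r} not found in {REGISTRY_PATH}")
```

Callers would load the registry with `json.loads(REGISTRY_PATH.read_text())`; the point of the sketch is that a missing `artifact_path` downgrades gracefully instead of invalidating the registry entry.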
## Dimension Families
This workflow is not architecture-only. Every queue row must isolate exactly one declared dimension family against the anchor:
- model
- training
- data
- preprocessing
Examples of valid dimensions include:
- module selection inside `tabfoundry_staged`
- training data source and manifest root
- dagzoo provenance for a manifest-backed surface
- runtime preprocessing and encoding policy
Any mechanism, data, or preprocessing candidate is allowed as long as the row states the exact preserved settings and the exact changed settings.
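The one-family rule above is mechanical enough to check in code. A minimal sketch, assuming a queue row is a dict whose family keys hold the declared overrides (the row shape is hypothetical; the four family names are from this contract):

```python
# Illustrative check that a queue row isolates exactly one dimension family.
# The row layout (family name -> list of overrides) is an assumption.
DIMENSION_FAMILIES = {"model", "training", "data", "preprocessing"}

def active_family(row: dict) -> str:
    """Return the single declared family, or fail the row."""
    declared = [f for f in sorted(DIMENSION_FAMILIES) if row.get(f)]
    if len(declared) != 1:
        raise ValueError(
            f"row must isolate exactly one dimension family, got {declared}"
        )
    return declared[0]
```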
## Queue And Matrix
The canonical source-of-truth hierarchy is:
- `reference/system_delta_catalog.yaml`
- `reference/system_delta_sweeps/index.yaml`
- `reference/system_delta_sweeps/<sweep_id>/queue.yaml`
- `reference/system_delta_sweeps/<sweep_id>/matrix.md`
There is no repo-global active sweep and no generated top-level queue or
matrix alias. Use explicit `--sweep-id` selection instead.
The queue must carry, at minimum:
- `order`, `delta_ref`, `status`
- `description`, `rationale`, `hypothesis`
- model or data or preprocessing labels and the one active override family
- `parameter_adequacy_plan`
- `execution_policy`
- `run_id`, `followup_run_ids`
- `decision`, `interpretation_status`, `confounders`, `next_action`, `notes`
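A field-presence check over that minimum set can be sketched as below. The validator function is hypothetical; the field names are the ones this contract requires.

```python
# Illustrative validator for the minimum queue-row fields named above.
# Not a tab-foundry API; it only checks presence, not values.
REQUIRED_QUEUE_FIELDS = (
    "order", "delta_ref", "status",
    "description", "rationale", "hypothesis",
    "parameter_adequacy_plan", "execution_policy",
    "run_id", "followup_run_ids",
    "decision", "interpretation_status", "confounders", "next_action", "notes",
)

def missing_queue_fields(row: dict) -> list[str]:
    """Return required fields absent from one parsed queue row."""
    return [field for field in REQUIRED_QUEUE_FIELDS if field not in row]
```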
The matrix must be rerendered from the selected sweep plus the canonical benchmark registry. Metrics belong in the registry, not duplicated in the queue.
Use `tab-foundry research sweep` to:
- create a new sweep with either a parent sweep or an explicit training surface
- list rows in order for one explicit `--sweep-id`
- print the next `ready` row for one explicit `--sweep-id`
- validate completed rows for one explicit `--sweep-id`
- render `reference/system_delta_sweeps/<sweep_id>/matrix.md`
Every benchmark-facing run belongs to exactly one sweep_id. New complexity
passes should create a new sweep instead of mutating an old completed one.
## Required Research Package
Before any empirical run for a queue row, create:
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/research_card.md`
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/campaign.yaml`
After a `benchmark_full` row is benchmarked and registered, also create:
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/result_card.md`
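The package layout above can be expressed as a small path helper. The function is illustrative; the directory pattern and file names are the documented ones.

```python
# Hypothetical helper that materializes the documented research-package
# layout for one queue row. Paths follow the contract; the helper does not.
from pathlib import Path

def research_package(sweep_id: str, delta_id: str,
                     benchmark_full: bool) -> list[Path]:
    """List the artifacts one row's research package must contain."""
    root = Path("outputs/staged_ladder/research") / sweep_id / delta_id
    files = [root / "research_card.md", root / "campaign.yaml"]
    if benchmark_full:
        # result_card.md is required only after a benchmark_full row is
        # benchmarked and registered.
        files.append(root / "result_card.md")
    return files
```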
Use `reference/system_delta_campaign_template.md` and
`reference/stage_research_sources.yaml`.
Agents should use optional sibling-workspace sources when available, but must be able to proceed from the required repo-local sources alone.
Benchmark-facing conclusions must cite the locked bundle path,
`cls_benchmark_linear_v2`, `training_surface_record.json`, `research_card.md`,
`campaign.yaml`, and `result_card.md`. Evidence collected only on the hybrid
diagnostic lane may guide diagnosis, but it is not by itself benchmark-facing
promotion evidence for the architecture-screen surface.
Every completed run must have a `training_surface_record.json` artifact. That
record is the system-surface evidence source for:
- model surface labels and effective module selections
- surfaced subsystem hyperparameters
- data source and manifest fingerprint
- dagzoo provenance references when applicable
- dataset-characteristic summaries
- preprocessing surface labels and explicit overrides
Queue reruns for instability debugging must also produce:
- `train_history.jsonl` as the canonical scalar timeline
- `gradient_history.jsonl` with module-level gradient traces
- `telemetry.json` with run summary, artifact pointers, checkpoint snapshots,
  missingness diagnostics, and failure context
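A completeness check over those required artifacts can be sketched as follows. The assumption that all four files sit at the run directory root is mine; the file names are from this contract.

```python
# Illustrative artifact-completeness check for one rerun directory.
# Layout assumption: all required artifacts live at the run root.
from pathlib import Path

REQUIRED_ARTIFACTS = (
    "training_surface_record.json",
    "train_history.jsonl",
    "gradient_history.jsonl",
    "telemetry.json",
)

def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required instability artifacts absent from run_dir."""
    return [name for name in REQUIRED_ARTIFACTS
            if not (run_dir / name).exists()]
```

A rerun row would only count as complete when this returns an empty list.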
Treat the completed first-pass `binary_md_v1` outputs under
`outputs/staged_ladder/sd_binary_md_v1_*` as read-only baseline evidence. Do
not overwrite those run directories when adding instability instrumentation;
use fresh rerun roots such as `outputs/staged_ladder/<run_id>_diag_v1/train`.
Historical runs can only be audited from their scalar histories; true
module-level traces exist only for new reruns.
## Execution Loop
For each queue row:
- Select the sweep explicitly with `--sweep-id` and load
  `reference/system_delta_sweeps/<sweep_id>/sweep.yaml`.
- Select the next `status=ready` row from
  `reference/system_delta_sweeps/<sweep_id>/queue.yaml`.
- Write or update `research_card.md` and `campaign.yaml`.
- Train on the locked anchor surface, changing only the declared dimension.
- Ensure the run has `training_surface_record.json`, `gradient_history.jsonl`,
  and `telemetry.json`.
- If `execution_policy=screen_only`, stop after recording screen metrics in
  the queue and rerendering the matrix; skip benchmark registration, do not
  write `result_card.md`, and treat the row as diagnostic only.
- If `execution_policy=benchmark_full`, benchmark on the bundle declared by
  the selected sweep metadata.
- If `execution_policy=benchmark_full`, register the benchmark-facing run in
  `src/tab_foundry/bench/benchmark_run_registry_v1.json`, including its
  `sweep_id`.
- If `execution_policy=benchmark_full`, write `result_card.md`.
- Rerender `reference/system_delta_sweeps/<sweep_id>/matrix.md`.
- Update the queue row status, run ids, interpretation, and next action.
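The per-row loop can be sketched as orchestration code. Every operation here is a caller-supplied placeholder; none of these names are real tab-foundry functions. The sketch exists to show the branch structure: `screen_only` rows skip registration and the result card entirely.

```python
# Illustrative driver for one queue row. The "ops" callables are injected
# placeholders, not tab-foundry APIs; only the step order and the
# execution_policy branch come from this contract.

def run_row(row: dict, ops: dict) -> str:
    """Drive one queue row through the documented steps."""
    ops["write_research_package"](row)      # research_card.md + campaign.yaml
    ops["train_on_anchor_surface"](row)     # only the declared dimension changes
    ops["check_artifacts"](row)             # surface record, gradients, telemetry
    if row["execution_policy"] == "screen_only":
        # Diagnostic only: no registration, no result_card.md.
        ops["record_screen_metrics"](row)
    else:  # benchmark_full
        ops["benchmark_on_locked_bundle"](row)
        ops["register_run"](row)            # registry entry includes sweep_id
        ops["write_result_card"](row)
    ops["rerender_matrix"](row)
    ops["update_queue_row"](row)
    return row["execution_policy"]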
To rank the existing first-pass `binary_md_v1` outputs before rerunning,
generate the scalar instability audit report under
`outputs/staged_ladder/reports/` with:

```sh
.venv/bin/python scripts/bench/instability_audit.py \
  --staged-ladder-root outputs/staged_ladder \
  --sweep-id binary_md_v1
```
This pass is attribution-first: no row becomes the new base during the sweep,
and `screen_only` rows are not benchmark-facing replacements for the anchor.
## Decisions
Use these decisions:
- `keep`: the row is isolated, evidence is at least neutral or improved on the
  task-family primary metric (`final_bpc` when present, otherwise the current
  classification fallback such as `final_log_loss`), and the interpretation
  does not reveal unresolved confounding severe enough to block the signal
- `defer`: evidence is mixed, the row is not isolated enough yet, or the
  introduced degrees of freedom have not been checked adequately
- `reject`: only allowed when the row is isolated, the adequacy plan was
  completed, and the result is clearly worse without offsetting benefit
Underperformance alone is not enough for reject.
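The decision policy, including the guard that underperformance alone never yields `reject`, can be sketched as a small function. The boolean inputs are hypothetical simplifications of the richer queue-row evidence; the three decision labels are the documented ones.

```python
# Hedged sketch of the keep/defer/reject policy. The inputs compress the
# real evidence (isolation, adequacy plan, confounders) into booleans.

def decide(row_is_isolated: bool, adequacy_plan_done: bool,
           primary_delta: float, clearly_worse: bool) -> str:
    """primary_delta: change vs. the anchor on the primary metric.
    Lower is better for final_bpc / final_log_loss, so delta <= 0
    means neutral or improved."""
    if not row_is_isolated or not adequacy_plan_done:
        # reject is only allowed for isolated rows with a completed
        # adequacy plan; otherwise the row stays open.
        return "defer"
    if primary_delta <= 0:
        return "keep"
    # Worse on the primary metric: reject only when clearly worse with no
    # offsetting benefit. Underperformance alone means defer.
    return "reject" if clearly_worse else "defer"
```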
Every `result_card.md` must explain:
- what changed
- what metrics moved versus the anchor
- whether the change was actually isolated
- whether introduced hyperparameters were adequate
- why the change may have helped or hurt
- which confounders remain
- what bounded follow-up is still required, if any
Rows like `delta_row_cls_pool` are the template for this policy. If `row_cls`
underperforms once, that is not evidence against the mechanism by itself. The
result card must discuss whether `tfrow_n_heads`, `tfrow_n_layers`, and
`tfrow_cls_tokens` were likely appropriate before any negative conclusion is
treated as strong evidence.
The same rule applies to data and preprocessing rows. If a data-surface or preprocessing row underperforms, the result card must discuss manifest adequacy, preprocessing adequacy, and any remaining boundary ambiguity before a negative read is treated as strong evidence.