# Agent Program
Use this contract when you are running or reviewing a selected anchor-only
system-delta sweep in tab-foundry.
Use docs/workflows.md for command syntax and artifact expectations. Use this
contract for the objective, the locked comparison surface, the queue
discipline, and the interpretation policy.
Treat this as a research-execution contract, not the architecture roadmap. A
selected sweep may intentionally hold a PFN-adjacent, hybrid diagnostic, or
sandwich-local surface fixed while isolating one question. The long-term
direction for the public model surface still comes from
docs/development/roadmap.md and docs/development/model-architecture.md.
## Overview

Read this contract when you need to know:
- what the selected anchor is
- what is allowed to change in one sweep row
- which artifacts a row must produce
- how to interpret a result without over-claiming
This is not the right page if you only want a general repo overview. Start with
README.md, docs/workflows.md, or
docs/development/model-architecture.md for that broader context.
## Objective
Optimize for attributable evidence against the selected sweep anchor, not for rapid base promotion.
The primary score is `final_bpc` when the selected sweep surface resolves a
sandwich `cell_bpc` lane; otherwise it falls back to the task-family
classification metric such as `final_log_loss`.
When the benchmark family changes, switch the sweep target with it:
- sandwich `cell_bpc` lane: `final_bpc`
- classification fallback: `final_log_loss`
Supporting metrics are:
- sandwich `cell_bpc` lane: `final_bpf`, plus classification diagnostics when
  those are also reported
- binary classification fallback: `final_brier_score`, `final_roc_auc`,
  `best_roc_auc`, `final_minus_best`
- multiclass classification: `final_brier_score`, with ROC AUC retained only
  as a diagnostic when it is reported
- training-time deltas versus the anchor
- manifest and preprocessing surface deltas recorded in
  `training_surface_record.json`
- loss or gradient instability evidence from `train_history.jsonl`,
  `gradient_history.jsonl`, and `telemetry.json`
`best_roc_auc` remains a tie-breaker and diagnostic for classification sweeps,
not the main score, and `final_log_loss` becomes a fallback rather than the
primary benchmark score when `final_bpc` is available.
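The metric-selection policy above can be sketched in Python. This is a minimal illustration, not a tab-foundry API: the function names (`primary_metric`, `tiebreakers`) and the task-family labels are assumptions; only the metric names come from this contract.

```python
# Illustrative sketch of the primary-score and tie-breaker policy described
# above. Nothing here is real tab-foundry code.

def primary_metric(reported_metrics: set[str]) -> str:
    """Pick the sweep's primary score for one benchmarked row."""
    # final_bpc is primary whenever the sandwich cell_bpc lane resolves it.
    if "final_bpc" in reported_metrics:
        return "final_bpc"
    # Otherwise fall back to the task-family classification metric.
    return "final_log_loss"

def tiebreakers(task_family: str) -> list[str]:
    """Supporting diagnostics per task family (never the main score)."""
    if task_family == "binary_classification":
        return ["final_brier_score", "final_roc_auc",
                "best_roc_auc", "final_minus_best"]
    if task_family == "multiclass_classification":
        # ROC AUC is retained only as a diagnostic when reported.
        return ["final_brier_score"]
    # Sandwich cell_bpc lane.
    return ["final_bpf"]
```

The key point the sketch encodes is precedence: `final_log_loss` is only the primary score when `final_bpc` is absent, and ROC-AUC metrics never outrank either.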
## Locked Anchor Surface
Hold this surface fixed unless the queue row explicitly declares a different dimension family:
- selected sweep metadata: `reference/system_delta_sweeps/<sweep_id>/sweep.yaml`
- selected canonical queue: `reference/system_delta_sweeps/<sweep_id>/queue.yaml`
- selected canonical matrix: `reference/system_delta_sweeps/<sweep_id>/matrix.md`
- selected anchor run id: `anchor_run_id` from the chosen `sweep.yaml`
- canonical benchmark bundle: `benchmark_bundle_path` from the chosen `sweep.yaml`
- canonical control baseline id: `control_baseline_id` from the chosen `sweep.yaml`
- canonical benchmark registry: `src/tab_foundry/bench/benchmark_run_registry_v1.json`
- delta catalog: `reference/system_delta_catalog.yaml`
- sweep index: `reference/system_delta_sweeps/index.yaml`
- research template: `reference/system_delta_campaign_template.md`
- research sources: `reference/stage_research_sources.yaml`
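As a hedged sketch, the locked surface for one sweep can be resolved like this. The `anchor_surface` helper is hypothetical; the repository paths and the `sweep.yaml` keys (`anchor_run_id`, `benchmark_bundle_path`, `control_baseline_id`) are the ones named in this contract, and the sweep metadata is assumed to arrive already parsed into a dict.

```python
# Hypothetical resolver for the locked anchor surface of one sweep id.
# Only the paths and key names are documented; the function is illustrative.
from pathlib import Path

def anchor_surface(sweep_id: str, sweep_meta: dict) -> dict:
    """Collect the locked comparison surface for one selected sweep."""
    root = Path("reference/system_delta_sweeps") / sweep_id
    return {
        "sweep": root / "sweep.yaml",
        "queue": root / "queue.yaml",
        "matrix": root / "matrix.md",
        # These three come from the chosen sweep.yaml and stay invariant.
        "anchor_run_id": sweep_meta["anchor_run_id"],
        "benchmark_bundle_path": sweep_meta["benchmark_bundle_path"],
        "control_baseline_id": sweep_meta["control_baseline_id"],
        "registry": Path("src/tab_foundry/bench/benchmark_run_registry_v1.json"),
    }
```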
Keep these invariant by default:
- benchmark bundle path
- control baseline id
- the queue-declared `training_experiment`, `training_config_profile`, and
  `surface_role`
- PFN control lane
- hybrid diagnostic lane
- canonical architecture-screen surface
- history, checkpoint, benchmark, and `training_surface_record.json` artifact
  contracts
The benchmark registry is the historical system of record. Registry-resolved
`outputs/staged_ladder/...` artifact paths are convenience runtime references
for local workspaces; they may be absent in a fresh clone or CI checkout.
Resolve canonical identity through
`src/tab_foundry/bench/benchmark_run_registry_v1.json`.
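A lookup against the registry might look like the sketch below. The registry schema shown here (a top-level `runs` list with `run_id` and `artifact_path` keys) is an assumption for illustration; only the registry file path and the tolerance for absent `outputs/staged_ladder/...` artifacts come from this contract.

```python
# Sketch only: the registry schema ("runs", "run_id", "artifact_path") is
# assumed, not documented. Identity comes from the registry; the runtime
# artifact path is a convenience that may not exist in a fresh checkout.
import json
from pathlib import Path

REGISTRY_PATH = Path("src/tab_foundry/bench/benchmark_run_registry_v1.json")

def resolve_run(run_id: str, registry: dict) -> dict:
    """Resolve canonical run identity, tolerating missing local artifacts."""
    for entry in registry.get("runs", []):
        if entry.get("run_id") == run_id:
            entry = dict(entry)  # do not mutate the registry in place
            artifact = entry.get("artifact_path")
            # Absent artifacts are expected in fresh clones or CI checkouts.
            entry["artifact_present"] = bool(artifact) and Path(artifact).exists()
            return entry
    raise KeyError(f"run {run_id!r} not found in {REGISTRY_PATH}")
```

Callers would load the registry with `json.loads(REGISTRY_PATH.read_text())`; the point of the sketch is that a missing `artifact_path` downgrades gracefully instead of invalidating the registry entry.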
## Dimension Families
This workflow is not architecture-only. Every queue row must isolate exactly one declared dimension family against the anchor:
- model
- training
- data
- preprocessing
Examples of valid dimensions include:
- module selection inside `tabfoundry_staged`
- training data source and manifest root
- dagzoo provenance for a manifest-backed surface
- runtime preprocessing and encoding policy
Any mechanism, data, or preprocessing candidate is allowed as long as the row states the exact preserved settings and the exact changed settings.
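The one-family rule above is mechanical enough to check in code. A minimal sketch, assuming a queue row is a dict whose family keys hold the declared overrides (the row shape is hypothetical; the four family names are from this contract):

```python
# Illustrative check that a queue row isolates exactly one dimension family.
# The row layout (family name -> list of overrides) is an assumption.
DIMENSION_FAMILIES = {"model", "training", "data", "preprocessing"}

def active_family(row: dict) -> str:
    """Return the single declared family, or fail the row."""
    declared = [f for f in sorted(DIMENSION_FAMILIES) if row.get(f)]
    if len(declared) != 1:
        raise ValueError(
            f"row must isolate exactly one dimension family, got {declared}"
        )
    return declared[0]
```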
## Queue And Matrix
The canonical source-of-truth hierarchy is:
- `reference/system_delta_catalog.yaml`
- `reference/system_delta_sweeps/index.yaml`
- `reference/system_delta_sweeps/<sweep_id>/queue.yaml`
- `reference/system_delta_sweeps/<sweep_id>/matrix.md`
There is no repo-global active sweep and no generated top-level queue or
matrix alias. Use explicit `--sweep-id` selection instead.
The queue must carry, at minimum:
- `order`, `delta_ref`, `status`
- `description`, `rationale`, `hypothesis`
- model or data or preprocessing labels and the one active override family
- `parameter_adequacy_plan`
- `execution_policy`
- `run_id`, `followup_run_ids`
- `decision`, `interpretation_status`, `confounders`, `next_action`, `notes`
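A field-presence check over that minimum set can be sketched as below. The validator function is hypothetical; the field names are the ones this contract requires.

```python
# Illustrative validator for the minimum queue-row fields named above.
# Not a tab-foundry API; it only checks presence, not values.
REQUIRED_QUEUE_FIELDS = (
    "order", "delta_ref", "status",
    "description", "rationale", "hypothesis",
    "parameter_adequacy_plan", "execution_policy",
    "run_id", "followup_run_ids",
    "decision", "interpretation_status", "confounders", "next_action", "notes",
)

def missing_queue_fields(row: dict) -> list[str]:
    """Return required fields absent from one parsed queue row."""
    return [field for field in REQUIRED_QUEUE_FIELDS if field not in row]
```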
The matrix must be rerendered from the selected sweep plus the canonical benchmark registry. Metrics belong in the registry, not duplicated in the queue.
Use `tab-foundry research sweep` to:
- create a new sweep with either a parent sweep or an explicit training surface
- list rows in order for one explicit `--sweep-id`
- print the next `ready` row for one explicit `--sweep-id`
- validate completed rows for one explicit `--sweep-id`
- render `reference/system_delta_sweeps/<sweep_id>/matrix.md`
Every benchmark-facing run belongs to exactly one sweep_id. New complexity
passes should create a new sweep instead of mutating an old completed one.
## Required Research Package
Before any empirical run for a queue row, create:
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/research_card.md`
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/campaign.yaml`
After a `benchmark_full` row is benchmarked and registered, also create:
- `outputs/staged_ladder/research/<sweep_id>/<delta_id>/result_card.md`
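The package layout above can be expressed as a small path helper. The function is illustrative; the directory pattern and file names are the documented ones.

```python
# Hypothetical helper that materializes the documented research-package
# layout for one queue row. Paths follow the contract; the helper does not.
from pathlib import Path

def research_package(sweep_id: str, delta_id: str,
                     benchmark_full: bool) -> list[Path]:
    """List the artifacts one row's research package must contain."""
    root = Path("outputs/staged_ladder/research") / sweep_id / delta_id
    files = [root / "research_card.md", root / "campaign.yaml"]
    if benchmark_full:
        # result_card.md is required only after a benchmark_full row is
        # benchmarked and registered.
        files.append(root / "result_card.md")
    return files
```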
Use `reference/system_delta_campaign_template.md` and
`reference/stage_research_sources.yaml`.
Agents should use optional sibling-workspace sources when available, but must be able to proceed from the required repo-local sources alone.
Benchmark-facing conclusions must cite the locked bundle path,
`cls_benchmark_linear_v2`, `training_surface_record.json`, `research_card.md`,
`campaign.yaml`, and `result_card.md`. Evidence collected only on the hybrid
diagnostic lane may guide diagnosis, but it is not by itself benchmark-facing
promotion evidence for the architecture-screen surface.
Every completed run must have a `training_surface_record.json` artifact. That
record is the system-surface evidence source for:
- model surface labels and effective module selections
- surfaced subsystem hyperparameters
- data source and manifest fingerprint
- dagzoo provenance references when applicable
- dataset-characteristic summaries
- preprocessing surface labels and explicit overrides
Queue reruns for instability debugging must also produce:
- `train_history.jsonl` as the canonical scalar timeline
- `gradient_history.jsonl` with module-level gradient traces
- `telemetry.json` with run summary, artifact pointers, checkpoint snapshots,
  missingness diagnostics, and failure context
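A completeness check over those required artifacts can be sketched as follows. The assumption that all four files sit at the run directory root is mine; the file names are from this contract.

```python
# Illustrative artifact-completeness check for one rerun directory.
# Layout assumption: all required artifacts live at the run root.
from pathlib import Path

REQUIRED_ARTIFACTS = (
    "training_surface_record.json",
    "train_history.jsonl",
    "gradient_history.jsonl",
    "telemetry.json",
)

def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required instability artifacts absent from run_dir."""
    return [name for name in REQUIRED_ARTIFACTS
            if not (run_dir / name).exists()]
```

A rerun row would only count as complete when this returns an empty list.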
Treat the completed first-pass `binary_md_v1` outputs under
`outputs/staged_ladder/sd_binary_md_v1_*` as read-only baseline evidence. Do
not overwrite those run directories when adding instability instrumentation;
use fresh rerun roots such as `outputs/staged_ladder/<run_id>_diag_v1/train`.
Historical runs can only be audited from their scalar histories; true
module-level traces exist only for new reruns.
## Execution Loop
For each queue row:
- Select the sweep explicitly with `--sweep-id` and load
  `reference/system_delta_sweeps/<sweep_id>/sweep.yaml`.
- Select the next `status=ready` row from
  `reference/system_delta_sweeps/<sweep_id>/queue.yaml`.
- Write or update `research_card.md` and `campaign.yaml`.
- Train on the locked anchor surface, changing only the declared dimension.
- Ensure the run has `training_surface_record.json`, `gradient_history.jsonl`,
  and `telemetry.json`.
- If `execution_policy=screen_only`, stop after recording screen metrics in
  the queue and rerendering the matrix; skip benchmark registration, do not
  write `result_card.md`, and treat the row as diagnostic only.
- If `execution_policy=benchmark_full`, benchmark on the bundle declared by
  the selected sweep metadata.
- If `execution_policy=benchmark_full`, register the benchmark-facing run in
  `src/tab_foundry/bench/benchmark_run_registry_v1.json`, including its
  `sweep_id`.
- If `execution_policy=benchmark_full`, write `result_card.md`.
- Rerender `reference/system_delta_sweeps/<sweep_id>/matrix.md`.
- Update the queue row status, run ids, interpretation, and next action.
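The per-row loop can be sketched as orchestration code. Every operation here is a caller-supplied placeholder; none of these names are real tab-foundry functions. The sketch exists to show the branch structure: `screen_only` rows skip registration and the result card entirely.

```python
# Illustrative driver for one queue row. The "ops" callables are injected
# placeholders, not tab-foundry APIs; only the step order and the
# execution_policy branch come from this contract.

def run_row(row: dict, ops: dict) -> str:
    """Drive one queue row through the documented steps."""
    ops["write_research_package"](row)      # research_card.md + campaign.yaml
    ops["train_on_anchor_surface"](row)     # only the declared dimension changes
    ops["check_artifacts"](row)             # surface record, gradients, telemetry
    if row["execution_policy"] == "screen_only":
        # Diagnostic only: no registration, no result_card.md.
        ops["record_screen_metrics"](row)
    else:  # benchmark_full
        ops["benchmark_on_locked_bundle"](row)
        ops["register_run"](row)            # registry entry includes sweep_id
        ops["write_result_card"](row)
    ops["rerender_matrix"](row)
    ops["update_queue_row"](row)
    return row["execution_policy"]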
To rank the existing first-pass `binary_md_v1` outputs before rerunning,
generate the scalar instability audit report under
`outputs/staged_ladder/reports/` with:

```sh
.venv/bin/python scripts/bench/instability_audit.py \
  --staged-ladder-root outputs/staged_ladder \
  --sweep-id binary_md_v1
```
This pass is attribution-first: no row becomes the new base during the sweep,
and `screen_only` rows are not benchmark-facing replacements for the anchor.
## Decisions
Use these decisions:
- `keep`: the row is isolated, evidence is at least neutral or improved on the
  task-family primary metric (`final_bpc` when present, otherwise the current
  classification fallback such as `final_log_loss`), and the interpretation
  does not reveal unresolved confounding severe enough to block the signal
- `defer`: evidence is mixed, the row is not isolated enough yet, or the
  introduced degrees of freedom have not been checked adequately
- `reject`: only allowed when the row is isolated, the adequacy plan was
  completed, and the result is clearly worse without offsetting benefit
Underperformance alone is not enough for reject.
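The decision policy, including the guard that underperformance alone never yields `reject`, can be sketched as a small function. The boolean inputs are hypothetical simplifications of the richer queue-row evidence; the three decision labels are the documented ones.

```python
# Hedged sketch of the keep/defer/reject policy. The inputs compress the
# real evidence (isolation, adequacy plan, confounders) into booleans.

def decide(row_is_isolated: bool, adequacy_plan_done: bool,
           primary_delta: float, clearly_worse: bool) -> str:
    """primary_delta: change vs. the anchor on the primary metric.
    Lower is better for final_bpc / final_log_loss, so delta <= 0
    means neutral or improved."""
    if not row_is_isolated or not adequacy_plan_done:
        # reject is only allowed for isolated rows with a completed
        # adequacy plan; otherwise the row stays open.
        return "defer"
    if primary_delta <= 0:
        return "keep"
    # Worse on the primary metric: reject only when clearly worse with no
    # offsetting benefit. Underperformance alone means defer.
    return "reject" if clearly_worse else "defer"
```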
Every `result_card.md` must explain:
- what changed
- what metrics moved versus the anchor
- whether the change was actually isolated
- whether introduced hyperparameters were adequate
- why the change may have helped or hurt
- which confounders remain
- what bounded follow-up is still required, if any
Rows like `delta_row_cls_pool` are the template for this policy. If `row_cls`
underperforms once, that is not evidence against the mechanism by itself. The
result card must discuss whether `tfrow_n_heads`, `tfrow_n_layers`, and
`tfrow_cls_tokens` were likely appropriate before any negative conclusion is
treated as strong evidence.
The same rule applies to data and preprocessing rows. If a data-surface or preprocessing row underperforms, the result card must discuss manifest adequacy, preprocessing adequacy, and any remaining boundary ambiguity before a negative read is treated as strong evidence.