Agent Program

Rules for a selected system-delta sweep: locked surface, allowed changes, and required artifacts.

Use this contract when you are running or reviewing a selected anchor-only system-delta sweep in tab-foundry.

Use docs/workflows.md for command syntax and artifact expectations. Use this contract for the objective, the locked comparison surface, the queue discipline, and the interpretation policy.

Treat this as a research-execution contract, not the architecture roadmap. A selected sweep may intentionally hold a PFN-adjacent, hybrid diagnostic, or sandwich-local surface fixed while isolating one question. The long-term direction for the public model surface still comes from docs/development/roadmap.md and docs/development/model-architecture.md.

Overview

Read this contract when you need to know:

  • what the selected anchor is
  • what is allowed to change in one sweep row
  • which artifacts a row must produce
  • how to interpret a result without over-claiming

This is not the right page if you only want a general repo overview. Start with README.md, docs/workflows.md, or docs/development/model-architecture.md for that broader context.

Objective

Optimize for attributable evidence against the selected sweep anchor, not for rapid base promotion.

The primary score is final_bpc when the selected sweep surface resolves a sandwich cell_bpc lane; otherwise it falls back to the task-family classification metric such as final_log_loss.

When the benchmark family changes, switch the sweep target with it:

  • sandwich cell_bpc lane: final_bpc
  • classification fallback: final_log_loss

Supporting metrics are:

  • sandwich cell_bpc lane: final_bpf, plus classification diagnostics when those are also reported
  • binary classification fallback: final_brier_score, final_roc_auc, best_roc_auc, final_minus_best
  • multiclass classification: final_brier_score, with ROC AUC retained only as a diagnostic when it is reported
  • training-time deltas versus the anchor
  • manifest and preprocessing surface deltas recorded in training_surface_record.json
  • loss or gradient instability evidence from train_history.jsonl, gradient_history.jsonl, and telemetry.json

best_roc_auc remains a tie-breaker and diagnostic for classification sweeps, not the main score. Likewise, whenever final_bpc is available, final_log_loss becomes a fallback rather than the primary benchmark score.
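The metric policy above can be sketched as a small selector. The metric names come from this contract; the function itself is illustrative and not part of the tab-foundry API:

```python
def select_primary_metric(available: set[str]) -> str:
    """Pick the sweep's primary score from the metrics a lane reports.

    final_bpc wins whenever the sandwich cell_bpc lane resolves it;
    final_log_loss is the classification fallback. best_roc_auc is
    deliberately excluded: it stays a tie-breaker, never the main score.
    """
    if "final_bpc" in available:
        return "final_bpc"
    if "final_log_loss" in available:
        return "final_log_loss"
    raise ValueError("no recognized primary metric in lane output")

# Sandwich cell_bpc lane: final_bpc outranks its supporting metrics.
assert select_primary_metric({"final_bpc", "final_bpf"}) == "final_bpc"
# Classification fallback lane: best_roc_auc never becomes the target.
assert select_primary_metric({"final_log_loss", "best_roc_auc"}) == "final_log_loss"
```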

Locked Anchor Surface

Hold this surface fixed unless the queue row explicitly declares a different dimension family:

  • selected sweep metadata: reference/system_delta_sweeps/<sweep_id>/sweep.yaml
  • selected canonical queue: reference/system_delta_sweeps/<sweep_id>/queue.yaml
  • selected canonical matrix: reference/system_delta_sweeps/<sweep_id>/matrix.md
  • selected anchor run id: anchor_run_id from the chosen sweep.yaml
  • canonical benchmark bundle: benchmark_bundle_path from the chosen sweep.yaml
  • canonical control baseline id: control_baseline_id from the chosen sweep.yaml
  • canonical benchmark registry: src/tab_foundry/bench/benchmark_run_registry_v1.json
  • delta catalog: reference/system_delta_catalog.yaml
  • sweep index: reference/system_delta_sweeps/index.yaml
  • research template: reference/system_delta_campaign_template.md
  • research sources: reference/stage_research_sources.yaml

Keep these invariant by default:

  • benchmark bundle path
  • control baseline id
  • the queue-declared training_experiment, training_config_profile, and surface_role
  • PFN control lane
  • hybrid diagnostic lane
  • canonical architecture-screen surface
  • history, checkpoint, benchmark, and training_surface_record.json artifact contracts

The benchmark registry is the historical system of record.

Registry-resolved outputs/staged_ladder/... artifact paths are convenience runtime references for local workspaces. They may be absent in a fresh clone or CI checkout. Resolve canonical identity through src/tab_foundry/bench/benchmark_run_registry_v1.json.
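Registry-first identity resolution can be sketched as follows. The registry shape here (a top-level "runs" map keyed by run id) is purely an assumption for illustration; the real schema of benchmark_run_registry_v1.json may differ:

```python
import json

# Illustrative stand-in for src/tab_foundry/bench/benchmark_run_registry_v1.json.
registry_text = """
{
  "runs": {
    "run_abc": {"sweep_id": "binary_md_v1",
                "artifact_path": "outputs/staged_ladder/run_abc"}
  }
}
"""
registry = json.loads(registry_text)

def resolve_run(run_id: str) -> dict:
    """Resolve canonical identity through the registry, never through
    outputs/staged_ladder/... paths, which may be absent in a fresh clone."""
    try:
        return registry["runs"][run_id]
    except KeyError:
        raise KeyError(f"{run_id} not registered; workspace paths alone are not evidence")

record = resolve_run("run_abc")
assert record["sweep_id"] == "binary_md_v1"
```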

Dimension Families

This workflow is not architecture-only. Every queue row must isolate exactly one declared dimension family against the anchor:

  • model
  • training
  • data
  • preprocessing

Examples of valid dimensions include:

  • module selection inside tabfoundry_staged
  • training data source and manifest root
  • dagzoo provenance for a manifest-backed surface
  • runtime preprocessing and encoding policy

Any mechanism, data, or preprocessing candidate is allowed as long as the row states the exact preserved settings and the exact changed settings.
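The single-family isolation rule can be sketched mechanically. The row shape used here, with changed settings grouped by family, is an assumed illustrative layout, not the real queue schema:

```python
DIMENSION_FAMILIES = {"model", "training", "data", "preprocessing"}

def check_isolation(row: dict) -> str:
    """Return the one dimension family a queue row changes, or fail loudly."""
    changed = {fam for fam, settings in row["changed"].items() if settings}
    unknown = changed - DIMENSION_FAMILIES
    if unknown:
        raise ValueError(f"unknown dimension families: {sorted(unknown)}")
    if len(changed) != 1:
        raise ValueError(f"row must change exactly one family, got {sorted(changed)}")
    return changed.pop()

# A valid row: one model-family override, everything else preserved.
row = {"changed": {"model": {"tfrow_n_layers": 4}, "data": {}}}
assert check_isolation(row) == "model"
```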

Queue And Matrix

The canonical source-of-truth hierarchy is:

  • reference/system_delta_catalog.yaml
  • reference/system_delta_sweeps/index.yaml
  • reference/system_delta_sweeps/<sweep_id>/queue.yaml
  • reference/system_delta_sweeps/<sweep_id>/matrix.md

There is no repo-global active sweep or generated top-level queue or matrix alias. Use explicit --sweep-id selection instead.

The queue must carry, at minimum:

  • order, delta_ref, status
  • description, rationale, hypothesis
  • model, data, or preprocessing labels, and the one active override family
  • parameter_adequacy_plan
  • execution_policy
  • run_id, followup_run_ids
  • decision, interpretation_status, confounders, next_action, notes
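The minimum field list can be checked mechanically. This sketch uses the field names above but assumes a flat-dict row shape; the dimension-family labels are left out for brevity:

```python
REQUIRED_QUEUE_FIELDS = {
    "order", "delta_ref", "status",
    "description", "rationale", "hypothesis",
    "parameter_adequacy_plan", "execution_policy",
    "run_id", "followup_run_ids",
    "decision", "interpretation_status", "confounders", "next_action", "notes",
}

def missing_queue_fields(row: dict) -> set[str]:
    """Fields from the contract's minimum list that the row fails to carry."""
    return REQUIRED_QUEUE_FIELDS - row.keys()

# An under-specified row is caught before it reaches execution.
row = {"order": 3, "delta_ref": "delta_row_cls_pool", "status": "ready"}
missing = missing_queue_fields(row)
assert "hypothesis" in missing and "execution_policy" in missing
```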

The matrix must be rerendered from the selected sweep plus the canonical benchmark registry. Metrics belong in the registry, not duplicated in the queue.

Use tab-foundry research sweep to:

  • create a new sweep with either a parent sweep or an explicit training surface
  • list rows in order for one explicit --sweep-id
  • print the next ready row for one explicit --sweep-id
  • validate completed rows for one explicit --sweep-id
  • render reference/system_delta_sweeps/<sweep_id>/matrix.md

Every benchmark-facing run belongs to exactly one sweep_id. New complexity passes should create a new sweep instead of mutating an old completed one.

Required Research Package

Before any empirical run for a queue row, create:

  • outputs/staged_ladder/research/<sweep_id>/<delta_id>/research_card.md
  • outputs/staged_ladder/research/<sweep_id>/<delta_id>/campaign.yaml

After a benchmark_full row is benchmarked and registered, also create:

  • outputs/staged_ladder/research/<sweep_id>/<delta_id>/result_card.md

Use reference/system_delta_campaign_template.md and reference/stage_research_sources.yaml.

Agents should use optional sibling-workspace sources when available, but must be able to proceed from the required repo-local sources alone.

Benchmark-facing conclusions must cite the locked bundle path, cls_benchmark_linear_v2, training_surface_record.json, research_card.md, campaign.yaml, and result_card.md. Evidence collected only on the hybrid diagnostic lane may guide diagnosis, but it is not by itself benchmark-facing promotion evidence for the architecture-screen surface.

Every completed run must have a training_surface_record.json artifact. That record is the system-surface evidence source for:

  • model surface labels and effective module selections
  • surfaced subsystem hyperparameters
  • data source and manifest fingerprint
  • dagzoo provenance references when applicable
  • dataset-characteristic summaries
  • preprocessing surface labels and explicit overrides
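The evidence sections above can be checked the same way. The key names here are assumptions drawn from the contract's list; the actual training_surface_record.json schema may nest or name them differently (dagzoo provenance is omitted because it is conditional):

```python
import json

REQUIRED_SECTIONS = {
    "model_surface_labels", "module_selections",
    "subsystem_hyperparameters",
    "data_source", "manifest_fingerprint",
    "dataset_characteristics",
    "preprocessing_labels", "preprocessing_overrides",
}

def surface_record_gaps(record_text: str) -> set[str]:
    """Evidence sections a training_surface_record.json fails to carry."""
    record = json.loads(record_text)
    return REQUIRED_SECTIONS - record.keys()

complete = json.dumps({k: {} for k in REQUIRED_SECTIONS})
assert surface_record_gaps(complete) == set()
```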

Queue reruns for instability debugging must also produce:

  • train_history.jsonl as the canonical scalar timeline
  • gradient_history.jsonl with module-level gradient traces
  • telemetry.json with run summary, artifact pointers, checkpoint snapshots, missingness diagnostics, and failure context

Treat the completed first-pass binary_md_v1 outputs under outputs/staged_ladder/sd_binary_md_v1_* as read-only baseline evidence. Do not overwrite those run directories when adding instability instrumentation. Use fresh rerun roots such as outputs/staged_ladder/<run_id>_diag_v1/train. Historical runs can only be audited from their scalar histories; true module-level traces only exist for new reruns.
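To illustrate what auditing from scalar histories means in practice, a sketch that flags loss spikes in a train_history.jsonl timeline; the record shape ({"step": ..., "loss": ...}) is an assumption, and real histories may carry more fields:

```python
import json

# Inline sample timeline in the jsonl shape assumed above.
history_jsonl = """\
{"step": 1, "loss": 2.10}
{"step": 2, "loss": 2.05}
{"step": 3, "loss": 9.80}
{"step": 4, "loss": 2.02}
"""

def loss_spikes(lines: str, ratio: float = 2.0) -> list[int]:
    """Steps where loss jumps by more than `ratio`x over the previous step."""
    spikes, prev = [], None
    for line in lines.splitlines():
        rec = json.loads(line)
        if prev is not None and rec["loss"] > ratio * prev:
            spikes.append(rec["step"])
        prev = rec["loss"]
    return spikes

assert loss_spikes(history_jsonl) == [3]
```

This is the limit of what historical runs can show; attributing a spike to a specific module requires the gradient_history.jsonl traces that only fresh reruns produce.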

Execution Loop

For each queue row:

  1. Select the sweep explicitly with --sweep-id and load reference/system_delta_sweeps/<sweep_id>/sweep.yaml.
  2. Select the next status=ready row from reference/system_delta_sweeps/<sweep_id>/queue.yaml.
  3. Write or update research_card.md and campaign.yaml.
  4. Train on the locked anchor surface, changing only the declared dimension.
  5. Ensure the run has training_surface_record.json, gradient_history.jsonl, and telemetry.json.
  6. If execution_policy=screen_only, stop after recording screen metrics in the queue and rerender the matrix; skip benchmark registration, do not write result_card.md, and treat the row as diagnostic only.
  7. If execution_policy=benchmark_full, benchmark on the bundle declared by the selected sweep metadata.
  8. If execution_policy=benchmark_full, register the benchmark-facing run in src/tab_foundry/bench/benchmark_run_registry_v1.json, including its sweep_id.
  9. If execution_policy=benchmark_full, write result_card.md.
  10. Rerender reference/system_delta_sweeps/<sweep_id>/matrix.md.
  11. Update the queue row status, run ids, interpretation, and next action.

To rank the existing first-pass binary_md_v1 outputs before rerunning, generate the scalar instability audit report under outputs/staged_ladder/reports/ with:

.venv/bin/python scripts/bench/instability_audit.py \
  --staged-ladder-root outputs/staged_ladder \
  --sweep-id binary_md_v1

This pass is attribution-first. No row becomes the new base during the sweep. screen_only rows are not benchmark-facing replacements for the anchor.

Decisions

Use these decisions:

  • keep: the row is isolated, evidence on the task-family primary metric (final_bpc when present, otherwise the current classification fallback such as final_log_loss) is neutral or improved, and the interpretation does not reveal unresolved confounding severe enough to block the signal
  • defer: evidence is mixed, the row is not isolated enough yet, or the introduced degrees of freedom have not been checked adequately
  • reject: only allowed when the row is isolated, the adequacy plan was completed, and the result is clearly worse without offsetting benefit

Underperformance alone is not enough for reject.
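The decision policy, including the rule that underperformance alone never justifies reject, can be sketched as follows; the argument shape and the margin are illustrative assumptions:

```python
def classify_decision(*, isolated: bool, adequacy_done: bool,
                      delta_vs_anchor: float, margin: float = 0.05) -> str:
    """Map row evidence to keep/defer/reject.

    delta_vs_anchor is the primary-metric change versus the anchor
    (negative is better for loss-like scores such as final_bpc).
    """
    if not isolated or not adequacy_done:
        return "defer"   # never reject without isolation and a completed adequacy plan
    if delta_vs_anchor <= 0.0:
        return "keep"    # neutral or improved on the primary metric
    if delta_vs_anchor > margin:
        return "reject"  # isolated, adequate, and clearly worse
    return "defer"       # mixed evidence inside the margin

# Underperformance with an unfinished adequacy plan defers, never rejects.
assert classify_decision(isolated=True, adequacy_done=False, delta_vs_anchor=0.3) == "defer"
assert classify_decision(isolated=True, adequacy_done=True, delta_vs_anchor=-0.1) == "keep"
assert classify_decision(isolated=True, adequacy_done=True, delta_vs_anchor=0.3) == "reject"
```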

Every result_card.md must explain:

  • what changed
  • what metrics moved versus the anchor
  • whether the change was actually isolated
  • whether introduced hyperparameters were adequate
  • why the change may have helped or hurt
  • which confounders remain
  • what bounded follow-up is still required, if any

Rows like delta_row_cls_pool are the template for this policy. If row_cls underperforms once, that is not evidence against the mechanism by itself. The result card must discuss whether tfrow_n_heads, tfrow_n_layers, and tfrow_cls_tokens were likely appropriate before any negative conclusion is treated as strong evidence.

The same rule applies to data and preprocessing rows. If a data-surface or preprocessing row underperforms, the result card must discuss manifest adequacy, preprocessing adequacy, and any remaining boundary ambiguity before a negative read is treated as strong evidence.