Literature Evidence Mapping

Map from roadmap claims to papers and repo-local evidence. Use this map when you need to trace a roadmap claim back to its supporting papers and the most important repo-local evidence.

Use these alongside this map:

  • Canonical roadmap: docs/development/roadmap.md
  • Design decisions and repo structure: docs/development/design-decisions.md
  • Architecture reference: docs/development/model-architecture.md
  • Paper index: reference/papers.md
  • Detailed research-epic notes: reference/roadmap_evidence/README.md

Conventions:

  • Confidence: high, medium, or low
  • Lane: roadmap milestone lane supported by this evidence (Now, Next, Later)

Detailed Research Epic Notes

reference/evidence.md remains the compact cross-epic map. Canonical long-form evidence for TF-RD-018 and later research-oriented epics now lives under reference/roadmap_evidence/.

  • TF-RD-018: historical staged-control training-surface closeout and partial adequacy evidence
  • TF-RD-020: harder dagzoo corpus-front evidence and v1 carry-forward decisions
  • TF-RD-021: steering-derived corpus-front evidence and carried-slice handoff
  • TF-RD-021A: fixed-latent sandwich candidate note and closed immediate nanoTabPFN screen
  • TF-RD-021B: hybrid full-cell sandwich successor note plus keep-current-anchor decision and classification-scaling-prep handoff
  • TF-RD-022: runtime and VRAM efficiency evidence plus the measured-policy pre-scaling handoff
  • TF-RD-016: architecture-surface closeout evidence and benchmark-evolution handoff
  • TF-RD-010: benchmark-defined multiclass evolution evidence on the classification-first sandwich target
  • TF-RD-009: classification-first scaling-law design evidence and the runtime plus harder-surface handoff
  • TF-RD-014: missingness robustness evidence and benchmark-ladder framing
  • TF-RD-017: class-imbalance robustness evidence and reporting contract
  • TF-RD-015: deferred regression rebuild evidence after the classification-first scaling program
  • TF-RD-012: inference handoff and later-modality evidence

Evidence-to-Roadmap Mapping

| Source | Key Claim Used | Roadmap IDs | Lane | Confidence |
| --- | --- | --- | --- | --- |
| TabPFN v2 (Nature, 2024) | PFN-style cell-table attention remains the frozen control lineage and the main structural comparison point | TF-RD-001, TF-RD-003, TF-RD-008 | Now | high |
| nanoTabPFN (repo) | Training recipe, model sizing, and benchmark comparison define the practical PFN control lane | TF-RD-001, TF-RD-008 | Now | high |
| TabICLv2 (2602.11139) | The row-first architecture direction is the main external reference: grouped feature embedding, row embedding, row-level ICL, and optional QASS | TF-RD-003, TF-RD-004, TF-RD-005, TF-RD-006, TF-RD-007, TF-RD-008 | Now | high |
| TabICL (2502.05564) | Staged migration and curriculum-style complexity changes should be introduced in a controlled ladder rather than as a single architecture jump | TF-RD-002, TF-RD-003, TF-RD-008 | Now | high |
| Deep Sets (1703.06114) | Permutation-aware row and column structure should remain the default lens for new row-first work | TF-RD-004, TF-RD-005, TF-RD-006, TF-RD-007 | Now | high |
| Set Transformer (1810.00825) | Inducing-point and set-style reasoning are the default reference for column-set work | TF-RD-006, TF-RD-007 | Now | high |
| FT-Transformer (2106.11959) | Per-feature embedding is the baseline tokenization reference that grouped-token work must beat or justify departing from | TF-RD-003, TF-RD-004 | Now | high |
| On Embeddings for Numerical Features in Tabular Deep Learning (2203.05556) | Value embedding choices are a high-leverage architectural lever and should be made on the shared surface, not hidden behind the nano encoder path | TF-RD-003, TF-RD-004 | Now | high |
| A Closer Look at TabPFN v2 (2502.17361) | Architecture robustness claims should be tested structurally and not inferred only from single benchmark endpoints | TF-RD-002, TF-RD-008, TF-RD-009 | Now | medium-high |
| McCandlish et al. (1812.06162) | Useful batch size is governed by diminishing returns rather than monotonic larger-is-better behavior | TF-RD-018, TF-RD-009 | Next | high |
| Shallue et al. (1811.03600) | Batch-size and data-parallel speedups are workload-specific and should be evaluated under matched budgets | TF-RD-018 | Next | high |
| Smith et al. (1711.00489) | Batch size and LR decay interact enough that schedule follow-up should start only after the preferred batch rung is chosen | TF-RD-018 | Next | medium-high |
| Goyal et al. (1706.02677) | Linear LR scaling plus warmup is a useful baseline heuristic, but not a universal Adam-style law | TF-RD-018 | Next | medium |
| Surge Phenomenon (2405.14578) | Adam-style optimal learning rate need not scale linearly with batch size, so LR should be retuned after the batch decision | TF-RD-018, TF-RD-009 | Next | medium-high |
| Nado et al. (2102.06356) | Conservative optimizer baselines should be exhausted before escalating to specialized large-batch optimizers | TF-RD-018 | Next | medium-high |
| Keskar et al. (1609.04836) | Large-batch optimization and generalization risks should be checked empirically rather than assumed away | TF-RD-018 | Next | medium |
| Islamov et al. (2603.21191) | Recent theory and experiments reinforce regime-dependent batch-size gains and diminishing returns under a fixed-budget lens | TF-RD-018 | Next | medium |
| SGDR (1608.03983) | Schedule work remains relevant, but should support the promoted architecture rather than substitute for it | TF-RD-001, TF-RD-009 | Next | medium-high |
| muP (2203.03466) | Hyperparameter transfer across model sizes matters once one coherent row-first anchor exists | TF-RD-009 | Next | high |
| Chinchilla (2203.15556) | Compute-optimal reasoning belongs on the classification-first sandwich target, not on transient hybrid bridge surfaces | TF-RD-009 | Next | high |
| Kaplan et al. (2001.08361) | Scaling trends should be measured on one coherent family rather than across mixed architectural lines | TF-RD-009 | Next | high |
| Power Lines (2505.13738) | Small-scale scaling fits need explicit artifact paths and careful interpretation | TF-RD-009 | Next | medium-high |
| Broken Neural Scaling Laws (2210.14891) | Knees and plateaus must be treated as expected outcomes when the architecture target is still moving | TF-RD-009 | Next | medium |
| SAINT (2106.01342) | Row/column interaction changes should be benchmarked as explicit structural choices rather than imported as defaults | TF-RD-005, TF-RD-006 | Next | medium |
| Perceiver (2103.03206) | Latent bottlenecks remain a later alternative if row/column token count becomes the limiting factor | TF-RD-006, TF-RD-012 | Later | medium |
| EquiTabPFN (2502.06684) | Label-conditioning choices should stay modular so target handling can evolve after the classification-first sandwich family stabilizes | TF-RD-010, TF-RD-012 | Next | medium |
| TabDPT (2410.18164) | Prior/source changes are a later scaling lever and should not displace the core architecture migration prematurely | TF-RD-011, TF-RD-012 | Later | medium |
| Sentence-BERT (D19-1410) | Text-conditioned columns should remain an external-embedding later lane rather than a distraction from the classification-first backbone plan | TF-RD-012 | Later | high |
| nanochat (repo) | Compact-transformer training and residual-layout ideas are valid donors, but only when they preserve the tabular set-structured goal | TF-RD-002, TF-RD-007, TF-RD-009 | Now | high |
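To make the McCandlish et al. diminishing-returns claim concrete: their "simple" gradient noise scale B_simple = tr(Σ) / |G|² (per-example gradient covariance trace over squared mean-gradient norm) predicts roughly where larger batches stop paying off. A minimal illustrative sketch, not repo code, assuming per-example gradients have already been flattened into a matrix:

```python
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """Estimate McCandlish et al.'s simple gradient noise scale
    B_simple = tr(Sigma) / |G|^2 from an (n_examples, n_params)
    matrix of per-example gradients."""
    g = per_example_grads.mean(axis=0)  # mean gradient G
    n = per_example_grads.shape[0]
    # unbiased estimate of tr(Sigma), the per-example gradient covariance trace
    tr_sigma = ((per_example_grads - g) ** 2).sum(axis=1).mean() * n / (n - 1)
    return float(tr_sigma / (g @ g))
```

Batch sizes well below B_simple should scale near-linearly; batch sizes well above it mostly waste compute, which is the fixed-budget lens the TF-RD-018 ladder applies.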

Per-Roadmap Evidence Notes

TF-RD-001: Control Freeze And Experiment Trust

  • External evidence:
    • TabPFN v2 and nanoTabPFN define the PFN-style control lineage that should stay frozen for benchmark trust
  • Repo-local evidence:
    • tabfoundry_simple and stage=nano_exact already serve as the cleanest PFN controls
    • the current large-anchor nano_exact + prenorm + row_cls line is useful diagnostic evidence, but it is structurally hybrid and should not be treated as the promoted target by default
  • Success signal:
    • control-lane claims and target-lane claims are clearly separated in research docs and result interpretation

TF-RD-002: Measurement Surfaces For Architecture Migration

  • External evidence:
    • TabICL and the TabPFN analysis literature both imply that architecture interpretation needs more than final benchmark metrics
    • nanochat is a good recipe reference for clean instrumentation and readable training diagnostics
  • Repo-local evidence:
    • the exact-prior path already emits module gradients, activation traces, and additive telemetry summaries
    • the regular architecture-screen trainer still lacks that parity, so the most decision-critical gap is coverage on cls_benchmark_staged_corpus
  • Success signal:
    • row-first rows can be judged on stage-local stability and quality directly from the regular architecture-screen lane

TF-RD-003: Shared-Surface Unlock

  • External evidence:
    • TabICLv2 assumes the architecture target is already operating on a non-PFN-only embedding surface
    • FT-Transformer and numerical-embedding references support moving feature handling out of the frozen nano path before testing richer structure
  • Repo-local evidence:
    • tokenizer overrides are ineffective under the nano encoder path
    • the staged recipe ladder already treats shared_norm and prenorm_block as the bridge out of the PFN control lane
  • Success signal:
    • TabICL-inspired work happens on a surface where tokenization and later modules are actually active

TF-RD-004: Tokenization Migration

  • External evidence:
    • TabICLv2’s grouped embedding idea is the main directional reference
    • FT-Transformer is the baseline that grouped tokens must justify departing from
    • Deep Sets reinforces that tokenization should preserve a set-structured view of rows and columns
  • Repo-local evidence:
    • grouped tokens already exist as a staged recipe
    • earlier tokenizer experiments on the nano path were not attributable
  • Success signal:
    • grouped tokens are validated as part of the row-first ladder rather than as a disconnected ablation

TF-RD-005: Row-Embedding Unlock

  • External evidence:
    • TabICLv2’s decisive architectural move is to form row embeddings before the final ICL stage
    • SAINT is a useful benchmark-first comparison point for row interaction once tokenization is coherent
  • Repo-local evidence:
    • row_cls_pool exists as a staged step in the intended ladder
    • compact-surface row-CLS evidence was strongly negative, but it was gathered on PFN-adjacent surfaces and should be scoped that way
  • Success signal:
    • the repo reaches clear separate yes/no answers on useful row embeddings and plain row-level context in the intended migration line

TF-RD-006: Column-Set Integration

  • External evidence:
    • Set Transformer is the default reference for column-set reasoning
    • TabICLv2 and related row-first designs make column embedding a first-class stage rather than a late add-on
  • Repo-local evidence:
    • column_set exists as a staged step
    • old TFCol evidence on compact surfaces was stable but near-neutral and expensive
  • Success signal:
    • TFCol is either promoted with explicit value evidence or deferred with clear cost-based reasoning

TF-RD-007: Row-Level Context And QASS Attribution

  • External evidence:
    • TabICLv2 provides the main QASS and row-level ICL reference
    • Deep Sets and Set Transformer imply that QASS should be justified as a structural benefit, not presumed mandatory
    • nanochat supports the repo stance that simpler non-QASS alternatives should stay easy to compare
  • Repo-local evidence:
    • QASS primitives already exist
    • compact-surface QASS rows trained cleanly but did not prove enough value to promote the mechanism
  • Success signal:
    • the repo can separately answer whether row-level context helps and whether QASS helps beyond plain row-level context

TF-RD-008: Coherent Classification Anchor Promotion

  • External evidence:
    • TabICLv2 argues for a coherent row-first architecture, not a permanent hybrid of PFN control pieces and row-first readout patches
    • TabPFN analysis work reinforces that robustness claims should be grounded in coherent model surfaces
  • Repo-local evidence:
    • the staged family contained two final row-first promotion candidates on the missing-permitting large bundle: row_cls + qass + no tfcol and row_cls + qass + tfcol_heads4
    • qass_tfcol_large_missing_validation_v1 closed on a mixed result: the TFCol row improved final Brier and ROC AUC, but its final log loss was slightly worse than the simpler no-TFCol control
    • TF-RD-008 therefore settled on row_cls + qass + no tfcol as the default row-first anchor, with row_cls + qass + tfcol_heads4 retained as a calibration-oriented alternative
  • Success signal:
    • one named default row-first classification anchor now exists and can serve as the target for future architecture and scaling work without erasing the retained calibration-oriented alternative

TF-RD-018: Training-Surface Adequacy On The Promoted Anchor

  • External evidence:
    • McCandlish et al. and Shallue et al. imply that useful batch size is a workload-specific diminishing-returns question, not a monotonic larger-is-better knob
    • Smith et al. and the Adam-style surge paper imply that batch, LR, and schedule should be treated as coupled, but that Adam-family follow-up does not justify a universal linear LR scaling rule
    • Nado et al. support exhausting strong Adam-family baselines before treating specialized large-batch optimizers as the default next step
    • Keskar et al. remains the classic cautionary reference for explicit stability and generalization checks when batch size grows
  • Repo-local evidence:
    • TF-RD-013 settled tf_rd_013_dagzoo_shape_aware_size_medium_v1 as the representative post-008 training-data surface
    • row_first_training_adequacy_v1 completed the first manifest-backed task_batch_size ladder on that medium surface under #109
    • TF-RD-020 now records the adjacent harder dagzoo synthetic front winners before TF-RD-018 resumes optimizer-family follow-up
    • issue #147 now records the canonical TF-RD-020 harder-front ladder plus the matching tf_rd_020_*_v1 corpus recipes
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_018_training_surface_adequacy.md, with the scaling-specific handoff recorded in reference/roadmap_evidence/tf_rd_009_scaling_law_measurement.md
  • Success signal:
    • the repo retains a clear partial closeout record for the staged-control training-surface lane without treating the unfinished LR or clipping work as a blocker for sandwich-first planning
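The batch/LR coupling named in the external evidence above can be seeded with two standard heuristics: Goyal-style linear LR scaling with warmup for SGD, and the more conservative square-root rule often preferred for Adam-family optimizers. A hedged sketch of both (tuning starting points only, not repo policy, since the surge-phenomenon analysis shows Adam's optimum need not follow a fixed power):

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int,
              rule: str = "sqrt") -> float:
    """Heuristic LR seed after a batch-size change.

    'linear' is the Goyal et al. SGD heuristic; 'sqrt' is the
    conservative rule commonly used for Adam-family optimizers.
    Either result should be retuned, not trusted as a law."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

def warmup_lr(target_lr: float, step: int, warmup_steps: int) -> float:
    """Linear warmup from ~0 to target_lr over warmup_steps, as in Goyal et al."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps
```

For example, moving from batch 256 to 1024 seeds 4x the base LR under the linear rule but only 2x under the square-root rule; the ladder then decides empirically which regime the workload is in.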

TF-RD-021: Steering-Derived Dagzoo Corpus Fronts On The Promoted Anchor

  • External evidence:
    • dedicated literature for coverage-steered synthetic harder fronts is not yet curated in this repo
    • the existing TabPFN-v2 analysis note implies that meta-feature sensitivity should be tested structurally rather than inferred from one frozen surface
  • Repo-local evidence:
    • TF-RD-020 closed as historical staged-control harder-front context under #146/#149; it is no longer the active carried classification slice
    • TF-RD-010 now owns the first carried sandwich dagzoo many-class plus missingness slice that steering will attempt to improve
    • dagzoo issue bensonlee5/dagzoo#246 now defines the upstream steering implementation, metadata, and diagnostics lane
    • tab-foundry issues #165 and #167 now own the local steering-derived continuation and first sweep contract
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_021_steering_derived_dagzoo_corpus_fronts.md
  • Success signal:
    • the repo records one explicit keep/defer decision on whether a steering-derived corpus front changes the carried sandwich dagzoo slice

TF-RD-022: Training Runtime And VRAM Efficiency Before Classification Scaling

  • External evidence:
    • dedicated runtime-policy literature is not yet curated in this repo
    • the next sources to curate are PyTorch bf16, activation-checkpointing, and throughput or memory telemetry references for A100-class training
  • Repo-local evidence:
    • deferred issue #58 already tracks runtime and VRAM summary work but has stayed attached to the earlier TF-RD-002 measurement chain
    • epic #168 plus child issues #169, #170, and #171 now give the runtime lane an explicit roadmap home
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_022_training_runtime_vram_efficiency.md
    • sandwich architecture ownership now lives under the historical implementation record #174, active umbrella #178, and completed anchor-retention decision #184; this runtime lane is a dependency surface for later classification work rather than the owner of sandwich planning
    • training telemetry and benchmark-registry artifacts now preserve runtime summaries and regime-budget metadata needed for later runtime-policy and scaling comparisons
  • Success signal:
    • the repo records one explicit kernel/runtime policy with peak-memory and throughput evidence, and later scaling work can inherit it

TF-RD-021A: Fixed-Latent Sandwich Candidate And NanoTabPFN Screen

  • External evidence:
    • Perceiver, Set Transformer, SAINT, and PFN-style tabular references justify latent bottlenecks, set-style aggregation, and train-conditioned tabular reads as relevant design space
  • Repo-local evidence:
    • model.arch=tabfoundry_sandwich now exists as a sandwich architecture family; TF-RD-021A is closed as negative evidence for the earlier summary-bottleneck replay
    • issue #174 records the implementation landing
    • issue #178 now owns long-running sandwich stabilization and iteration
    • issue #179 owned the immediate nanoTabPFN latent-count and width screen and now closes on row 01 as stable-but-underpowered evidence
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_021a_latent_bank_sandwich_prototype.md
  • Success signal:
    • the repo records one explicit closeout for the summary-bottleneck replay and leaves the abandoned latent/width ladder as deferred backlog rather than pretending it is still the next execution target

TF-RD-021B: Hybrid Full-Cell Sandwich Successor, Simplification, And Classification-First Scaling Prep

  • External evidence:
    • PerceiverIO-style query readout and latent-bottleneck references justify reopening the input and readout path after the summary-bottleneck replay underfit
  • Repo-local evidence:
    • successor replay issue #181 now records the first bounded replay for the hybrid full-cell sandwich successor under umbrella issue #178
    • tabfoundry_sandwich now uses a hybrid stage-0 full-cell-plus-summary read, later summary-only repeated stages, and latent-then-full-cell readout
    • the compact hybrid control tf_rd_021b_hybrid_full_cell_compact_prior_v1 is now benchmarked at final ROC AUC 0.7370, final log loss 0.4672, and final Brier 0.3072 on the pinned medium binary bundle without an external comparator
    • child issues #182 and #183 closed the 9-run knob screen and bounded width or head follow-up, while #184 completed the four-row removal-first package and kept the compact hybrid control as the carry-forward parent
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_021b_hybrid_full_cell_sandwich_successor.md
  • Success signal:
    • the repo records the bounded TF-RD-021B simplification package, explicitly keeps the compact hybrid control as the sandwich parent, and carries that parent onto harder classification surfaces before any single-toggle scaling recipe is considered
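For reference, the final-metric trio quoted above (log loss, Brier, ROC AUC) is straightforward to pin down for the binary case. An illustrative sketch, not the repo's benchmark code; note that Brier conventions vary (some reports sum the squared error over both classes, doubling the single-probability value computed here):

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of the positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes
    (single-probability binary form)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)
```

Log loss penalizes confident mistakes much harder than Brier does, which is why the TF-RD-008 closeout could see a row improve Brier while slightly worsening log loss.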

TF-RD-009: Scaling-Law Design And Measurement On The Classification-First Sandwich Target

  • External evidence:
    • Chinchilla, Kaplan, Power Lines, and Broken Neural Scaling Laws all require a stable family and comparable artifacts
    • muP is the strongest width-transfer prior, while newer width-depth work should be treated as theory-informed but still empirical for the sandwich family
    • the TF-RD-018 batch/LR literature and newer optimizer-budget scaling work remain useful historical context, but the active scaling path now depends on sandwich-specific dagzoo, steering, and runtime decisions
  • Repo-local evidence:
    • tuning and comparison tooling are already present
    • runtime-summary and regime-budget artifacts now exist, but scaling-law artifacts are not yet canonical on the carried dagzoo classification slice
    • TF-RD-018 is now historical closeout evidence only and is no longer a blocker for the first sandwich scaling fit
    • TF-RD-022 must still hand back one measured runtime policy
    • TF-RD-021 must still hand back one keep/defer steering decision on the carried dagzoo slice
    • the completed TF-RD-021B keep-current-anchor decision under #184 is the precursor for the hybrid sandwich family and does not close TF-RD-009 by itself
    • the canonical long-form note now lives in reference/roadmap_evidence/tf_rd_009_scaling_law_measurement.md
  • Success signal:
    • the repo fits classification scaling laws on the simplified sandwich family under one harder dagzoo slice, one inherited runtime policy, and one matched regime-budget contract
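At its simplest, the "one coherent family, comparable artifacts" requirement above reduces to fitting a power law across the family's sizes. A minimal illustrative sketch (not repo tooling), assuming loss-versus-parameter-count pairs have already been extracted under matched budgets:

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit L(N) ~= a * N^(-b) by least squares in log-log space; returns (a, b).

    This ignores any irreducible-loss term (as in Chinchilla's
    L = E + A/N^alpha + B/D^beta); with a suspected plateau or knee,
    fit that term explicitly or restrict to the pre-plateau regime."""
    logN = np.log(np.asarray(n_params, dtype=float))
    logL = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(logN, logL, 1)
    return float(np.exp(intercept)), float(-slope)
```

This is also why Broken Neural Scaling Laws is in the table: a single log-log line is only trustworthy once the family is stable enough that knees are signal rather than architecture churn.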

TF-RD-010: Benchmark-Defined Multiclass Evolution On The Classification-First Sandwich Target

  • External evidence:
    • label-conditioning work such as EquiTabPFN matters more after the backbone is coherent
  • Repo-local evidence:
    • dagzoo is now the explicit owner of the synthetic training fronts
    • tab-realdata-hub issue #1 is now the upstream owner of the medium and large classification validation bundles and manifest materialization flow
    • the evolved sandwich benchmark config fixes FiLM, sandwich_summary_tokens_per_axis=3, many_class_base=10, and a direct multiclass head
    • TF-RD-010 now has explicit draft medium and large sweep contracts rather than only a conceptual carried slice
  • Success signal:
    • the classification benchmark contract is evaluated through a fixed dagzoo-to-hub linkage on the evolved sandwich family

TF-RD-011: Repo-Wide Enablers And Contract Fidelity

  • External evidence:
    • TabDPT and related scaling work justify keeping source/prior flexibility in view, but not ahead of the architecture target
  • Repo-local evidence:
    • manifest-backed training, shared preprocessing work, and v3 export foundations already exist
    • corpus provenance and end-to-end contract fidelity still need work
  • Success signal:
    • the architecture program can rely on data and contract surfaces that are trustworthy enough for promotion decisions

TF-RD-012: Inference Handoff And Later Modalities

  • External evidence:
    • text-conditioning references and later modality papers belong to a later lane once the classification anchor is stable
  • Repo-local evidence:
    • regression is intentionally removed today
    • runtime handoff remains partial and should follow the classification-first sandwich base
  • Success signal:
    • later prediction modes and downstream handoff build on the classification-first sandwich base instead of competing with the main architecture program

Evidence Limits And Assumptions

  • This is a planning artifact, not a reproduction benchmark.
  • The roadmap is TabICLv2-inspired, not a literal TabICLv2 parity plan.
  • Repo-local negative evidence for row-CLS, TFCol, and QASS should be scoped to the surfaces on which it was gathered, especially the compact nano_exact line.
  • PFN-style controls remain necessary even if the long-term target becomes more row-first.
  • Sequence-order-centric mechanisms remain low priority unless a tabular justification appears.