Literature Evidence Mapping
Map from roadmap claims to papers and repo-local evidence. Use this map to trace a roadmap claim back to its supporting papers and the most important repo-local evidence.
Use these alongside this map:

- Canonical roadmap: `docs/development/roadmap.md`
- Design decisions and repo structure: `docs/development/design-decisions.md`
- Architecture reference: `docs/development/model-architecture.md`
- Paper index: `reference/papers.md`
- Detailed research-epic notes: `reference/roadmap_evidence/README.md`
Conventions:

- Confidence: `high`, `medium-high`, `medium`, or `low`
- Lane: roadmap milestone lane supported by this evidence (`Now`, `Next`, `Later`)
Detailed Research Epic Notes
`reference/evidence.md` remains the compact cross-epic map. Canonical long-form evidence for TF-RD-018 and later research-oriented epics now lives under `reference/roadmap_evidence/`.

- TF-RD-018: historical staged-control training-surface closeout and partial adequacy evidence
- TF-RD-020: harder dagzoo corpus-front evidence and v1 carry-forward decisions
- TF-RD-021: steering-derived corpus-front evidence and carried-slice handoff
- TF-RD-021A: fixed-latent sandwich candidate note and closed immediate nanoTabPFN screen
- TF-RD-021B: hybrid full-cell sandwich successor note plus keep-current-anchor decision and classification-scaling-prep handoff
- TF-RD-022: runtime and VRAM efficiency evidence plus the measured-policy pre-scaling handoff
- TF-RD-016: architecture-surface closeout evidence and benchmark-evolution handoff
- TF-RD-010: benchmark-defined multiclass evolution evidence on the classification-first sandwich target
- TF-RD-009: classification-first scaling-law design evidence and the runtime plus harder-surface handoff
- TF-RD-014: missingness robustness evidence and benchmark-ladder framing
- TF-RD-017: class-imbalance robustness evidence and reporting contract
- TF-RD-015: deferred regression rebuild evidence after the classification-first scaling program
- TF-RD-012: inference handoff and later-modality evidence
Evidence-to-Roadmap Mapping
| Source | Key Claim Used | Roadmap IDs | Lane | Confidence |
|---|---|---|---|---|
| TabPFN v2 (Nature, 2024) | PFN-style cell-table attention remains the frozen control lineage and the main structural comparison point | TF-RD-001, TF-RD-003, TF-RD-008 | Now | high |
| nanoTabPFN (repo) | Training recipe, model sizing, and benchmark comparison define the practical PFN control lane | TF-RD-001, TF-RD-008 | Now | high |
| TabICLv2 (2602.11139) | The row-first architecture direction is the main external reference: grouped feature embedding, row embedding, row-level ICL, and optional QASS | TF-RD-003, TF-RD-004, TF-RD-005, TF-RD-006, TF-RD-007, TF-RD-008 | Now | high |
| TabICL (2502.05564) | Staged migration and curriculum-style complexity changes should be introduced in a controlled ladder rather than as a single architecture jump | TF-RD-002, TF-RD-003, TF-RD-008 | Now | high |
| Deep Sets (1703.06114) | Permutation-aware row and column structure should remain the default lens for new row-first work | TF-RD-004, TF-RD-005, TF-RD-006, TF-RD-007 | Now | high |
| Set Transformer (1810.00825) | Inducing-point and set-style reasoning are the default reference for column-set work | TF-RD-006, TF-RD-007 | Now | high |
| FT-Transformer (2106.11959) | Per-feature embedding is the baseline tokenization reference that grouped-token work must beat or justify departing from | TF-RD-003, TF-RD-004 | Now | high |
| On Embeddings for Numerical Features in Tabular Deep Learning (2203.05556) | Value embedding choices are a high-leverage architectural lever and should be made on the shared surface, not hidden behind the nano encoder path | TF-RD-003, TF-RD-004 | Now | high |
| A Closer Look at TabPFN v2 (2502.17361) | Architecture robustness claims should be tested structurally and not inferred only from single benchmark endpoints | TF-RD-002, TF-RD-008, TF-RD-009 | Now | medium-high |
| McCandlish et al. (1812.06162) | Useful batch size is governed by diminishing returns rather than monotonic larger-is-better behavior | TF-RD-018, TF-RD-009 | Next | high |
| Shallue et al. (1811.03600) | Batch-size and data-parallel speedups are workload-specific and should be evaluated under matched budgets | TF-RD-018 | Next | high |
| Smith et al. (1711.00489) | Batch size and LR decay interact enough that schedule follow-up should start only after the preferred batch rung is chosen | TF-RD-018 | Next | medium-high |
| Goyal et al. (1706.02677) | Linear LR scaling plus warmup is a useful baseline heuristic, but not a universal Adam-style law | TF-RD-018 | Next | medium |
| Surge Phenomenon (2405.14578) | Adam-style optimal learning rate need not scale linearly with batch size, so LR should be retuned after the batch decision | TF-RD-018, TF-RD-009 | Next | medium-high |
| Nado et al. (2102.06356) | Conservative optimizer baselines should be exhausted before escalating to specialized large-batch optimizers | TF-RD-018 | Next | medium-high |
| Keskar et al. (1609.04836) | Large-batch optimization and generalization risks should be checked empirically rather than assumed away | TF-RD-018 | Next | medium |
| Islamov et al. (2603.21191) | Recent theory and experiments reinforce regime-dependent batch-size gains and diminishing returns under a fixed-budget lens | TF-RD-018 | Next | medium |
| SGDR (1608.03983) | Schedule work remains relevant, but should support the promoted architecture rather than substitute for it | TF-RD-001, TF-RD-009 | Next | medium-high |
| muP (2203.03466) | Hyperparameter transfer across model sizes matters once one coherent row-first anchor exists | TF-RD-009 | Next | high |
| Chinchilla (2203.15556) | Compute-optimal reasoning belongs on the classification-first sandwich target, not on transient hybrid bridge surfaces | TF-RD-009 | Next | high |
| Kaplan et al. (2001.08361) | Scaling trends should be measured on one coherent family rather than across mixed architectural lines | TF-RD-009 | Next | high |
| Power Lines (2505.13738) | Small-scale scaling fits need explicit artifact paths and careful interpretation | TF-RD-009 | Next | medium-high |
| Broken Neural Scaling Laws (2210.14891) | Knees and plateaus must be treated as expected outcomes when the architecture target is still moving | TF-RD-009 | Next | medium |
| SAINT (2106.01342) | Row/column interaction changes should be benchmarked as explicit structural choices rather than imported as defaults | TF-RD-005, TF-RD-006 | Next | medium |
| Perceiver (2103.03206) | Latent bottlenecks remain a later alternative if row/column token count becomes the limiting factor | TF-RD-006, TF-RD-012 | Later | medium |
| EquiTabPFN (2502.06684) | Label-conditioning choices should stay modular so target handling can evolve after the classification-first sandwich family stabilizes | TF-RD-010, TF-RD-012 | Next | medium |
| TabDPT (2410.18164) | Prior/source changes are a later scaling lever and should not displace the core architecture migration prematurely | TF-RD-011, TF-RD-012 | Later | medium |
| Sentence-BERT (D19-1410) | Text-conditioned columns should remain an external-embedding later lane rather than a distraction from the classification-first backbone plan | TF-RD-012 | Later | high |
| nanochat (repo) | Compact-transformer training and residual-layout ideas are valid donors, but only when they preserve the tabular set-structured goal | TF-RD-002, TF-RD-007, TF-RD-009 | Now | high |
Per-Roadmap Evidence Notes
TF-RD-001: Control Freeze And Experiment Trust
- External evidence:
- TabPFN v2 and nanoTabPFN define the PFN-style control lineage that should stay frozen for benchmark trust
- Repo-local evidence:
- `tabfoundry_simple` and `stage=nano_exact` already serve as the cleanest PFN controls
- the current large-anchor `nano_exact + prenorm + row_cls` line is useful diagnostic evidence, but it is structurally hybrid and should not be treated as the promoted target by default
- Success signal:
- control-lane claims and target-lane claims are clearly separated in research docs and result interpretation
TF-RD-002: Measurement Surfaces For Architecture Migration
- External evidence:
- TabICL and the TabPFN analysis literature both imply that architecture interpretation needs more than final benchmark metrics
- nanochat is a good recipe reference for clean instrumentation and readable training diagnostics
- Repo-local evidence:
- the exact-prior path already emits module gradients, activation traces, and additive telemetry summaries
- the regular architecture-screen trainer still lacks that parity, so the most decision-critical gap is coverage on `cls_benchmark_staged_corpus`
- Success signal:
- row-first rows can be judged on stage-local stability and quality directly from the regular architecture-screen lane
TF-RD-003: Shared-Surface Unlock
- External evidence:
- TabICLv2 assumes the architecture target is already operating on a non-PFN-only embedding surface
- FT-Transformer and numerical-embedding references support moving feature handling out of the frozen nano path before testing richer structure
- Repo-local evidence:
- tokenizer overrides are ineffective under the nano encoder path
- the staged recipe ladder already treats `shared_norm` and `prenorm_block` as the bridge out of the PFN control lane
- Success signal:
- TabICL-inspired work happens on a surface where tokenization and later modules are actually active
TF-RD-004: Tokenization Migration
- External evidence:
- TabICLv2’s grouped embedding idea is the main directional reference
- FT-Transformer is the baseline that grouped tokens must justify departing from
- Deep Sets reinforces that tokenization should preserve a set-structured view of rows and columns
- Repo-local evidence:
- grouped tokens already exist as a staged recipe
- earlier tokenizer experiments on the nano path were not attributable
- Success signal:
- grouped tokens are validated as part of the row-first ladder rather than as a disconnected ablation
TF-RD-005: Row-Embedding Unlock
- External evidence:
- TabICLv2’s decisive architectural move is to form row embeddings before the final ICL stage
- SAINT is a useful benchmark-first comparison point for row interaction once tokenization is coherent
- Repo-local evidence:
- `row_cls_pool` exists as a staged step in the intended ladder
- compact-surface row-CLS evidence was strongly negative, but it was gathered on PFN-adjacent surfaces and should be scoped that way
- Success signal:
- the repo reaches clear separate yes/no answers on useful row embeddings and plain row-level context in the intended migration line
TF-RD-006: Column-Set Integration
- External evidence:
- Set Transformer is the default reference for column-set reasoning
- TabICLv2 and related row-first designs make column embedding a first-class stage rather than a late add-on
- Repo-local evidence:
- `column_set` exists as a staged step
- old TFCol evidence on compact surfaces was stable but near-neutral and expensive
- Success signal:
- TFCol is either promoted with explicit value evidence or deferred with clear cost-based reasoning
TF-RD-007: Row-Level Context And QASS Attribution
- External evidence:
- TabICLv2 provides the main QASS and row-level ICL reference
- Deep Sets and Set Transformer imply that QASS should be justified as a structural benefit, not presumed mandatory
- nanochat supports the repo stance that simpler non-QASS alternatives should stay easy to compare
- Repo-local evidence:
- QASS primitives already exist
- compact-surface QASS rows trained cleanly but did not prove enough value to promote the mechanism
- Success signal:
- the repo can separately answer whether row-level context helps and whether QASS helps beyond plain row-level context
TF-RD-008: Coherent Classification Anchor Promotion
- External evidence:
- TabICLv2 argues for a coherent row-first architecture, not a permanent hybrid of PFN control pieces and row-first readout patches
- TabPFN analysis work reinforces that robustness claims should be grounded in coherent model surfaces
- Repo-local evidence:
- the staged family contained two final row-first promotion candidates on the missing-permitting large bundle: `row_cls + qass + no tfcol` and `row_cls + qass + tfcol_heads4`
- `qass_tfcol_large_missing_validation_v1` closed on a mixed result: the TFCol row improved final Brier and ROC AUC, but its final log loss was slightly worse than the simpler no-TFCol control
- TF-RD-008 therefore settled on `row_cls + qass + no tfcol` as the default row-first anchor, with `row_cls + qass + tfcol_heads4` retained as a calibration-oriented alternative
- Success signal:
- one named default row-first classification anchor now exists and can serve as the target for future architecture and scaling work without erasing the retained calibration-oriented alternative
TF-RD-018: Training-Surface Adequacy On The Promoted Anchor
- External evidence:
- McCandlish et al. and Shallue et al. imply that useful batch size is a workload-specific diminishing-returns question, not a monotonic larger-is-better knob
- Smith et al. and the Adam-style surge paper imply that batch, LR, and schedule should be treated as coupled, but that Adam-family follow-up does not justify a universal linear LR scaling rule
- Nado et al. support exhausting strong Adam-family baselines before treating specialized large-batch optimizers as the default next step
- Keskar et al. remains the classic cautionary reference for explicit stability and generalization checks when batch size grows
- Repo-local evidence:
- TF-RD-013 settled `tf_rd_013_dagzoo_shape_aware_size_medium_v1` as the representative post-008 training-data surface
- `row_first_training_adequacy_v1` completed the first manifest-backed `task_batch_size` ladder on that medium surface under `#109`
- TF-RD-020 now records the adjacent harder dagzoo synthetic front winners before TF-RD-018 resumes optimizer-family follow-up
- issue `#147` now records the canonical TF-RD-020 harder-front ladder plus the matching `tf_rd_020_*_v1` corpus recipes
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_018_training_surface_adequacy.md`, with the scaling-specific handoff recorded in `reference/roadmap_evidence/tf_rd_009_scaling_law_measurement.md`
- Success signal:
- the repo retains a clear partial closeout record for the staged-control training-surface lane without treating the unfinished LR or clipping work as a blocker for sandwich-first planning
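The McCandlish et al. diminishing-returns framing cited above can be sketched numerically. The noise-scale value below is a hypothetical placeholder, not a measured repo quantity; in practice the gradient noise scale is estimated from per-batch gradient statistics on the actual workload.

```python
# Sketch of the McCandlish et al. (1812.06162) batch-size tradeoff.
# B_NOISE is hypothetical; it stands in for an empirically estimated
# gradient noise scale on the training surface under study.

def steps_and_examples(batch_size: float, b_noise: float) -> tuple[float, float]:
    """Relative cost of reaching a fixed loss target at a given batch size.

    steps_ratio    = S / S_min = 1 + b_noise / batch_size  (serial steps)
    examples_ratio = E / E_min = 1 + batch_size / b_noise  (total examples)
    """
    return 1.0 + b_noise / batch_size, 1.0 + batch_size / b_noise

B_NOISE = 512.0  # hypothetical noise scale, for illustration only

for b in (64, 256, 512, 2048, 8192):
    steps, examples = steps_and_examples(b, B_NOISE)
    print(f"batch={b:5d}  steps_ratio={steps:5.2f}  examples_ratio={examples:6.2f}")
```

The printout makes the ladder logic concrete: step savings flatten out once the batch passes the noise scale, while the example (compute) cost keeps growing, which is why the ladder picks a preferred rung rather than the largest feasible batch.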
TF-RD-021: Steering-Derived Dagzoo Corpus Fronts On The Promoted Anchor
- External evidence:
- dedicated literature for coverage-steered synthetic harder fronts is not yet curated in this repo
- the existing TabPFN-v2 analysis note implies that meta-feature sensitivity should be tested structurally rather than inferred from one frozen surface
- Repo-local evidence:
- TF-RD-020 closed as historical staged-control harder-front context under `#146`/`#149`; it is no longer the active carried classification slice
- TF-RD-010 now owns the first carried sandwich dagzoo many-class plus missingness slice that steering will attempt to improve
- dagzoo issue `bensonlee5/dagzoo#246` now defines the upstream steering implementation, metadata, and diagnostics lane
- tab-foundry issues `#165` and `#167` now own the local steering-derived continuation and first sweep contract
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_021_steering_derived_dagzoo_corpus_fronts.md`
- Success signal:
- the repo records one explicit keep/defer decision on whether a steering-derived corpus front changes the carried sandwich dagzoo slice
TF-RD-022: Training Runtime And VRAM Efficiency Before Classification Scaling
- External evidence:
- dedicated runtime-policy literature is not yet curated in this repo
- the next sources to curate are PyTorch bf16, activation-checkpointing, and throughput or memory telemetry references for A100-class training
- Repo-local evidence:
- deferred issue `#58` already tracks runtime or VRAM summary work but stayed attached to the earlier TF-RD-002 measurement chain
- epic `#168` plus child issues `#169`, `#170`, and `#171` now give the runtime lane an explicit roadmap home
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_022_training_runtime_vram_efficiency.md`
- sandwich architecture ownership now lives under the historical implementation record `#174`, active umbrella `#178`, and completed anchor-retention decision `#184`; this runtime lane is a dependency surface for later classification work rather than the owner of sandwich planning
- training telemetry and benchmark-registry artifacts now preserve runtime summaries and regime-budget metadata needed for later runtime-policy and scaling comparisons
- Success signal:
- the repo records one explicit kernel/runtime policy with peak-memory and throughput evidence, and later scaling work can inherit it
TF-RD-021A: Fixed-Latent Sandwich Candidate And NanoTabPFN Screen
- External evidence:
- Perceiver, Set Transformer, SAINT, and PFN-style tabular references justify latent bottlenecks, set-style aggregation, and train-conditioned tabular reads as relevant design space
- Repo-local evidence:
- `model.arch=tabfoundry_sandwich` now exists as a sandwich architecture family, and TF-RD-021A is now closed as negative evidence for the earlier summary-bottleneck replay
- issue `#174` records the implementation landing
- issue `#178` now owns long-running sandwich stabilization and iteration
- issue `#179` owned the immediate nanoTabPFN latent-count and width screen and now closes on row `01` as stable-but-underpowered evidence
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_021a_latent_bank_sandwich_prototype.md`
- Success signal:
- the repo records one explicit closeout for the summary-bottleneck replay and leaves the abandoned latent/width ladder as deferred backlog rather than pretending it is still the next execution target
TF-RD-021B: Hybrid Full-Cell Sandwich Successor, Simplification, And Classification-First Scaling Prep
- External evidence:
- PerceiverIO-style query readout and latent-bottleneck references justify reopening the input and readout path after the summary-bottleneck replay underfit
- Repo-local evidence:
- successor replay issue `#181` now records the first bounded replay for the hybrid full-cell sandwich successor under umbrella issue `#178`
- `tabfoundry_sandwich` now uses a hybrid stage-0 full-cell-plus-summary read, later summary-only repeated stages, and latent-then-full-cell readout
- the compact hybrid control `tf_rd_021b_hybrid_full_cell_compact_prior_v1` is now benchmarked at final ROC AUC `0.7370`, final log loss `0.4672`, and final Brier `0.3072` on the pinned medium binary bundle without an external comparator
- child issues `#182` and `#183` closed the 9-run knob screen and bounded width or head follow-up, while `#184` completed the four-row removal-first package and kept the compact hybrid control as the carry-forward parent
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_021b_hybrid_full_cell_sandwich_successor.md`
- Success signal:
- the repo records the bounded TF-RD-021B simplification package, explicitly keeps the compact hybrid control as the sandwich parent, and carries that parent onto harder classification surfaces before any single-toggle scaling recipe is considered
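The three benchmark numbers reported for the compact hybrid control (ROC AUC, log loss, Brier) can be illustrated with a toy computation. The labels and probabilities below are made up; the repo's benchmark harness, not this sketch, produces the canonical values.

```python
# Toy illustration of the three reported classification metrics.
# Data is synthetic; only the metric definitions are the point here.
import math

def roc_auc(y_true, y_prob):
    """AUC via the Mann-Whitney rank statistic (binary labels)."""
    pos = [p for p, y in zip(y_prob, y_true) if y == 1]
    neg = [p for p, y in zip(y_prob, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

def brier(y_true, y_prob):
    """Mean squared error between probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y = [0, 0, 1, 1, 1, 0]
p = [0.2, 0.4, 0.7, 0.9, 0.5, 0.1]
print(roc_auc(y, p), log_loss(y, p), brier(y, p))
```

The mixed TF-RD-008 and TF-RD-021B results above are exactly the kind this trio can disagree on: AUC is rank-only, while log loss and Brier are calibration-sensitive, so one model can win on Brier and AUC while losing slightly on log loss.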
TF-RD-009: Scaling-Law Design And Measurement On The Classification-First Sandwich Target
- External evidence:
- Chinchilla, Kaplan, Power Lines, and Broken Neural Scaling Laws all require a stable family and comparable artifacts
- μP is the strongest width-transfer prior, while newer width-depth work should be treated as theory-informed but still empirical for sandwich
- the TF-RD-018 batch/LR literature and newer optimizer-budget scaling work remain useful historical context, but the active scaling path now depends on sandwich-specific dagzoo, steering, and runtime decisions
- Repo-local evidence:
- tuning and comparison tooling are already present
- runtime-summary and regime-budget artifacts now exist, but scaling-law artifacts are not yet canonical on the carried dagzoo classification slice
- TF-RD-018 is now historical closeout evidence only and is no longer a blocker for the first sandwich scaling fit
- TF-RD-022 must still hand back one measured runtime policy
- TF-RD-021 must still hand back one keep/defer steering decision on the carried dagzoo slice
- the completed TF-RD-021B keep-current-anchor decision under `#184` is the precursor for the hybrid sandwich family and does not close TF-RD-009 by itself
- the canonical long-form note now lives in `reference/roadmap_evidence/tf_rd_009_scaling_law_measurement.md`
- Success signal:
- the repo fits classification scaling laws on the simplified sandwich family under one harder dagzoo slice, one inherited runtime policy, and one matched regime-budget contract
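The kind of fit the success signal describes can be sketched in its simplest Kaplan-style form, a power law over one coherent family. The `(N, loss)` points below are synthetic; a real TF-RD-009 fit would use matched-budget sandwich runs and the recorded artifacts.

```python
# Minimal sketch of a Kaplan-style power-law fit L(N) = a * N**(-b)
# over one model family. Points are synthetic, for illustration only.
import math

def fit_power_law(ns, losses):
    """Least-squares line in log-log space: log L = log a - b * log N."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points generated from a = 5.0, b = 0.3.
ns = [1e5, 1e6, 1e7, 1e8]
losses = [5.0 * n ** -0.3 for n in ns]
a, b = fit_power_law(ns, losses)
print(f"a={a:.2f} b={b:.3f}")
```

Broken Neural Scaling Laws is the caveat on this sketch: when the architecture target is still moving, knees and plateaus make a single log-log line a hypothesis to test, not an assumption.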
TF-RD-010: Benchmark-Defined Multiclass Evolution On The Classification-First Sandwich Target
- External evidence:
- label-conditioning work such as EquiTabPFN matters more after the backbone is coherent
- Repo-local evidence:
- `dagzoo` is now the explicit owner of the synthetic training fronts
- tab-realdata-hub issue `#1` is now the upstream owner of the medium and large classification validation bundles and manifest materialization flow
- the evolved sandwich benchmark config fixes FiLM, `sandwich_summary_tokens_per_axis=3`, `many_class_base=10`, and a direct multiclass head
- TF-RD-010 now has explicit draft medium and large sweep contracts rather than only a conceptual carried slice
- Success signal:
- the classification benchmark contract is evaluated through a fixed dagzoo-to-hub linkage on the evolved sandwich family
TF-RD-011: Repo-Wide Enablers And Contract Fidelity
- External evidence:
- TabDPT and related scaling work justify keeping source/prior flexibility in view, but not ahead of the architecture target
- Repo-local evidence:
- manifest-backed training, shared preprocessing work, and v3 export foundations already exist
- corpus provenance and end-to-end contract fidelity still need work
- Success signal:
- the architecture program can rely on data and contract surfaces that are trustworthy enough for promotion decisions
TF-RD-012: Inference Handoff And Later Modalities
- External evidence:
- text-conditioning references and later modality papers belong to a later lane once the classification anchor is stable
- Repo-local evidence:
- regression is intentionally removed today
- runtime handoff remains partial and should follow the classification-first sandwich base
- Success signal:
- later prediction modes and downstream handoff build on the classification-first sandwich base instead of competing with the main architecture program
Evidence Limits And Assumptions
- This is a planning artifact, not a reproduction benchmark.
- The roadmap is TabICLv2-inspired, not a literal TabICLv2 parity plan.
- Repo-local negative evidence for row-CLS, TFCol, and QASS should be scoped to the surfaces on which it was gathered, especially the compact `nano_exact` line.
- PFN-style controls remain necessary even if the long-term target becomes more row-first.
- Sequence-order-centric mechanisms remain low priority unless a tabular justification appears.