Papers And References
Start here when you want the reading list that actually informs architecture,
training, and scaling decisions in tab-foundry.
This list overlaps with ~/dev/dagzoo/reference, but the lens here is
architecture, training recipe, and scaling predictability rather than data
generation.
Use these alongside this reference set:
- Design decisions and repo structure: docs/development/design-decisions.md
- Roadmap: docs/development/roadmap.md
- Workflow runbooks: docs/workflows.md
- Evidence mapping: reference/evidence.md
Adoption Tiers For This Repo
Use this directory to make explicit judgments, not just collect papers.
Likely adopt
- numerical embeddings, including ordinal-as-numeric handling unless monotonicity is the central requirement
- categorical entity embeddings as the default baseline before heavier contextualization
- compact-transformer recipe ideas from nanochat when they do not depend on sequence order
- typed column encoders that preserve set- and permutation-aware structure
Benchmark first
- TabTransformer contextual categorical embeddings
- SAINT row/column attention and Perceiver-style latent bottlenecks
- ColBERT-like late interaction or other factorized interaction patterns
- table-aware text models
- expensive optimizers beyond current defaults, especially because the intended transformer family is small
- retrieval-heavy or interaction-heavy mechanisms that materially change the inductive bias
Later / TF-RD-012
- text-conditioned column handling via external text embeddings and table-aware text encoders
Probably low relevance
- RoPE and similar positional schemes whose main value comes from token order in language
- causal-LM-specific sequence machinery that does not cleanly transfer to unordered rows or columns
Tabular Foundation Models
These are the core domain papers. Many overlap with dagzoo’s collection but are read here through an architecture and training lens.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2602.11139 | TabICLv2: A better, faster, scalable, and open tabular foundation model | Primary external architecture reference for the row-first target lane. Defines QASS attention, feature tokenization, and the staged training recipe that informs the migration ladder. | https://arxiv.org/abs/2602.11139 |
| 2502.05564 | TabICL: A Tabular Foundation Model for In-Context Learning on Large Data | Predecessor architecture; curriculum and staged complexity training details that inform the training loop design. | https://arxiv.org/abs/2502.05564 |
| n/a | Accurate predictions on small data with a tabular foundation model (TabPFN v2, Nature 2025) | Core PFN architecture and synthetic prior design; attention patterns and in-context learning mechanics that underpin the model family. | https://doi.org/10.1038/s41586-024-08328-6 |
| 2502.17361 | A Closer Look at TabPFN v2: Understanding Its Strengths and Extending Its Capabilities | Strengths/limitations analysis; meta-feature sensitivity insights relevant to architecture robustness. | https://arxiv.org/abs/2502.17361 |
| 2410.18164 | TabDPT: Scaling Tabular Foundation Models on Real Data | Real-data pretraining as an alternative to purely synthetic training; informs architecture decisions around data source flexibility. | https://arxiv.org/abs/2410.18164 |
| 2502.02527 | TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems | Bias/variance control and large-scale adaptation techniques; relevant to scaling architecture decisions. | https://arxiv.org/abs/2502.02527 |
| 2311.10609 | Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks | Feature selection and compression for high-dimensional scalability; informs the feature tokenization pipeline. | https://arxiv.org/abs/2311.10609 |
| 2402.11137 | TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks | Parameter-efficient PFN adaptation; context optimization patterns relevant to architecture modularity. | https://arxiv.org/abs/2402.11137 |
| 2406.05207 | Retrieval & Fine-Tuning for In-Context Tabular Models (LoCalPFN) | Retrieval-conditioned adaptation; architecture patterns for dataset-aware inference. | https://arxiv.org/abs/2406.05207 |
| 2502.06684 | EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Networks | Target-permutation equivariance; robust label-space handling relevant to readout head design. | https://arxiv.org/abs/2502.06684 |
| 2411.10634 | Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data | Architectural choices for temporal shift robustness; informs backbone design for distribution-shift tolerance. | https://arxiv.org/abs/2411.10634 |
Set And Permutation-Aware Modeling
These references matter because rows and columns in tabular models are closer to sets than to language sequences. They should generally outrank generic LLM positional ideas when the two conflict.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 1703.06114 | Deep Sets | Canonical permutation-invariant baseline. Useful as the default mental model when deciding whether a row/column mechanism should care about order at all. | https://arxiv.org/abs/1703.06114 |
| 1810.00825 | Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks | Attention over unordered sets with inducing-point scaling. Especially relevant because the repo already uses ISAB-like components in the column path. | https://proceedings.mlr.press/v97/lee19d.html |
Typed Column Encoders
Treat tokenization here as typed column encoding, not just as a generic feature projection step. The question is how numerical, categorical, ordinal, and later text-conditioned columns should enter a small set-structured backbone.
Default typed-column policy:
- numerical and ordinal columns should default to numeric-style encoders
- categorical columns should default to learned embeddings before more complex contextualization
- text columns should default to external text embeddings first, not raw subword tokenization inside the main table backbone
- set/permutation-aware structure remains the governing constraint for how typed tokens enter the shared model
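
The default policy above can be read as a small dispatch table. The sketch below is only an illustration: the `ColumnType` enum, the encoder labels, and `choose_encoder` are hypothetical names, not identifiers from the tab-foundry codebase, and the policy does not say what replaces the numeric default when monotonicity is required, so that case is surfaced rather than resolved.

```python
from enum import Enum


class ColumnType(Enum):
    NUMERICAL = "numerical"
    ORDINAL = "ordinal"
    CATEGORICAL = "categorical"
    TEXT = "text"


# Hypothetical encoder labels; they only mirror the default policy in the text.
DEFAULT_ENCODER = {
    ColumnType.NUMERICAL: "numeric_embedding",
    ColumnType.ORDINAL: "numeric_embedding",
    ColumnType.CATEGORICAL: "learned_embedding",
    ColumnType.TEXT: "external_text_embedding",
}


def choose_encoder(column_type: ColumnType, monotonicity_required: bool = False) -> str:
    """Return the default encoder label for a column type."""
    if column_type is ColumnType.ORDINAL and monotonicity_required:
        # The policy leaves this case open, so flag it instead of guessing.
        return "needs_explicit_decision"
    return DEFAULT_ENCODER[column_type]
```

Keeping the mapping as data rather than branching logic makes it easy to cite in an architecture ticket and to override per experiment.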
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2106.11959 | Revisiting Deep Learning Models for Tabular Data (FT-Transformer) | Baseline per-feature tokenization reference. Useful as the simple typed-column baseline before heavier encoder choices are justified. | https://arxiv.org/abs/2106.11959 |
Numerical And Ordinal Columns
These are likely higher-leverage than sequence positional tricks for a small tabular transformer. Default repo stance: ordinals should usually be treated as numeric unless monotonicity is a central modeling requirement.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2203.05556 | On Embeddings for Numerical Features in Tabular Deep Learning | Direct reference for scalar-to-vector value embeddings. High-priority source for feature/value tokenization ideas that fit unordered tabular structure. | https://arxiv.org/abs/2203.05556 |
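
As a concrete picture of a scalar-to-vector value embedding, here is a minimal pure-Python sketch of piecewise linear encoding, one of the schemes discussed in the numerical-embeddings paper above. The function name and the bin edges are illustrative assumptions; in practice the edges would come from quantiles or a tree-based split, and the encoded vector would feed a learned projection.

```python
from typing import List, Sequence


def piecewise_linear_encode(x: float, edges: Sequence[float]) -> List[float]:
    """Encode a scalar against sorted bin edges b_0 < b_1 < ... < b_T.

    Bins entirely below x saturate at 1, bins entirely above x stay at 0,
    and the bin containing x holds the fractional position inside it.
    The first and last bins are left unclamped so out-of-range values
    extrapolate linearly.
    """
    if len(edges) < 2 or any(lo >= hi for lo, hi in zip(edges, edges[1:])):
        raise ValueError("edges must be strictly increasing with at least two values")
    n_bins = len(edges) - 1
    encoded = []
    for t in range(n_bins):
        lo, hi = edges[t], edges[t + 1]
        if x < lo and t > 0:
            encoded.append(0.0)
        elif x >= hi and t < n_bins - 1:
            encoded.append(1.0)
        else:
            encoded.append((x - lo) / (hi - lo))
    return encoded
```

For example, with edges `[0, 1, 2, 4]` the value `1.5` fills the first bin, sits halfway through the second, and leaves the third empty. The encoding depends only on the value and its column's bins, so it fits unordered tabular structure.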
Categorical Columns
Categorical columns should default to learned embeddings before more expensive contextualization is introduced.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 1604.06737 | Entity Embeddings of Categorical Variables | Likely-adopt baseline for categorical columns, especially when cardinality is high enough that one-hot handling is a poor fit. | https://arxiv.org/abs/1604.06737 |
| 2012.06678 | TabTransformer: Tabular Data Modeling Using Contextual Embeddings | Benchmark-first contextualization reference for categorical columns. Useful when plain learned embeddings are not enough. | https://arxiv.org/abs/2012.06678 |
Text-Conditioned Columns (Later / TF-RD-012)
These references belong to the later text-conditioned-input lane. The default direction is to use external text embeddings first, not raw subword tokenization inside the small tabular backbone.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| D19-1410 | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks | Practical baseline for turning free-text cells or column values into fixed embeddings before they enter the tabular model. | https://aclanthology.org/D19-1410/ |
| 2020.acl-main.745 | TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data | Benchmark-first table-aware text model. Relevant if later work needs richer joint text-table reasoning rather than external text embeddings alone. | https://aclanthology.org/2020.acl-main.745/ |
| 2006.14806 | TURL: Table Understanding through Representation Learning | Later benchmark reference for structure-aware table/text representations. Useful when text-conditioned columns become a first-class roadmap item. | https://arxiv.org/abs/2006.14806 |
Scaling Row/Column Token Counts
These are benchmark-first references for scaling row/column interaction costs. They are architecture references, not default tokenization choices.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2106.01342 | SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training | Benchmark-first row/column attention reference. Useful if typed encoders become good enough that inter-row attention is the next bottleneck. | https://arxiv.org/abs/2106.01342 |
| 2103.03206 | Perceiver: General Perception with Iterative Attention | Latent-bottleneck reference for scaling to very large token counts. Relevant if row/column counts make full attention too expensive. | https://proceedings.mlr.press/v139/jaegle21a.html |
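
To make the scaling argument concrete, the sketch below counts query-key pairs for full self-attention versus a Perceiver-style latent bottleneck. It is a rough cost model under stated assumptions: one cross-attention from the inputs into the latents followed by a chosen number of latent self-attention layers, ignoring heads, feature width, and constant factors. The function names are illustrative.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs for one full self-attention layer over n_tokens."""
    return n_tokens * n_tokens


def latent_bottleneck_pairs(n_tokens: int, n_latents: int, latent_layers: int = 1) -> int:
    """Query-key pairs for one cross-attention into n_latents latents,
    followed by latent_layers self-attention layers over the latents."""
    cross = n_tokens * n_latents
    latent_self = latent_layers * n_latents * n_latents
    return cross + latent_self
```

With 10,000 row or column tokens and 128 latents, the bottleneck cost grows linearly in the token count, while full attention grows quadratically.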
Late Interaction And Factorized Matching
These are benchmark-first references. They are interesting because tables are relatively order-light, but they should not become the default backbone assumption without evidence.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2004.12832 | ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT | Factorized late interaction is a plausible row/column matching pattern for tabular models, but it is a benchmark-first hypothesis rather than a default adopt item. | https://arxiv.org/abs/2004.12832 |
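
For readers unfamiliar with late interaction, the following is a minimal sketch of ColBERT-style MaxSim scoring: each query token takes its best match among the candidate tokens, and the per-token maxima are summed. It uses a plain dot product on Python lists and omits the embedding normalization used in the paper; how row or column tokens would play the query and candidate roles in a tabular model is an open assumption.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens: Sequence[Vector], candidate_tokens: Sequence[Vector]) -> float:
    """Sum over query tokens of the maximum similarity to any candidate token."""
    if not query_tokens or not candidate_tokens:
        raise ValueError("both token sets must be non-empty")
    return sum(
        max(dot(q, c) for c in candidate_tokens)
        for q in query_tokens
    )
```

The score is unchanged when either token set is reordered, which is part of why the pattern is a plausible fit for order-light tables.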
Compact Transformers And Training Recipes
Papers that directly serve the short-run training constraint and modular architecture goals. These are new to tab-foundry (not in dagzoo).
Default filter for this section:
- borrow compact-transformer recipe ideas when they do not depend on sequence order
- treat nanochat as the main practical repo reference for FFN choice, residual/pre-norm block hygiene, optimizer partitioning, and model sizing
- treat ReLU^2 as a concrete candidate for the baseline FFN recipe
- spend optimization budget more freely than an LLM repo would, because the target transformer family is small
- keep RoPE and similar language-order mechanisms low priority unless a tabular-specific use case appears
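
For reference, ReLU^2 is the squared rectifier, `max(0, x)^2`. The sketch below shows it inside a bias-free two-layer FFN written with plain Python lists; the function names and the bias-free layout are illustrative assumptions, not a description of any particular repo's block.

```python
from typing import List, Sequence


def relu_squared(x: float) -> float:
    """Squared rectifier: zero for negative inputs, x**2 otherwise."""
    r = max(0.0, x)
    return r * r


def matvec(weights: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu_squared_ffn(x: Sequence[float],
                     w_in: Sequence[Sequence[float]],
                     w_out: Sequence[Sequence[float]]) -> List[float]:
    """Bias-free FFN: w_out @ relu_squared(w_in @ x)."""
    hidden = [relu_squared(h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)
```

Nothing in the activation depends on token position, so it passes the sequence-order filter stated above.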
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2203.15556 | Training Compute-Optimal Large Language Models (Chinchilla) | Core reference for the repo’s primary goal of scaling predictability. Defines the methodology for fitting compute-optimal curves across model sizes. | https://arxiv.org/abs/2203.15556 |
| 2001.08361 | Scaling Laws for Neural Language Models | Foundational scaling law methodology; establishes the power-law relationships between compute, data, and model size that this repo aims to replicate for tabular models. | https://arxiv.org/abs/2001.08361 |
| 2203.03466 | Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP) | Width-independent hyperparameter transfer across model sizes. Directly relevant to scaling-law measurement on the classification-first sandwich target (TF-RD-009). | https://arxiv.org/abs/2203.03466 |
| 1608.03983 | SGDR: Stochastic Gradient Descent with Warm Restarts | Cosine annealing with warm restarts; relevant to training-recipe work that supports the classification-first sandwich target rather than replacing architecture migration (TF-RD-009). | https://arxiv.org/abs/1608.03983 |
| 2412.19437 | DeepSeek-V3 Technical Report | Multi-token prediction and modern training-recipe details, taken from a large mixture-of-experts LLM rather than a compact transformer. Informs architecture choices for cross-feature dependency modeling. | https://arxiv.org/abs/2412.19437 |
Training-Surface Adequacy And Batch/LR Scaling
These references anchor the TF-RD-018 literature note and the later handoff from historical training-surface adequacy into scaling work on the classification-first sandwich target.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 1812.06162 | An Empirical Model of Large-Batch Training | Best-fit reference for critical-batch and diminishing-returns framing. Supports treating TF-RD-018 as a search for the largest useful manifest task batch on the settled medium surface rather than assuming larger is always better. | https://arxiv.org/abs/1812.06162 |
| 1811.03600 | Measuring the Effects of Data Parallelism on Neural Network Training | Shows that large-batch payoff is workload-specific and should be judged under matched time and compute budgets. Relevant to TF-RD-018 runtime gates and later TF-RD-009 interpretation. | https://arxiv.org/abs/1811.03600 |
| 1711.00489 | Don’t Decay the Learning Rate, Increase the Batch Size | Useful reference for treating batch size and schedule as coupled knobs instead of independent sweeps. Relevant after TF-RD-018 chooses a preferred batch rung. | https://arxiv.org/abs/1711.00489 |
| 1706.02677 | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour | Canonical linear-LR-scaling plus warmup reference. Useful as a baseline heuristic, but not a universal rule for Adam-style TF-RD-018 follow-up. | https://arxiv.org/abs/1706.02677 |
| 2405.14578 | Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling | Important caution that Adam-style optimizers need not follow simple linear LR scaling with batch size. Directly relevant to LR retuning after TF-RD-018 settles the batch rung. | https://arxiv.org/abs/2405.14578 |
| 2102.06356 | A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes | Supports exhausting strong Adam-family baselines before escalating to specialized large-batch optimizers. Relevant to TF-RD-018 optimizer-family follow-up. | https://arxiv.org/abs/2102.06356 |
| 1609.04836 | On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima | Classic cautionary reference for large-batch optimization and generalization risk. Useful as a failure-mode reminder, but should be interpreted alongside later matched-budget results. | https://arxiv.org/abs/1609.04836 |
| 2603.21191 | On the Role of Batch Size in Stochastic Conditional Gradient Methods | Recent batch-size scaling paper that reinforces regime-dependent gains and diminishing returns under a fixed-budget lens. Conceptually relevant to TF-RD-018, but not a direct AdamW prescription because the optimizer family differs. | https://arxiv.org/abs/2603.21191 |
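
The linear-scaling-plus-warmup heuristic from the large-minibatch SGD reference can be written in a few lines. This is a sketch of the baseline heuristic only; as the table notes, Adam-style optimizers need not follow it, so the scaled value is a starting point for retuning, not a prescription. Function and parameter names are illustrative.

```python
def linear_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate in proportion to the batch size."""
    if base_batch <= 0 or batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch / base_batch


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps
```

For instance, moving from batch 256 at learning rate 0.1 to batch 1024 gives a target of 0.4, approached gradually during warmup.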
Scaling Laws
Dedicated section since scaling predictability is the repo’s primary goal.
| arXiv ID | Title | Why it matters for tab-foundry | Source |
|---|---|---|---|
| 2203.15556 | Training Compute-Optimal Large Language Models (Chinchilla) | Direct methodology reference for fitting compute-optimal curves. The primary template for the scaling-law measurement work (TF-RD-009). | https://arxiv.org/abs/2203.15556 |
| 2001.08361 | Scaling Laws for Neural Language Models | Foundational; establishes power-law relationships between compute budget, dataset size, and model parameters. | https://arxiv.org/abs/2001.08361 |
| 2505.13738 | Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training | Scaling law methodology refinements; directly relevant to building the scaling predictability measurement infrastructure. | https://arxiv.org/abs/2505.13738 |
| 2305.16264 | Scaling Data-Constrained Language Models | Relevant when synthetic data budget is the bottleneck; analyzes how data repetition and mixing affect scaling behavior. | https://arxiv.org/abs/2305.16264 |
| 2210.14891 | Broken Neural Scaling Laws | Explains why scaling isn’t always a simple power law; critical for diagnosing knees and plateaus in scaling curves. | https://arxiv.org/abs/2210.14891 |
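
The simplest version of the curve fitting these papers describe is a single power law, `y = a * x^(-b)`, fitted by least squares in log-log space. The sketch below shows only that baseline: the cited papers use richer forms (for example an irreducible-loss term or smoothly broken power laws), and the function name is an illustrative assumption.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y).

    Returns (a, b). All inputs must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```

A poor fit on held-out points in log-log space is a useful early sign of the knees and plateaus that the broken-scaling-law reference discusses.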
External Repo References
Adjacent repo references that inform architecture and training decisions. Detailed borrowing rules live below.
| Repo | Link | Why it matters |
|---|---|---|
| nanoTabPFN | https://github.com/automl/nanoTabPFN | Benchmark comparison target; training recipe choices and model sizing decisions provide a direct baseline for tab-foundry. |
| nanochat | https://github.com/karpathy/nanochat | Optimizer, LR schedule, model sizing, and clean training loop patterns. Reference for compact transformer training infrastructure. |
| Muon optimizer | https://github.com/KellerJordan/Muon | Optimizer that orthogonalizes momentum-based updates for 2D weight matrices using a Newton-Schulz iteration. Relevant to later training-recipe work on the classification-first sandwich target (TF-RD-009). |
External Baseline Borrowing Rules
Default transfer rule for repo references:
- borrow compact-transformer recipe ideas when they do not depend on sequence order
- prefer set- and permutation-aware references for row and column structure
- treat language-sequence positional machinery as low priority by default
nanoTabPFN
- URL: https://github.com/automl/nanoTabPFN
- Roadmap relevance: TF-RD-001 (PFN control lane), TF-RD-008 (anchor promotion)
What it does well:
- clean, minimal TabPFN training implementation with good defaults
- establishes a concrete performance baseline for short-run tabular PFN training
- model sizing choices provide a known-good starting point
What to absorb:
- training recipe as a named baseline config
- depth/width/head sizing conventions as a reference point for model size sweeps
- evaluation protocol for direct comparison
Success signal:
- tab-foundry reaches parity with nanoTabPFN on the same benchmark suite and training budget
- at least one tab-foundry config is directly derived from nanoTabPFN’s recipe and runs through the benchmark pipeline
nanochat
- URL: https://github.com/karpathy/nanochat
- Roadmap relevance: TF-RD-002 (measurement surfaces), TF-RD-007 (QASS attribution), TF-RD-009 (training and scaling work on the classification-first sandwich target)
What it does well:
- minimalist, high-quality reference for modern transformer training pipelines
- clean integration of Muon with a standard training loop
- strong compact-transformer backbone hygiene: pre-norm residual layout, simple block structure, and small-model sizing discipline
- readable model definition and training loop that separates concerns well
- good LR schedule implementation with warmup and cosine decay
What to absorb:
- ReLU^2 as a concrete compact-transformer FFN candidate instead of defaulting automatically to GELU
- pre-norm residual simplicity and readable block structure where it fits tabular modeling
- Muon plus AdamW parameter-group partitioning
- warmup and cosine schedule patterns
- depth/width/head scaling conventions
- value/token encoding ideas only where they translate cleanly to tabular value tokenization
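
As a reference point for the warmup and cosine schedule pattern, here is a minimal, framework-free sketch. It is a generic formulation, not a transcription of nanochat's schedule; the function name, the `(step + 1)` warmup convention, and the `min_lr` floor are assumptions.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        # (step + 1) keeps the first step's learning rate non-zero.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Exposing the schedule as a pure function of the step makes it easy to register under a name in sweep infrastructure and to unit-test independently of the optimizer.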
What not to copy by default:
- causal-LM assumptions
- sequence-order positional encodings such as RoPE
- autoregressive masking or next-token machinery whose value depends on token order
- mechanisms whose main benefit is language-style sequence structure rather than unordered row/column interactions
Success signal:
- the tab-foundry training loop reaches comparable clarity and separation of concerns
- Muon integration works correctly for tabular parameter groups
- schedule patterns from nanochat are available as named options in sweep infrastructure
- a compact tabular baseline can resemble nanochat’s recipe choices while still using set-structured inductive biases
Set / Permutation-Aware Priority Note
- Type: paper-backed priority rule rather than a single external repo
- Roadmap relevance: TF-RD-004 (tokenization migration), TF-RD-005 (row-embedding unlock), TF-RD-007 (QASS attribution)
Why it matters:
- rows and columns in tabular models are much closer to sets than to text sequences
- when repo references and paper references disagree, set- and permutation-aware work should usually win for row/column structure
- this is the main reason to elevate Deep Sets and Set Transformer ahead of generic LLM positional tricks
What to absorb:
- permutation invariance as the default test for row/column mechanisms
- set-style attention patterns before language-style positional patterns
- factorized or late interaction ideas only as benchmark-first hypotheses, not as default structure
Success signal:
- architecture ablations are framed in terms of set structure and interaction patterns rather than porting LLM sequence machinery
- new row/column blocks justify any positional assumptions explicitly instead of inheriting them by default
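
One way to make "permutation invariance as the default test" operational is an exhaustive check on a small input. The sketch below is illustrative: `deep_sets_pool` is a toy Deep Sets style readout (`rho(sum(phi(x)))`), `position_weighted` is a deliberately order-dependent counterexample, and the exhaustive loop is only feasible for a handful of items. A real test would sample permutations and compare tensors.

```python
import itertools
import math
from typing import Callable, Sequence


def deep_sets_pool(values: Sequence[float]) -> float:
    """Toy Deep Sets readout: rho(sum(phi(x))) with phi(x) = x**2, rho = sqrt."""
    return math.sqrt(sum(v * v for v in values))


def position_weighted(values: Sequence[float]) -> float:
    """Order-dependent counterexample: each value is scaled by its index."""
    return sum(i * v for i, v in enumerate(values))


def is_permutation_invariant(fn: Callable[[Sequence[float]], float],
                             items: Sequence[float],
                             tol: float = 1e-9) -> bool:
    """Check fn against every ordering of items (small inputs only)."""
    reference = fn(list(items))
    return all(
        abs(fn(list(perm)) - reference) <= tol
        for perm in itertools.permutations(items)
    )
```

A block that fails this kind of check is not necessarily wrong, but by the rule above it should state why it needs the positional assumption.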
Muon Optimizer
- URL: https://github.com/KellerJordan/Muon
- Roadmap relevance: TF-RD-009 (training work on the classification-first sandwich target)
What it does well:
- orthogonalizes the momentum-based update for each 2D weight matrix via a Newton-Schulz iteration
- has shown faster convergence and lower post-warmup variance than AdamW in compact transformer settings
What to absorb:
- apply Muon to weight matrices only, with embeddings and biases staying on AdamW
- use Muon’s default learning rate and momentum as sweep starting points
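
The partitioning rule above can be sketched without any framework by splitting parameters on shape and name. Everything here is an illustrative assumption: the parameter names, the `embed` prefix convention, and the `partition_parameters` helper are hypothetical, and a real implementation would iterate over a model's named parameters and build optimizer parameter groups from the result.

```python
from typing import Dict, List, Sequence, Tuple


def partition_parameters(shapes: Dict[str, Tuple[int, ...]],
                         embedding_prefixes: Sequence[str] = ("embed",)) -> Dict[str, List[str]]:
    """Assign 2D non-embedding weights to Muon and everything else to AdamW."""
    groups: Dict[str, List[str]] = {"muon": [], "adamw": []}
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = any(name.startswith(prefix) for prefix in embedding_prefixes)
        # Biases and norm gains are 1D, so they fall through to AdamW.
        key = "muon" if is_matrix and not is_embedding else "adamw"
        groups[key].append(name)
    return groups
```

Checking that the two groups are disjoint and together cover every parameter is a cheap guard for the "proper parameter-group assignment" success signal.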
Success signal:
- Muon runs correctly on tab-foundry with proper parameter-group assignment
- optimizer sweeps compare Muon and AdamW under matched budgets
Optimizer Watchlist
These are intentionally not yet curated as primary-source references:
- Polar Express: keep as a watchlist item until a primary paper or official technical source is available
- general rule: expensive optimizers are more viable here than in frontier-scale LLM settings because the target transformer family is small
- watchlist items should graduate into the curated reference list only after a primary source is available
Usage Contract
- Major architecture tickets should cite the relevant references from this directory.
- Each reference note should say why it matters and what signal would count as success or failure.
- New papers should be added here before they inform architecture changes (literature-first construction).
- New entries should say whether they are likely adopt, benchmark first, or probably low relevance.
- Cross-reference with reference/evidence.md for roadmap item mappings.