# tab-foundry
A tabular foundation model that generates the data it learns from.
Most tabular foundation models learn from a fixed corpus and stop. You get a `.predict()` call and a benchmark number, but no control over the data the model trained on, the architecture it uses, or the training loop that produced it.
tab-foundry takes a different approach. It uses dagzoo to generate synthetic tabular datasets, trains them with an active sandwich lane plus a frozen PFN control and historical staged reference lane, benchmarks against real-world tasks, and exports inference bundles you can deploy. You control the full pipeline: what data gets generated, which model surface is active, how training runs, and what gets exported.
## Read the Docs
The published docs are the fastest route to workflows, architecture, and research context: bensonlee5.github.io/tab-foundry.
- New to the repo: Getting Started
- Working on artifacts, runs, export, or verification: ML Engineering
- Working on architecture, sweeps, or synthetic data: Research
## How It Works
```mermaid
graph LR
  A[dagzoo<br><i>generate</i>] --> B[manifest<br><i>prepare</i>]
  B --> C[model family<br><i>train</i>]
  C --> D[benchmark<br><i>evaluate</i>]
  D --> E[export<br><i>bundle</i>]
  D -.->|curriculum feedback<br>planned| A
  classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
  classDef planned fill:#f8f9fa,stroke:#adb5bd,stroke-width:1px,stroke-dasharray:5 5
```

- Generate synthetic tabular datasets with dagzoo, or bring your own real-data manifests
- Train the active sandwich lane while preserving frozen and historical comparison surfaces
- Benchmark against pinned OpenML evaluation manifests with tracked source-bundle provenance and baselines
- Export inference bundles for downstream deployment
- (Planned) Close the loop: the model tells dagzoo what harder data it needs next
## Quick Start
```bash
# Clone and bootstrap
git clone https://github.com/bensonlee5/tab-foundry.git
cd tab-foundry
./scripts/dev bootstrap

# Run a smoke training loop
.venv/bin/tab-foundry train run experiment=cls_smoke

# Evaluate the checkpoint
.venv/bin/tab-foundry eval checkpoint \
  --checkpoint outputs/cls_smoke/checkpoints/best.pt \
  experiment=cls_smoke
```
For setup details and runbook examples, see docs/workflows.md and CONTRIBUTING.md. For the docs-first view of the same paths, start at the published docs site.
Python 3.14 is the pinned runtime for this repo, and the standard local setup assumes a repo-local `.venv`.
## Workflow Surfaces
`tab-foundry` is the canonical packaged CLI. Use `./scripts/dev` as the fast repo-local path for bootstrap, verification, and Iris smoke; keep `scripts/bench/` reserved for narrow internal benchmark helper workflows.
Manifest build, inspect, and read ownership lives upstream in tab-realdata-hub. In this repo, the parquet manifest is treated as the stable index layer and the richer per-dataset semantics live in `metadata.ndjson`; tab-foundry consumes that contract and does not define a parallel manifest parser.
| Surface | Use it for |
|---|---|
| `tab-foundry` | Canonical packaged CLI for data, training, evaluation, export, benchmark, and research workflows. |
| `./scripts/dev` | Fast repo-local bootstrap, doctor, review, verification, and Iris smoke flows. |
| `scripts/bench/` | Standalone internal benchmark helper entrypoints that stay outside the packaged CLI. |
Use `--help` in this order:

```bash
.venv/bin/tab-foundry --help
.venv/bin/tab-foundry <group> --help
.venv/bin/tab-foundry <group> <command> --help
```
| Namespace | Purpose | Read next |
|---|---|---|
| `data` | Corpus recipes, corpus materialization, and manifest inspection. | docs/workflows.md |
| `dev` | Fast inspection and verification surfaces for local development. | docs/workflows.md |
| `train`, `eval`, `export` | Manifest-backed training, checkpoint evaluation, and inference-bundle workflows. | docs/workflows.md |
| `bench` | Smoke harnesses, benchmark comparisons, and baseline-registry flows. | docs/workflows.md |
| `research` | Sweep queues, inspection, execution, and sweep-aware corpus materialization. | program.md |
Use docs/workflows.md for representative commands and docs/development/codebase-navigation.md for package ownership and entry points.
## What Makes This Different
**Full pipeline control.** Data generation, architecture selection, training, benchmarking, and export are all in one repo. You own the entire stack, not just the prediction API.

**Synthetic data engine.** dagzoo generates tabular datasets with controlled shape, complexity, and regime coverage. You decide what the model trains on rather than hoping a fixed corpus covers your use case.

**Modular architecture family.** The repo keeps an active sandwich lane, a frozen PFN control, and a historical staged reference surface so subsystems and regimes can be compared without losing attribution.
## What Works Today
- Active sandwich architecture, with a frozen nanoTabPFN-style control lane for trusted comparison and a historical staged family retained as a reference surface
- Dagzoo integration for synthetic corpus generation, manifests, and materialization
- OpenML benchmarking against manifest-backed binary and multiclass evaluation surfaces with a tracked benchmark registry
- Research sweep framework for systematic architecture and data-surface experiments with full attribution
- Export pipeline for packaging inference bundles
- Evidence-backed decisions: every architecture choice has a pinned benchmark, sweep result, and research card
## What We’re Building
- Active learning loop where the model requests harder synthetic data from dagzoo based on its weaknesses
- Pluggable data sources with a unified interface for synthetic and real datasets
- Curriculum control so users can design training progressions instead of filtering data
- Distributed training across checkpoints contributed by different users
- Perpetually evolving model that improves as the community contributes compute and data
## Architecture at a Glance
The active development family (`tabfoundry_sandwich`) is a fixed-latent hybrid full-cell / summary-stream Perceiver classifier:
```mermaid
graph TD
  A[input table] --> B[shared normalization +<br>cell tokenizer]
  B --> C[full cell stream]
  B --> D[row + column<br>summary streams]
  C --> E[stage 0 latent read]
  D --> F[later latent refinement]
  E --> F
  F --> G[test-row readout]
  G --> H[class head]
  classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
```

A frozen nanoTabPFN control lane (`tabfoundry_simple`) preserves benchmark comparability, and `tabfoundry_staged` remains loadable as the historical reference family. For the full architecture reference, see docs/development/model-architecture.md.
At the cell level, the active sandwich lane uses a missingness-aware tokenizer over `value`, `is_nan`, `is_posinf`, and `is_neginf`, then applies a shared value projection with feature-type conditioning plus Fourier row and column enrichment before the task-level attention stack.
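As a rough illustration of the cell layout described above (a sketch only: the real tokenizer adds a learned value projection and feature-type conditioning, and its exact padding conventions live in the model code), each raw cell can be mapped to a `(value, is_nan, is_posinf, is_neginf)` feature vector:

```python
import math

def tokenize_cell(x: float) -> list[float]:
    """Map one raw cell to a missingness-aware feature vector.

    Hypothetical sketch of the (value, is_nan, is_posinf, is_neginf)
    layout; not the repo's actual tokenizer implementation.
    """
    is_nan = math.isnan(x)
    is_posinf = x == math.inf
    is_neginf = x == -math.inf
    # Zero out non-finite values so the indicator flags carry the signal.
    value = x if math.isfinite(x) else 0.0
    return [value, float(is_nan), float(is_posinf), float(is_neginf)]

print(tokenize_cell(3.5))            # [3.5, 0.0, 0.0, 0.0]
print(tokenize_cell(float("nan")))   # [0.0, 1.0, 0.0, 0.0]
print(tokenize_cell(float("-inf")))  # [0.0, 0.0, 0.0, 1.0]
```

Encoding missingness as explicit flags, rather than imputing a sentinel value, lets downstream attention treat NaN and infinity as information rather than noise.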
The current architecture-development lane is classification-only. By default, training is ranked by matched-budget final log loss (`final_log_loss_at_matched_regime_budget`). When `training.loss_surface=cell_bpc`, the objective switches to matched-budget final BPC (`final_bpc_at_matched_regime_budget`).
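For intuition on the two metric conventions, here is a minimal sketch assuming the usual definitions: log loss as mean negative log-likelihood in nats, and BPC as the same cross-entropy expressed in bits (base-2 logs). The matched-budget ranking logic itself lives in the training code and is not reproduced here.

```python
import math

def log_loss(true_class_probs: list[float]) -> float:
    """Mean negative log-likelihood in nats over the true-class probabilities."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

def bits_per_cell(true_class_probs: list[float]) -> float:
    """The same cross-entropy in bits: divide nats by ln(2)."""
    return log_loss(true_class_probs) / math.log(2)

# Probabilities a model assigned to the correct outcome on three examples:
p = [0.9, 0.5, 0.25]
print(round(log_loss(p), 4))       # nats
print(round(bits_per_cell(p), 4))  # bits
```

A uniform coin-flip prediction (`p = 0.5`) scores exactly 1.0 bit, which makes BPC convenient as a compression-style yardstick for cell-level objectives.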
## Find Your Path
| If you want to… | Start here | Then go deeper |
|---|---|---|
| Understand the current model surface | Model architecture | Roadmap |
| Run research sweeps | Research program | Workflows |
| Work on artifacts or infra | Workflows | Inference & export |
| Change package wiring or entry points | Codebase navigation | Contributing |
## Resources
- Published docs site for the fastest route to workflows, architecture, and research context
- Problem formulation for the mathematical statement of the dagzoo prior and sandwich training objectives
- Roadmap for what’s active, planned, and completed
- Architecture reference for the full model surface
- Workflows for exact command syntax and artifact expectations
- Glossary for shared vocabulary
## License and Contributing
tab-foundry is released under the Apache License 2.0.
Contributions are welcome. See CONTRIBUTING.md for the development workflow, code standards, and how to run the test suite.