# tab-foundry
A tabular foundation model that generates the data it learns from.
Most tabular foundation models learn from a fixed corpus and stop. You get a `.predict()` call and a benchmark number, but no control over the data the model trained on, the architecture it uses, or the training loop that produced it.
tab-foundry takes a different approach. It uses dagzoo to generate synthetic tabular datasets, trains them with an active sandwich lane plus a frozen PFN control and historical staged reference lane, benchmarks against real-world tasks, and exports inference bundles you can deploy. You control the full pipeline: what data gets generated, which model surface is active, how training runs, and what gets exported.
## Read the Docs
The published docs are the fastest route to workflows, architecture, and research context: bensonlee5.github.io/tab-foundry.
- New to the repo: Getting Started
- Working on artifacts, runs, export, or verification: ML Engineering
- Working on architecture, sweeps, or synthetic data: Research
## How It Works
```mermaid
graph LR
  A[dagzoo<br><i>generate</i>] --> B[manifest<br><i>prepare</i>]
  B --> C[model family<br><i>train</i>]
  C --> D[benchmark<br><i>evaluate</i>]
  D --> E[export<br><i>bundle</i>]
  D -.->|curriculum feedback<br>planned| A
  classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
  classDef planned fill:#f8f9fa,stroke:#adb5bd,stroke-width:1px,stroke-dasharray:5 5
```

- Generate synthetic tabular datasets with dagzoo, or bring your own real-data manifests
- Train the active sandwich lane while preserving frozen and historical comparison surfaces
- Benchmark against pinned OpenML evaluation manifests with tracked source-bundle provenance and baselines
- Export inference bundles for downstream deployment
- (Planned) Close the loop: the model tells dagzoo what harder data it needs next
## Quick Start
```bash
# Clone and bootstrap
git clone https://github.com/bensonlee5/tab-foundry.git
cd tab-foundry
./scripts/dev bootstrap

# Run a smoke training loop
.venv/bin/tab-foundry train run experiment=cls_smoke

# Evaluate the checkpoint
.venv/bin/tab-foundry eval checkpoint \
  --checkpoint outputs/cls_smoke/checkpoints/best.pt \
  experiment=cls_smoke
```
For setup details and runbook examples, see docs/workflows.md and CONTRIBUTING.md. For the docs-first view of the same paths, start at the published docs site.
Python 3.14 is the pinned runtime for this repo, and the standard local setup assumes a repo-local `.venv`.
## Workflow Surfaces
`tab-foundry` is the canonical packaged CLI. Use `./scripts/dev` as the fast repo-local path for bootstrap, verification, and Iris smoke; keep `scripts/bench/` reserved for narrow internal benchmark helper workflows.
Manifest build, inspect, and read ownership lives upstream in tab-realdata-hub. In this repo, the parquet manifest is treated as the stable index layer and the richer per-dataset semantics live in `metadata.ndjson`; tab-foundry consumes that contract and does not define a parallel manifest parser.
| Surface | Use it for |
|---|---|
| `tab-foundry` | Canonical packaged CLI for data, training, evaluation, export, benchmark, and research workflows. |
| `./scripts/dev` | Fast repo-local bootstrap, doctor, review, verification, and Iris smoke flows. |
| `scripts/bench/` | Standalone internal benchmark helper entrypoints that stay outside the packaged CLI. |
Use `--help` in this order:

```bash
.venv/bin/tab-foundry --help
.venv/bin/tab-foundry <group> --help
.venv/bin/tab-foundry <group> <command> --help
```
| Namespace | Purpose | Read next |
|---|---|---|
| `data` | Corpus recipes, corpus materialization, and manifest inspection. | docs/workflows.md |
| `dev` | Fast inspection and verification surfaces for local development. | docs/workflows.md |
| `train`, `eval`, `export` | Manifest-backed training, checkpoint evaluation, and inference-bundle workflows. | docs/workflows.md |
| `bench` | Smoke harnesses, benchmark comparisons, and baseline-registry flows. | docs/workflows.md |
| `research` | Sweep queues, inspection, execution, and sweep-aware corpus materialization. | program.md |
Use docs/workflows.md for representative commands and docs/development/codebase-navigation.md for package ownership and entry points.
## What Makes This Different
**Full pipeline control.** Data generation, architecture selection, training, benchmarking, and export are all in one repo. You own the entire stack, not just the prediction API.

**Synthetic data engine.** dagzoo generates tabular datasets with controlled shape, complexity, and regime coverage. You decide what the model trains on rather than hoping a fixed corpus covers your use case.

**Modular architecture family.** The repo keeps an active sandwich lane, a frozen PFN control, and a historical staged reference surface so subsystems and regimes can be compared without losing attribution.
## What Works Today
- Active sandwich architecture, with a frozen nanoTabPFN-style control lane for trusted comparison and a historical staged family retained as a reference surface
- Dagzoo integration for synthetic corpus generation, manifests, and materialization
- OpenML benchmarking against manifest-backed binary and multiclass evaluation surfaces with a tracked benchmark registry
- Research sweep framework for systematic architecture and data-surface experiments with full attribution
- Export pipeline for packaging inference bundles
- Evidence-backed decisions: every architecture choice has a pinned benchmark, sweep result, and research card
## What We’re Building
- Active learning loop where the model requests harder synthetic data from dagzoo based on its weaknesses
- Pluggable data sources with a unified interface for synthetic and real datasets
- Curriculum control so users can design training progressions instead of filtering data
- Distributed training across checkpoints contributed by different users
- Perpetually evolving model that improves as the community contributes compute and data
## Architecture at a Glance
The active development family (`tabfoundry_sandwich`) is a fixed-latent hybrid full-cell / summary-stream Perceiver classifier:
```mermaid
graph TD
  A[input table] --> B[shared normalization +<br>cell tokenizer]
  B --> C[full cell stream]
  B --> D[row + column<br>summary streams]
  C --> E[stage 0 latent read]
  D --> F[later latent refinement]
  E --> F
  F --> G[test-row readout]
  G --> H[class head]
  classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
```

A frozen nanoTabPFN control lane (`tabfoundry_simple`) preserves benchmark comparability, and `tabfoundry_staged` remains loadable as the historical reference family. For the full architecture reference, see docs/development/model-architecture.md.
At the cell level, the active sandwich lane uses a missingness-aware tokenizer over `value`, `is_nan`, `is_posinf`, and `is_neginf`, then applies a shared value projection with feature-type conditioning plus Fourier row and column enrichment before the task-level attention stack.
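As a rough illustration of the cell layout described above (a sketch only: the real tokenizer adds a learned value projection and feature-type conditioning, and its exact padding conventions live in the model code), each raw cell can be mapped to a `(value, is_nan, is_posinf, is_neginf)` feature vector:

```python
import math

def tokenize_cell(x: float) -> list[float]:
    """Map one raw cell to a missingness-aware feature vector.

    Hypothetical sketch of the (value, is_nan, is_posinf, is_neginf)
    layout; not the repo's actual tokenizer implementation.
    """
    is_nan = math.isnan(x)
    is_posinf = x == math.inf
    is_neginf = x == -math.inf
    # Zero out non-finite values so the indicator flags carry the signal.
    value = x if math.isfinite(x) else 0.0
    return [value, float(is_nan), float(is_posinf), float(is_neginf)]

print(tokenize_cell(3.5))            # [3.5, 0.0, 0.0, 0.0]
print(tokenize_cell(float("nan")))   # [0.0, 1.0, 0.0, 0.0]
print(tokenize_cell(float("-inf")))  # [0.0, 0.0, 0.0, 1.0]
```

Encoding missingness as explicit flags, rather than imputing a sentinel value, lets downstream attention treat NaN and infinity as information rather than noise.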
The current architecture-development lane is classification-only. By default, training is ranked by matched-budget final log loss (`final_log_loss_at_matched_regime_budget`). When `training.loss_surface=cell_bpc`, the objective switches to matched-budget final BPC (`final_bpc_at_matched_regime_budget`).
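For intuition on the two metric conventions, here is a minimal sketch assuming the usual definitions: log loss as mean negative log-likelihood in nats, and BPC as the same cross-entropy expressed in bits (base-2 logs). The matched-budget ranking logic itself lives in the training code and is not reproduced here.

```python
import math

def log_loss(true_class_probs: list[float]) -> float:
    """Mean negative log-likelihood in nats over the true-class probabilities."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

def bits_per_cell(true_class_probs: list[float]) -> float:
    """The same cross-entropy in bits: divide nats by ln(2)."""
    return log_loss(true_class_probs) / math.log(2)

# Probabilities a model assigned to the correct outcome on three examples:
p = [0.9, 0.5, 0.25]
print(round(log_loss(p), 4))       # nats
print(round(bits_per_cell(p), 4))  # bits
```

A uniform coin-flip prediction (`p = 0.5`) scores exactly 1.0 bit, which makes BPC convenient as a compression-style yardstick for cell-level objectives.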
## Find Your Path
| If you want to… | Start here | Then go deeper |
|---|---|---|
| Understand the current model surface | Model architecture | Roadmap |
| Run research sweeps | Research program | Workflows |
| Work on artifacts or infra | Workflows | Inference & export |
| Change package wiring or entry points | Codebase navigation | Contributing |
## Resources
- Published docs site for the fastest route to workflows, architecture, and research context
- Problem formulation for the mathematical statement of the dagzoo prior and sandwich training objectives
- Roadmap for what’s active, planned, and completed
- Architecture reference for the full model surface
- Workflows for exact command syntax and artifact expectations
- Glossary for shared vocabulary
## License and Contributing
tab-foundry is released under the Apache License 2.0.
Contributions are welcome. See CONTRIBUTING.md for the development workflow, code standards, and how to run the test suite.