tab-foundry

Top-level repo overview, docs routing, and quickstart.

A tabular foundation model that generates the data it learns from.

License: Apache-2.0 Python 3.14 Docs

Most tabular foundation models learn from a fixed corpus and stop. You get a .predict() call and a benchmark number, but no control over the data the model trained on, the architecture it uses, or the training loop that produced it.

tab-foundry takes a different approach. It uses dagzoo to generate synthetic tabular datasets, trains on them with an active sandwich lane alongside a frozen PFN control and a historical staged reference lane, benchmarks against real-world tasks, and exports inference bundles you can deploy. You control the full pipeline: what data gets generated, which model surface is active, how training runs, and what gets exported.

Read The Docs

The published docs are the fastest route to workflows, architecture, and research context: bensonlee5.github.io/tab-foundry.

How It Works

```mermaid
graph LR
    A[dagzoo<br><i>generate</i>] --> B[manifest<br><i>prepare</i>]
    B --> C[model family<br><i>train</i>]
    C --> D[benchmark<br><i>evaluate</i>]
    D --> E[export<br><i>bundle</i>]
    D -.->|curriculum feedback<br>planned| A

    classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
    classDef planned fill:#f8f9fa,stroke:#adb5bd,stroke-width:1px,stroke-dasharray:5 5
```
  1. Generate synthetic tabular datasets with dagzoo, or bring your own real-data manifests
  2. Train the active sandwich lane while preserving frozen and historical comparison surfaces
  3. Benchmark against pinned OpenML evaluation manifests with tracked source-bundle provenance and baselines
  4. Export inference bundles for downstream deployment
  5. (Planned) Close the loop: the model tells dagzoo what harder data it needs next
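The planned step 5 can be sketched as a simple control loop. This is an illustrative toy, not the dagzoo or tab-foundry API: `generate`, `evaluate`, and the difficulty knob are stand-ins for whatever interface the curriculum feedback eventually exposes.

```python
import random

def generate(difficulty: float, n: int = 8) -> list[float]:
    """Stand-in for dagzoo generation: harder regimes -> wider value spread."""
    return [random.gauss(0.0, 1.0 + difficulty) for _ in range(n)]

def evaluate(model_scale: float, batch: list[float]) -> float:
    """Stand-in loss: large-magnitude samples are harder for a fixed model."""
    return sum(abs(x) for x in batch) / (model_scale * len(batch))

def curriculum_loop(rounds: int = 4, target_loss: float = 1.0) -> list[float]:
    """Raise generation difficulty whenever the model finds the regime easy."""
    random.seed(0)
    difficulty, history = 0.0, []
    for _ in range(rounds):
        loss = evaluate(model_scale=2.0, batch=generate(difficulty))
        history.append(loss)
        if loss < target_loss:   # model is comfortable at this difficulty:
            difficulty += 0.5    # ask the generator for harder data next round
    return history
```

The point is only the shape of the loop: evaluation drives the next round's generation request, closing the dashed edge in the diagram above.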

Quick Start

```bash
# Clone and bootstrap
git clone https://github.com/bensonlee5/tab-foundry.git
cd tab-foundry
./scripts/dev bootstrap

# Run a smoke training loop
.venv/bin/tab-foundry train run experiment=cls_smoke

# Evaluate the checkpoint
.venv/bin/tab-foundry eval checkpoint \
  --checkpoint outputs/cls_smoke/checkpoints/best.pt \
  experiment=cls_smoke
```

For setup details and runbook examples, see docs/workflows.md and CONTRIBUTING.md. For the docs-first view of the same paths, start at the published docs site.

Python 3.14 is the pinned runtime for this repo, and the standard local setup assumes a repo-local .venv.

Workflow Surfaces

tab-foundry is the canonical packaged CLI. Use ./scripts/dev as the fast repo-local path for bootstrap, verification, and Iris smoke; keep scripts/bench/ reserved for narrow internal benchmark helper workflows.

Manifest build, inspect, and read ownership lives upstream in tab-realdata-hub. In this repo, the parquet manifest is treated as the stable index layer and the richer per-dataset semantics live in metadata.ndjson; tab-foundry consumes that contract and does not define a parallel manifest parser.
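That split (parquet as the stable index, metadata.ndjson for richer per-dataset semantics) can be sketched as a join keyed on a dataset id. The field names below are assumptions for illustration, not the tab-realdata-hub schema, and the parquet read is elided in favor of plain dicts:

```python
import io
import json

# Illustrative metadata.ndjson content: one JSON object per line, keyed by a
# dataset id that also appears in the parquet index (field names assumed).
NDJSON = """\
{"dataset_id": "ds-001", "task": "binary", "feature_types": ["num", "cat"]}
{"dataset_id": "ds-002", "task": "multiclass", "feature_types": ["num"]}
"""

# Rows as they might come out of the parquet manifest index (read elided).
index_rows = [
    {"dataset_id": "ds-001", "path": "corpora/ds-001.parquet"},
    {"dataset_id": "ds-002", "path": "corpora/ds-002.parquet"},
]

def join_manifest(index_rows, ndjson_stream):
    """Attach per-dataset semantics from metadata.ndjson onto index rows."""
    meta = {rec["dataset_id"]: rec
            for rec in (json.loads(line) for line in ndjson_stream if line.strip())}
    return [{**row, **meta.get(row["dataset_id"], {})} for row in index_rows]

joined = join_manifest(index_rows, io.StringIO(NDJSON))
```

A consumer like tab-foundry only reads this contract; it never re-parses or redefines the manifest layout.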

| Surface | Use it for |
| --- | --- |
| tab-foundry | Canonical packaged CLI for data, training, evaluation, export, benchmark, and research workflows. |
| ./scripts/dev | Fast repo-local bootstrap, doctor, review, verification, and Iris smoke flows. |
| scripts/bench/ | Standalone internal benchmark helper entrypoints that stay outside the packaged CLI. |

Use --help in this order:

  1. .venv/bin/tab-foundry --help
  2. .venv/bin/tab-foundry <group> --help
  3. .venv/bin/tab-foundry <group> <command> --help

| Namespace | Purpose | Read next |
| --- | --- | --- |
| data | Corpus recipes, corpus materialization, and manifest inspection. | docs/workflows.md |
| dev | Fast inspection and verification surfaces for local development. | docs/workflows.md |
| train, eval, export | Manifest-backed training, checkpoint evaluation, and inference-bundle workflows. | docs/workflows.md |
| bench | Smoke harnesses, benchmark comparisons, and baseline-registry flows. | docs/workflows.md |
| research | Sweep queues, inspection, execution, and sweep-aware corpus materialization. | program.md |

Use docs/workflows.md for representative commands and docs/development/codebase-navigation.md for package ownership and entry points.

What Makes This Different

  • Full pipeline control. Data generation, architecture selection, training, benchmarking, and export are all in one repo. You own the entire stack, not just the prediction API.

  • Synthetic data engine. Dagzoo generates tabular datasets with controlled shape, complexity, and regime coverage. You decide what the model trains on rather than hoping a fixed corpus covers your use case.

  • Modular architecture family. The repo keeps an active sandwich lane, a frozen PFN control, and a historical staged reference surface so subsystems and regimes can be compared without losing attribution.

What Works Today

  • Active sandwich architecture, with a frozen nanoTabPFN-style control lane for trusted comparison and a historical staged family retained as a reference surface
  • Dagzoo integration for synthetic corpus generation, manifests, and materialization
  • OpenML benchmarking against manifest-backed binary and multiclass evaluation surfaces with a tracked benchmark registry
  • Research sweep framework for systematic architecture and data-surface experiments with full attribution
  • Export pipeline for packaging inference bundles
  • Evidence-backed decisions: every architecture choice has a pinned benchmark, sweep result, and research card

What We’re Building

  • Active learning loop where the model requests harder synthetic data from dagzoo based on its weaknesses
  • Pluggable data sources with a unified interface for synthetic and real datasets
  • Curriculum control so users can design training progressions instead of filtering data
  • Distributed training across checkpoints contributed by different users
  • Perpetually evolving model that improves as the community contributes compute and data

Architecture at a Glance

The active development family (tabfoundry_sandwich) is a fixed-latent hybrid full-cell / summary-stream Perceiver classifier:

```mermaid
graph TD
    A[input table] --> B[shared normalization +<br>cell tokenizer]
    B --> C[full cell stream]
    B --> D[row + column<br>summary streams]
    C --> E[stage 0 latent read]
    D --> F[later latent refinement]
    E --> F
    F --> G[test-row readout]
    G --> H[class head]

    classDef default fill:#f8f9fa,stroke:#495057,stroke-width:1.5px,color:#212529
```

A frozen nanoTabPFN control lane (tabfoundry_simple) preserves benchmark comparability, and tabfoundry_staged remains loadable as the historical reference family. For the full architecture reference, see docs/development/model-architecture.md.
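The "latent read" steps in the diagram are Perceiver-style cross-attention: a small fixed set of latents attends over all cell tokens. Here is a minimal dependency-free sketch with identity Q/K/V projections and a single head; a real block learns those projections, and the dimensions are arbitrary:

```python
import math

def cross_attend(latents, tokens):
    """One Perceiver-style read: each latent attends over all cell tokens.

    latents: L x d, tokens: N x d, as plain nested lists. Projections are
    identity for brevity; output keeps the fixed latent shape L x d.
    """
    d = len(latents[0])
    out = []
    for q in latents:
        # scaled dot-product scores against every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        m = max(scores)                      # stabilize the softmax
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        weights = [e / z for e in exp]
        # convex combination of token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

# 2 latents reading 3 cell tokens of width 4: output keeps the latent shape.
read = cross_attend([[0.1] * 4, [0.2] * 4],
                    [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
```

Because the latent count is fixed, compute after the read no longer scales with table size, which is the point of the fixed-latent design.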

At the cell level, the active sandwich lane uses a missingness-aware tokenizer over value, is_nan, is_posinf, and is_neginf, then applies a shared value projection with feature-type conditioning plus Fourier row and column enrichment before the task-level attention stack.
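The four tokenizer channels named above can be made concrete with a small sketch. The exact handling of the value channel for non-finite cells is an assumption here (zeroed so the flags, not a propagated NaN/inf, carry the information):

```python
import math

def cell_channels(value):
    """Encode one cell as (value, is_nan, is_posinf, is_neginf) channels."""
    is_nan = isinstance(value, float) and math.isnan(value)
    is_posinf = value == math.inf
    is_neginf = value == -math.inf
    # Assumption: non-finite values are zeroed in the value channel.
    finite = 0.0 if (is_nan or is_posinf or is_neginf) else float(value)
    return (finite, float(is_nan), float(is_posinf), float(is_neginf))
```

For example, `cell_channels(float("nan"))` yields a zero value channel with only the `is_nan` flag set, so missingness survives normalization instead of poisoning it.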

The current architecture-development lane is classification-only. By default, training is ranked by matched-budget final log loss (final_log_loss_at_matched_regime_budget). When training.loss_surface=cell_bpc, the objective switches to matched-budget final BPC (final_bpc_at_matched_regime_budget).
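The two ranking metrics differ only in units, assuming the usual nats-to-bits conversion: BPC is the same negative log-likelihood rescaled by ln 2. The matched-budget bookkeeping itself is repo-specific and not shown here:

```python
import math

def final_log_loss(probs_for_true_class):
    """Mean negative log-likelihood in nats over the final-eval predictions."""
    return (-sum(math.log(p) for p in probs_for_true_class)
            / len(probs_for_true_class))

def to_bpc(log_loss_nats):
    """Same objective expressed in bits (bits-per-cell style units)."""
    return log_loss_nats / math.log(2)

# A uniform guess over 4 classes scores log(4) nats, i.e. exactly 2 bits.
ll = final_log_loss([0.25, 0.25, 0.25])
```

So switching `training.loss_surface=cell_bpc` changes what quantity is reported and ranked, not the underlying likelihood being optimized, under this reading.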

Find Your Path

| If you want to… | Start here | Then go deeper |
| --- | --- | --- |
| Understand the current model surface | Model architecture | Roadmap |
| Run research sweeps | Research program | Workflows |
| Work on artifacts or infra | Workflows | Inference & export |
| Change package wiring or entry points | Codebase navigation | Contributing |

License and Contributing

tab-foundry is released under the Apache License 2.0.

Contributions are welcome. See CONTRIBUTING.md for the development workflow, code standards, and how to run the test suite.