Start

Install dagzoo, inspect curated recipes, and generate the first reproducible run.

This is the fastest path from install to usable synthetic tabular data.

The public entrypoint is the curated recipe catalog. Start with recipe:<name> when you want something reproducible, discoverable, and easy to cite. Move to repo-local configs/ only when you need custom authoring beyond the published recipes. When you want to share a generated corpus, prefer --handoff-root over a plain --out directory because the handoff layout is the stable downstream publish surface.


1. Install

Packaged install:

uv tool install dagzoo

Repo checkout:

./scripts/dev bootstrap
source .venv/bin/activate

2. Inspect the recipe catalog

dagzoo recipe list

That command prints the stable recipe names and the regime each one is meant to approximate or stress. The catalog is the default adoption layer for dagzoo. All shipped recipes use the same public prior shape: sample a latent DAG, emit features from node-assigned converters, emit y from one selected latent node, then optionally apply missingness as a separate observation process over the emitted features.
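The shared prior shape above can be sketched in a few lines. This is an illustrative numpy toy, not dagzoo's implementation: the real prior, converter set, and missingness model are richer, and every name below (`sample_latent_dag`, the converter list, the 0.1 missingness rate) is an assumption made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_dag(n_nodes, rng):
    # Strictly upper-triangular adjacency -> acyclic by construction.
    return np.triu(rng.random((n_nodes, n_nodes)) < 0.4, k=1)

def sample_latents(adj, n_rows, rng):
    # Column order is a topological order for an upper-triangular DAG.
    n = adj.shape[0]
    z = np.zeros((n_rows, n))
    for j in range(n):
        parents = np.nonzero(adj[:, j])[0]
        signal = z[:, parents] @ rng.normal(size=len(parents)) if len(parents) else 0.0
        z[:, j] = signal + rng.normal(size=n_rows)
    return z

n_nodes, n_rows = 6, 200
adj = sample_latent_dag(n_nodes, rng)   # 1. sample a latent DAG
z = sample_latents(adj, n_rows, rng)

# 2. emit features from node-assigned converters (toy converter set).
converters = [np.tanh, np.sign, lambda v: v]
feat_nodes = rng.choice(n_nodes, size=4, replace=True)
X = np.stack(
    [converters[i % len(converters)](z[:, node]) for i, node in enumerate(feat_nodes)],
    axis=1,
)

# 3. emit y from one selected latent node.
y = z[:, rng.integers(n_nodes)]

# 4. optional missingness as a separate observation process over features.
mask = rng.random(X.shape) < 0.1
X_obs = np.where(mask, np.nan, X)
```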


3. Generate your first run

Balanced baseline with the default latent-node target prior:

dagzoo generate --config recipe:default-baseline --num-datasets 25 --out data/default_baseline

TabPFN-inspired numeric-heavy latent-node prior:

dagzoo generate --config recipe:tabpfn-v1-prior-approx --num-datasets 25 --out data/tabpfn_prior

Every generate run writes:

  • effective_config.yaml
  • effective_config_trace.yaml

dagzoo generate only generates. If you want accept/reject decisions, run dagzoo filter as a separate replay stage over the emitted shards. The same filter thresholds also guide generation-time structural retries. Observational generation is the default. Hard interventions are opt-in through intervention.mode: hard_interventional; when that section is absent, public artifacts omit intervention metadata entirely.
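To opt in to hard interventions, a config would carry an intervention section like the fragment below. Only the intervention.mode key and its hard_interventional value come from the text above; the comment restates the default behavior.

```yaml
# Opt in to hard interventions. Omit this section entirely for
# observational generation (the default); when it is absent, public
# artifacts carry no intervention metadata.
intervention:
  mode: hard_interventional
```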


4. Use the same recipe in-process

from dagzoo import build_dataloader

loader = build_dataloader(
    "recipe:default-baseline",
    num_datasets=10,
    seed=7,
    device="cpu",
)
sample = next(iter(loader))

build_dataloader(...) is the recommended programmatic entrypoint. It accepts the same config surface as the CLI: either recipe:<name> or a YAML path. Public generation defaults to fully heterogeneous per-dataset layouts; set runtime.layout_mode: stratified when you want a large heterogeneous run to batch compatible (n_rows, n_features) strata without forcing a shared layout. On Apple hardware, device=auto prefers CPU over MPS for both heterogeneous and stratified generation; pass device="mps" explicitly when you want the MPS backend.
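As a config fragment, the stratified opt-in looks like the sketch below. The runtime.layout_mode key and its stratified value are taken from the text above; placement under a top-level runtime section follows the key's dotted path.

```yaml
# Batch compatible (n_rows, n_features) strata in large heterogeneous
# runs without forcing a shared layout across datasets.
runtime:
  layout_mode: stratified
```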


5. Publish to Hugging Face Hub

Generate a handoff root when you want a portable corpus layout that can be published directly to Hugging Face Hub:

dagzoo generate \
  --config recipe:default-baseline \
  --num-datasets 25 \
  --handoff-root handoffs/default_baseline

hf auth login
dagzoo publish hub \
  --handoff-root handoffs/default_baseline \
  --repo-id your-name/default-baseline-corpus

Only the public handoff artifacts are uploaded. Local internal/ sidecars stay on disk.

Detailed guide: publish-hub.md


6. Where to go next