How It Works
This guide explains dagzoo end-to-end with enough detail to reason
about behavior without needing to jump between many documents.
Who this is for
- End users running `dagzoo generate` and `dagzoo benchmark`
- Contributors building a mental model before reading implementation files
Mental model in 90 seconds
dagzoo synthesizes tabular datasets by sampling causal structure,
executing randomized mechanisms over that structure, and enforcing
quality and realism controls.
- Resolve config and hardware context for the command.
- Derive deterministic seeds for run, dataset, and component scopes.
- Sample a dataset layout (feature types, assignments, graph size bounds).
- Sample a DAG and node assignments.
- Execute node pipelines in topological order to produce latent outputs.
- Convert latent outputs into observable `X` and `y`.
- Apply split checks, postprocess transforms, and optional missingness.
- Emit `DatasetBundle` outputs; optionally persist shards and diagnostics.
- Optionally run `dagzoo filter` as a deferred acceptance stage over shards.
Core Concepts
1. Causal DAG vs tabular layout
The generation graph is a latent DAG, while emitted columns are a tabular projection of that latent graph.
- Latent nodes represent abstract causal variables.
- Feature/target columns are assigned to nodes by sampled layout state.
- Multiple columns can map to one node, and one node can influence many columns.
- This decoupling allows rich causal interactions while preserving a clean acyclic execution graph.
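A minimal sketch of this decoupling, using the same nodes and columns as the diagram in this section. The dictionaries and helper names (`latent_edges`, `column_to_node`, `influenced_columns`) are illustrative assumptions, not dagzoo data structures.

```python
# Illustrative only: a latent DAG plus a separate column-to-node layout.
# All names here are assumptions made for this sketch.
from collections import defaultdict

# Latent DAG as parent -> children edges (A -> B, A -> C).
latent_edges = {"A": ["B", "C"], "B": [], "C": []}

# Tabular layout: each emitted column is assigned to one latent node.
column_to_node = {
    "feature_0": "A",
    "feature_1": "A",
    "feature_2": "B",
    "target": "C",
}

# Invert the layout to see which columns each node emits.
node_to_columns = defaultdict(list)
for column, node in column_to_node.items():
    node_to_columns[node].append(column)


def descendants(node):
    """Return every latent node reachable from `node`."""
    seen, stack = set(), list(latent_edges[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(latent_edges[current])
    return seen


def influenced_columns(node):
    """Columns emitted by `node` or by any of its descendants."""
    nodes = {node} | descendants(node)
    return sorted(c for c, n in column_to_node.items() if n in nodes)


print(dict(node_to_columns))    # several columns can share one node
print(influenced_columns("A"))  # one node can influence many columns
```

The graph stays acyclic because columns never appear in it; they only point at nodes.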
flowchart LR
%% Class Definitions
classDef latent fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b,stroke-dasharray: 5 5
classDef observable fill:#f5f5f5,stroke:#212121,stroke-width:2px,color:#212121
subgraph LatentSpace [Latent Causal DAG]
A((Node A)) --> B((Node B))
A --> C((Node C))
end
subgraph ObservableSpace [Tabular Layout]
F1[feature_0 num]
F2[feature_1 cat]
F3[feature_2 num]
T[target]
end
%% Mapping connections
A -. mapping .-> F1
A -. mapping .-> F2
B -. mapping .-> F3
C -. mapping .-> T
%% Assign Classes
class A,B,C latent
class F1,F2,F3,T observable
style LatentSpace fill:#f0faff,stroke:#01579b,stroke-dasharray: 5 5
style ObservableSpace fill:#fafafa,stroke:#212121
2. Reproducibility tree
Reproducibility is driven by KeyedRng, which maps one base seed onto
named semantic subtrees.
- One run seed fans out into deterministic run, dataset, layout, split, missingness, noise, and benchmark subtrees.
- Canonical bundle replay uses `seed` together with `dataset_index`/`run_num_datasets`, while exact internal subtree replay uses `metadata.keyed_replay`.
- `dataset_seed` remains a stable child-seed identifier for deferred filter and diagnostics compatibility; it is not the exact keyed runtime root.
- Changing one component path should not perturb unrelated component randomness.
Illustrative derivation chain:
KeyedRng(run_seed)
-> keyed("rows")
-> keyed("plan_candidate", attempt, "layout")
-> keyed("plan_candidate", attempt, "execution_plan")
-> keyed("dataset", i, "noise_runtime")
-> keyed("dataset", i, "attempt", attempt, "split")
-> keyed("dataset", i, "attempt", attempt, "postprocess")
-> keyed("dataset", i, "attempt", attempt, "missingness")
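The idea behind a keyed derivation chain can be sketched with a hash over the base seed and a named path. This is not the `KeyedRng` implementation; `derive_seed`, the use of SHA-256, and the 64-bit seed width are assumptions chosen to show why changing one path leaves the others untouched.

```python
import hashlib
import random


def derive_seed(base_seed: int, *path) -> int:
    """Hash (base_seed, path) into a 64-bit child seed (hypothetical helper)."""
    # Tag each component with its type so ("dataset", 1) and ("dataset", "1") differ.
    parts = [f"int:{base_seed}"] + [f"{type(p).__name__}:{p}" for p in path]
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


run_seed = 1234

# Each component gets its own seed, derived only from the base seed and its path.
split_seed = derive_seed(run_seed, "dataset", 0, "attempt", 1, "split")
noise_seed = derive_seed(run_seed, "dataset", 0, "noise_runtime")

# Independent generators per component: drawing more numbers from one
# cannot shift the stream seen by the other.
split_rng = random.Random(split_seed)
noise_rng = random.Random(noise_seed)
print(split_rng.random(), noise_rng.random())
```

Because each child seed depends only on the base seed and its own path, adding draws to one component does not change the values another component sees, which is the property a shared sequential generator lacks.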
3. Split validity retries and deferred filter stage
Generation retries cover split-validity and generation exceptions only.
- Retries are bounded by `filter.max_attempts` (retry budget reused from config).
- Emitted metadata records `attempt_used` and generation-attempt counters.
- Generated outputs mark `metadata.filter.mode=deferred` and `metadata.filter.status=not_run`.
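A bounded retry loop of this shape might look like the sketch below. `generate_once`, `SplitInvalid`, the validity rule (both classes present on both sides of the split), and the 1-based `attempt_used` are assumptions for illustration; only the metadata field names come from this guide.

```python
import random


class SplitInvalid(Exception):
    """Raised by this sketch when a sampled split is unusable."""


def generate_once(rng):
    """Hypothetical single attempt: sample binary labels and a train mask."""
    labels = [rng.randint(0, 1) for _ in range(8)]
    is_train = [rng.random() < 0.5 for _ in range(8)]
    train = {y for y, t in zip(labels, is_train) if t}
    test = {y for y, t in zip(labels, is_train) if not t}
    # Assumed validity rule: both classes appear in train and in test.
    if train != {0, 1} or test != {0, 1}:
        raise SplitInvalid("a class is missing from train or test")
    return labels, is_train


def generate_with_retries(seed, max_attempts):
    """Retry until a valid split is found or the budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        # Each attempt gets its own deterministic random stream.
        rng = random.Random(f"{seed}/attempt/{attempt}")
        try:
            labels, is_train = generate_once(rng)
        except SplitInvalid:
            continue
        metadata = {
            "attempt_used": attempt,  # assumed 1-based in this sketch
            "filter": {"mode": "deferred", "status": "not_run"},
        }
        return labels, is_train, metadata
    raise RuntimeError(f"no valid split within {max_attempts} attempts")


labels, is_train, metadata = generate_with_retries(seed=7, max_attempts=50)
print(metadata)
```

Note that the loop never evaluates data quality; it only guards against structurally unusable outputs, which is why the filter status stays `not_run`.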
Data-quality acceptance is a separate stage:
- Run `dagzoo filter --in <shard_dir> --out <out_dir>`.
- Backend: CPU ExtraTrees-based wins-ratio filter.
- Replay config is taken from embedded shard metadata; artifacts without the required embedded metadata are rejected.
- Deferred runs emit acceptance/rejection artifacts and optional curated shards.
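As a rough illustration of a wins ratio (the bootstrap fraction where a model beats a baseline), the sketch below works on precomputed per-row errors. It does not fit ExtraTrees and does not reproduce dagzoo's filter; the function name, the error inputs, the strict-inequality tie rule, and the bootstrap count are all assumptions.

```python
import random


def wins_ratio(model_errors, baseline_errors, n_bootstrap=200, seed=0):
    """Fraction of bootstrap resamples where the model's mean error is lower."""
    if len(model_errors) != len(baseline_errors) or not model_errors:
        raise ValueError("need two non-empty, equal-length error lists")
    rng = random.Random(seed)
    n = len(model_errors)
    wins = 0
    for _ in range(n_bootstrap):
        # Resample row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model_mean = sum(model_errors[i] for i in idx) / n
        baseline_mean = sum(baseline_errors[i] for i in idx) / n
        # Ties count as losses in this sketch.
        wins += model_mean < baseline_mean
    return wins / n_bootstrap


# Hypothetical held-out errors: the model is usually, but not always, better.
model = [0.1, 0.4, 0.2, 0.9, 0.1, 0.3]
baseline = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(wins_ratio(model, baseline))
```

A dataset whose features carry no signal about the target would be expected to score near the low end of this ratio, which is the intuition behind using it as an acceptance gate.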
4. Effective config and traceability
Generation and benchmark commands resolve effective config through staged overrides, then validate constraints.
Generate path (high-level):
- Base config YAML
- Device override normalization
- Hardware policy application
- Missingness/diagnostics CLI overrides
- Final generation constraint validation
Every run writes:
- `effective_config.yaml`
- `effective_config_trace.yaml`
The trace is field-level provenance (path, source, old_value,
new_value) for resolved settings.
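Staged overrides with field-level provenance can be sketched as follows. The trace keys (`path`, `source`, `old_value`, `new_value`) follow the description above; the helper `apply_overrides`, the dotted-path convention, the stage names, and the config keys are hypothetical.

```python
def apply_overrides(config, overrides, source, trace):
    """Apply dotted-path overrides in place and record each actual change."""
    for path, new_value in overrides.items():
        node = config
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        old_value = node.get(leaf)
        if old_value != new_value:
            node[leaf] = new_value
            trace.append(
                {
                    "path": path,
                    "source": source,
                    "old_value": old_value,
                    "new_value": new_value,
                }
            )


# Hypothetical base config and stages, applied in order.
config = {"runtime": {"device": "auto"}, "missingness": {"rate": 0.0}}
trace = []
apply_overrides(config, {"runtime.device": "cpu"}, "hardware_policy", trace)
apply_overrides(config, {"missingness.rate": 0.1}, "cli", trace)
apply_overrides(config, {"runtime.device": "cpu"}, "cli", trace)  # no-op, not traced

for entry in trace:
    print(entry)
```

Recording only real changes keeps the trace short enough to answer the usual question: which stage set this value, and what did it replace?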
5. Hardware-aware execution semantics
dagzoo tracks three related but distinct runtime notions:
- `requested_device`: normalized user intent (`auto`, `cpu`, `cuda`, `mps`)
- `resolved_device`: backend selected from request/environment
- `device`: backend used for dataset execution in that attempt
Notable runtime behavior:
- `auto` resolves to an available accelerator first, else CPU.
- Backend runtime errors surface directly; generation does not rewrite the resolved device after execution starts.
- Split/postprocess control RNG runs on CPU to avoid tiny-op accelerator overhead.
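The request/resolution distinction can be sketched in a few lines. `resolve_device` and the `cuda`-before-`mps` preference are assumptions; the sketch takes the set of available backends as an argument instead of probing real hardware.

```python
VALID_DEVICES = {"auto", "cpu", "cuda", "mps"}


def resolve_device(requested, available):
    """Return (requested_device, resolved_device) for a user request."""
    requested = requested.strip().lower()  # normalize user intent
    if requested not in VALID_DEVICES:
        raise ValueError(f"unknown device: {requested!r}")
    if requested == "auto":
        # Assumed preference order among accelerators.
        for candidate in ("cuda", "mps"):
            if candidate in available:
                return requested, candidate
        return requested, "cpu"
    # Explicit requests pass through unchanged; if the backend is missing,
    # the error surfaces later at execution time with no silent fallback.
    return requested, requested


print(resolve_device("auto", {"cpu", "mps"}))
print(resolve_device("CUDA ", {"cpu"}))
```

Keeping both values in metadata makes it possible to tell "the user asked for `auto` and got CPU" apart from "the user asked for CPU".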
Mathematical Foundations
Formal equations are canonicalized in transforms.md.
- Canonical equations + implementation map: transforms.md
- Shared notation and symbol definitions: transforms.md#notation-and-symbols
Quick index to the formal sections:
- DAG sampling: strict upper-triangular Bernoulli sampling with Cauchy latent logits and shift-adjusted edge bias.
- Mechanism-family sampling: family-mix weights plus mechanism logit tilt produce runtime family probabilities.
- Node pipeline: root/parent composition, latent sanitization and weighting, converter slicing, and final scaling.
- Converters and noise: numeric/categorical converter equations and dataset-level noise runtime selection (including mixture-mode behavior).
End-to-end flow
This diagram shows command-level orchestration and where generation, benchmarking, and hardware inspection diverge.
flowchart TB
%% Class Definitions
classDef cli fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
classDef gen fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
classDef bench fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c
classDef flt fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#1b5e20
classDef hw fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#33691e
CLI(["dagzoo CLI"])
CLI --> GenCfg
CLI --> BenchCfg
CLI --> Detect
subgraph GeneratePath [generate]
direction TB
subgraph Setup [" "]
GenCfg[load + resolve config] --> Loop[generate_batch_iter]
Loop --> Seed[derive dataset/component seeds]
end
subgraph Sampling [" "]
Seed --> Layout[sample layout]
Layout --> DAG[sample DAG + assignments]
DAG --> Resolve[resolve shift + noise]
end
subgraph Execution [" "]
Resolve --> Exec[run node pipelines]
end
subgraph Emission [" "]
Exec --> Post[postprocess]
Post --> Missing[optional missingness]
Missing --> Bundle[[emit DatasetBundle]]
end
end
subgraph FilterPath [filter]
direction TB
FilterCmd[dagzoo filter] --> Replay[replay ExtraTrees over shards]
Replay --> FilterArtifacts[write manifest + summary + optional curated shards]
end
Bundle --> FilterCmd
subgraph BenchmarkPath [benchmark]
direction TB
BenchCfg[resolve preset/suite configs] --> Runs[run benchmark suite]
Runs --> Guards[emit guardrails]
Guards --> Reports[write summary]
end
subgraph HardwarePath [hardware]
direction TB
Detect[detect_hardware] --> Print[print backend/tier]
end
%% Assign Classes
class CLI cli
class GenCfg,Loop,Seed,Layout,DAG,Resolve,Exec,Post,Missing,Bundle gen
class FilterCmd,Replay,FilterArtifacts flt
class BenchCfg,Runs,Guards,Reports bench
class Detect,Print hw
style Setup fill:transparent,stroke:none
style Sampling fill:transparent,stroke:none
style Execution fill:transparent,stroke:none
style Emission fill:transparent,stroke:none
Generation pipeline walkthrough
This section maps the runtime to module boundaries and data flow.
1) Entry points and orchestration boundaries
- Public generation APIs live in `src/dagzoo/core/dataset.py`.
- `dataset.py` is a façade over focused internals:
  - `generation_context.py`: seed/split/device/dtype helpers
  - `generation_runtime.py`: shared finalization, stratified split, and postprocess helpers
  - `noise_runtime.py`: per-dataset noise runtime selection
  - `fixed_layout/runtime.py`: internal canonical run preparation, classification replay validation, and batched execution orchestration
  - `fixed_layout/metadata.py`: shared fixed-layout metadata helpers and layout signatures
2) Layout and structure sampling
- `_sample_layout` samples feature counts/types, class bounds, and feature/target-to-node assignments.
- `sample_dag` samples strict upper-triangular adjacency.
- Adjacency convention is `adjacency[src, dst]`; parents of node `j` are read from column `adjacency[:, j]`.
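The adjacency convention can be made concrete with a small pure-Python sketch. It is not dagzoo's `sample_dag`: the signature, the `edge_bias` parameter, and the exact logit form (standard Cauchy draw plus a bias, passed through a sigmoid) are assumptions, and the canonical equations live in transforms.md. The sketch uses nested lists, so `adjacency[src][dst]` stands in for `adjacency[src, dst]`.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_dag(num_nodes, edge_bias=0.0, seed=0):
    """Sample a strictly upper-triangular 0/1 adjacency matrix (sketch)."""
    rng = random.Random(seed)
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        # Only dst > src is considered, so no cycle can ever form.
        for dst in range(src + 1, num_nodes):
            # Standard Cauchy draw via the inverse CDF, shifted by the bias.
            logit = math.tan(math.pi * (rng.random() - 0.5)) + edge_bias
            adjacency[src][dst] = int(rng.random() < sigmoid(logit))
    return adjacency


def parents(adjacency, j):
    """Parents of node j are the non-zero rows of column j."""
    return [i for i in range(len(adjacency)) if adjacency[i][j]]


adjacency = sample_dag(5, edge_bias=0.5, seed=42)
for j in range(5):
    print(j, parents(adjacency, j))
```

Because every edge points from a lower index to a higher one, node 0 is always a root and index order is a valid topological order.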
3) Node execution and tensor assembly
- Nodes execute in index order, which is also a topological order because the adjacency matrix is strictly upper-triangular.
- Root nodes sample latent points directly.
- Child nodes consume parent outputs using multi-parent composition:
  - 50% path: concatenate parents and apply one mechanism
  - 50% path: apply per-parent mechanisms, then aggregate via `sum | product | max | logsumexp`
- Converter specs slice latent columns and emit feature/target values.
- Unassigned feature slots are filled with sampled noise.
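The aggregation step of the per-parent path can be sketched as an elementwise reduction across parents. `aggregate_parents` and the list-of-lists representation are assumptions for illustration; the real runtime operates on tensors, and the per-parent mechanisms are omitted here.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) by factoring out the maximum."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


AGGREGATORS = {
    "sum": sum,
    "product": math.prod,
    "max": max,
    "logsumexp": logsumexp,
}


def aggregate_parents(parent_outputs, kind):
    """Combine equally sized per-parent outputs position by position."""
    reducer = AGGREGATORS[kind]
    # zip(*...) groups the i-th value of every parent together.
    return [reducer(group) for group in zip(*parent_outputs)]


# Two hypothetical parents, each contributing two latent values.
parent_outputs = [[1.0, 2.0], [3.0, 4.0]]
for kind in AGGREGATORS:
    print(kind, aggregate_parents(parent_outputs, kind))
```

All four reducers keep the output the same size as a single parent's output, unlike the concatenation path, where the combined input grows with the number of parents.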
4) Quality, shift/noise controls, and postprocessing
- Shift runtime params modulate graph/mechanism/noise behavior when enabled.
- Noise runtime resolution picks one family per dataset in mixture mode, then propagates through node-level samplers.
- Split, postprocess, and missingness run in-generation.
- Classification split validity is enforced before bundle emission.
Canonical postprocess behavior:
- Public generation preserves emitted schema across a canonical run.
- Classification runs may validate the requested run up front before the first bundle is emitted so later dataset seeds cannot fail after partial output.
5) Metadata and output emission
Each bundle includes runtime metadata for lineage, deferred-filter status, shift, noise distribution, and resolved config snapshot.
- `lineage` aligns emitted columns with DAG node assignments.
- `requested_device`, `resolved_device`, and the reserved `device_fallback_reason` field are emitted for runtime observability.
- Canonical generation outputs add `layout_mode`, `layout_plan_seed`, `layout_signature`, `dataset_seed`, and `keyed_replay`.
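One way to obtain a deterministic layout fingerprint is to hash a canonical serialization. The sketch below shows the idea only; the choice of SHA-256, the JSON canonicalization, and the layout fields are assumptions, not the format dagzoo writes to `layout_signature`.

```python
import hashlib
import json


def layout_signature(layout):
    """Hash a JSON-serializable layout into a stable hex fingerprint."""
    # sort_keys and fixed separators make the text independent of dict order.
    canonical = json.dumps(layout, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical layout payloads with the same content in different key order.
layout_a = {"n_features": 3, "feature_types": ["num", "cat", "num"], "n_nodes": 3}
layout_b = {"n_nodes": 3, "n_features": 3, "feature_types": ["num", "cat", "num"]}

print(layout_signature(layout_a) == layout_signature(layout_b))
```

Two bundles that report the same signature can then be assumed to share a layout without comparing the layouts field by field.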
DAG/node data flow
This diagram focuses on node-level execution mechanics inside the canonical generation runtime.
flowchart TB
%% Class Definitions
classDef setup fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
classDef loop fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
classDef pipe fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c
classDef out fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#33691e
Setup[layout + DAG + node specs] --> Walk[for node in topological order]
Walk --> Root{root node?}
Root -- yes --> RootSample[sample_random_points]
Root -- no --> ParentMix[compose parent outputs]
RootSample --> Pipe
ParentMix --> Pipe
subgraph Pipe [apply_node_pipeline]
Clean[nan_to_num + clamp] --> Std[standardize]
Std --> Weight[random weights + normalize]
Weight --> Slice[slice latent per spec]
Slice --> Convert[apply converter]
Convert --> Feedback[write output back]
Feedback --> Scale[final scaling]
end
Pipe --> Emit[emit node output + extracted values]
Emit --> Walk
Emit --> Assemble[assemble final X/y]
Assemble --> Out[[return tensors + deferred filter metadata]]
%% Assign Classes
class Setup setup
class Walk,Root,RootSample,ParentMix loop
class Clean,Std,Weight,Slice,Convert,Feedback,Scale,Pipe pipe
class Emit,Assemble,Out out
Diagnostics, fixed layout, and benchmark guardrails
These are related but distinct runtime surfaces.
- Canonical fixed-layout generation controls structural consistency across emitted datasets.
- Diagnostics aggregates observability metrics across emitted bundles.
- Benchmark guardrails evaluate runtime/metadata regressions in suite runs.
Glossary quick reference
- layout: sampled feature/task/assignment scaffold for one dataset.
- DAG adjacency: upper-triangular parent-child matrix, `src -> dst`.
- node pipeline: per-node transform and converter execution path.
- converter spec: instruction for extracting observable feature/target slices.
- deferred filter: ExtraTrees-based post-generation gate for signal quality.
- wins ratio: bootstrap fraction where model beats baseline.
- shift runtime params: resolved graph/mechanism/noise drift controls.
- noise runtime selection: per-dataset resolved noise family/params.
- fixed-layout plan: internal sampled layout/execution payload reused within one canonical run.
- layout signature: deterministic hash fingerprint of a sampled layout.
- DatasetBundle: in-memory output container with tensors + metadata.
- effective config trace: field-level override provenance artifact.
Where to go next
- Canonical transform equations and symbol definitions: `docs/transforms.md`
- Output schema and metadata contract: `docs/output-format.md`
- Config precedence and trace details: `docs/development/config-resolution.md`
- CLI workflow examples: `docs/usage-guide.md`
- Architecture rationale: `docs/development/design-decisions.md`