# Artifacts & API
Consumer-facing specification for generated data. This is a contract document: downstream users can rely on the guarantees described here.
If you are new to dagzoo, read start.md and reference-packs.md first. Come
back here when you need to rely on what the public CLI or PyTorch bridge
writes and returns.
Public config references accepted by the main user-facing surfaces are:

- a YAML path
- a curated recipe reference in the form `recipe:<name>`

That contract applies to `dagzoo generate`, `dagzoo benchmark --preset custom`,
`dagzoo diversity-audit`, and `build_dataloader(...)`.
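As a hedged illustration, the two accepted reference forms can be told apart by their prefix. The helper below is invented for this sketch and is not part of dagzoo's API; the recipe name used is also made up.

```python
# Hypothetical helper (NOT part of dagzoo): classify which of the two
# documented public config reference forms a string uses.
def classify_config_ref(ref: str) -> str:
    if ref.startswith("recipe:"):
        return "recipe"      # curated recipe reference, e.g. "recipe:<name>"
    return "yaml_path"       # otherwise treated as a YAML file path

print(classify_config_ref("recipe:baseline"))   # recipe
print(classify_config_ref("configs/run.yaml"))  # yaml_path
```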
This page is the readable overview of that contract. The exhaustive
field-by-field catalog lives in
export-contract-fields.md and is generated from
reference/export_contract_inventory.yaml.
## DatasetBundle (in-memory)
Each generated dataset is returned as a DatasetBundle with these fields:
| Field | Type | Shape |
|---|---|---|
| `X_train` | `torch.Tensor` (float32 or float64) | `(n_train, n_features)` |
| `y_train` | `torch.Tensor` | `(n_train,)` |
| `X_test` | `torch.Tensor` (float32 or float64) | `(n_test, n_features)` |
| `y_test` | `torch.Tensor` | `(n_test,)` |
| `feature_types` | `list[str]` | length `n_features` |
| `metadata` | `dict[str, Any]` | — |
Target dtype is int64 for classification and floating-point for regression.
Feature dtype matches the configured torch dtype.
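As a minimal sketch of the shape contract above, the checks below use NumPy arrays as stand-ins for the torch tensors (both expose `.shape` and `.dtype` with the same semantics here); the checker function and all data are invented for illustration.

```python
import numpy as np

# Illustrative contract check (sketch only): every bundle field must agree
# on n_train / n_test / n_features as documented in the table above.
def check_bundle_shapes(X_train, y_train, X_test, y_test, feature_types):
    n_features = len(feature_types)
    assert X_train.shape[1] == n_features
    assert y_train.shape == (X_train.shape[0],)
    assert X_test.shape[1] == n_features
    assert y_test.shape == (X_test.shape[0],)

X_train = np.zeros((100, 4), dtype=np.float32)
y_train = np.zeros(100, dtype=np.int64)   # int64 targets: classification task
X_test = np.zeros((20, 4), dtype=np.float32)
y_test = np.zeros(20, dtype=np.int64)
check_bundle_shapes(X_train, y_train, X_test, y_test,
                    ["num", "cat", "num", "num"])
```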
### metadata overview
`DatasetBundle.metadata` is the stable in-process metadata payload. Common
top-level keys include:

- runtime and identity fields such as `device`, `requested_device`,
  `resolved_device`, `dataset_index`, `dataset_id`, `dataset_seed`, and
  `run_num_datasets`
- semantic summaries such as `prior`, `lineage`, `shift`, `intervention`,
  `noise_distribution`, `generation_attempts`, and `filter`
- optional task/runtime summaries such as `class_structure`, `missingness`,
  `split_groups`, `keyed_replay`, and `mechanism_families`
- the resolved generator `config` snapshot
The exhaustive recursive contract for metadata.* lives in
export-contract-fields.md.
Observational bundles omit `metadata.intervention`. Hard-interventional bundles
add only the summary object `{mode, signature}` at the top level; the full
authored selector payload remains in `effective_config.yaml` rather than
`metadata.config`.
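A hedged sketch of what a payload satisfying this contract might look like. All values below are invented for illustration; only the key names follow the documented contract.

```python
# Illustrative metadata payload (values invented; keys per the contract).
metadata = {
    "device": "cpu",
    "dataset_index": 0,
    "dataset_id": "demo-000",
    "dataset_seed": 12345,
    "run_num_datasets": 1,
    "prior": {
        "target_derivation": "tabiclv2_latent_node",
        "feature_generator": "latent_dag",
        "missingness_stage": "post_target_observation",
    },
    # Present only for hard-interventional bundles, and only as the
    # summary object; both values here are made up.
    "intervention": {"mode": "hard", "signature": "abc123"},
}

# Observational bundles omit the key entirely, so membership is the check:
is_interventional = "intervention" in metadata
```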
### metadata.prior sub-object
Present for all generated bundles.
| Key | Type | Description |
|---|---|---|
| `target_derivation` | str | Current target-construction contract marker |
| `feature_generator` | str | Feature-generation family |
| `missingness_stage` | str | Stage at which missingness is applied |
| `classification_validity_policy` | str | Classification retry policy |
| `localization_mode` | str | Current localization setting |
| `n_adaptation` | str | Current n-adaptation setting |
Current emitted bundles use:

- `target_derivation = "tabiclv2_latent_node"`
- `feature_generator = "latent_dag"`
- `missingness_stage = "post_target_observation"`
That means y is emitted by converting one selected latent DAG node. There is
no separate observed-feature target mechanism and no soft-label export surface
in the current public contract.
### Feature type encoding
Each entry in feature_types is one of:
"num": continuous feature. After postprocessing,X_trainvalues are clipped and standardized to approximately zero mean and unit variance, and the same train-fit transform is applied toX_test."cat": categorical feature. Observed values are integer indices in the range0 .. cardinality - 1. When missingness is enabled, missing values are encoded asNaN.
feature_types[i] describes column index i in X_train and X_test.
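The encoding above makes it easy to split columns by kind. A minimal sketch with invented data:

```python
import numpy as np

feature_types = ["num", "cat", "num", "cat"]
X = np.array([[0.1, 2.0, -0.5, np.nan],   # NaN: missing categorical value
              [1.3, 0.0, 0.2, 1.0]])

# feature_types[i] describes column i, so index lists select columns directly.
num_idx = [i for i, t in enumerate(feature_types) if t == "num"]
cat_idx = [i for i, t in enumerate(feature_types) if t == "cat"]
X_num = X[:, num_idx]   # continuous columns
X_cat = X[:, cat_idx]   # integer category codes stored as floats
```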
## On-Disk Directory Structure
Plain `dagzoo generate --out ...` runs write:

```text
out_dir/
  effective_config.yaml
  effective_config_trace.yaml
  shard_00000/
    train.parquet
    test.parquet
    dataset_catalog.parquet
  shard_00001/
    ...
  internal/
    shard_00000/
      replay_catalog.parquet
      lineage/
        adjacency.bitpack.bin
        adjacency.index.json
    shard_00001/
      ...
```
`shard_*` directories are the stable public dataset artifacts. `internal/`
holds dagzoo-only replay and lineage sidecars used by tooling such as
`dagzoo filter`; it is not part of the stable public contract.

Shard naming is `shard_{id:05d}`. The default shard size is 128 datasets, so
the shard id is `dataset_index // shard_size`.
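The shard-naming rule above can be written out directly; the helper name is invented for this sketch.

```python
SHARD_SIZE = 128  # documented default shard size

def shard_dir(dataset_index: int, shard_size: int = SHARD_SIZE) -> str:
    """Map a global dataset index to its shard directory name (shard_{id:05d})."""
    return f"shard_{dataset_index // shard_size:05d}"

print(shard_dir(0))    # shard_00000
print(shard_dir(127))  # shard_00000
print(shard_dir(128))  # shard_00001
```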
## Parquet Column Schema
`train.parquet` and `test.parquet` both use packed row-wise records:

| Column | Type | Description |
|---|---|---|
| `dataset_index` | int64 | Global dataset index for this row |
| `row_index` | int64 | Row index within the dataset split |
| `x` | `list[float32/float64]` | Full feature vector for this row |
| `y` | int64 or float | Target value for this row |
Compression is zstd by default.
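Because rows are packed this way, one dataset's matrices can be rebuilt by filtering on `dataset_index` and stacking `x`. A sketch with an invented toy frame; in practice the frame would come from `pd.read_parquet("train.parquet")`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pd.read_parquet("train.parquet") (values invented).
df = pd.DataFrame({
    "dataset_index": [0, 0, 1],
    "row_index": [0, 1, 0],
    "x": [[0.1, 0.2], [0.3, 0.4], [9.9, 8.8]],
    "y": [1, 0, 2],
})

# Select one dataset, order rows, and stack the packed vectors into a matrix.
ds0 = df[df["dataset_index"] == 0].sort_values("row_index")
X0 = np.stack(ds0["x"].to_list())   # shape (n_rows, n_features)
y0 = ds0["y"].to_numpy()
```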
## Dataset Catalog Parquet
Each public shard writes one dataset_catalog.parquet file with one row per
dataset. The row stores the canonical semantic payload in record_json, plus a
checksum and resolved scalar columns that downstream tooling can filter without
re-parsing every record.
| Column | Type | Description |
|---|---|---|
| `dataset_index` | int64 | Shard-local dataset index |
| `record_json` | large_string | Canonical JSON payload for the dataset catalog record |
| `record_sha256` | string | SHA-256 of `record_json` |
| `resolved_dataset_id` | string \| null | — |
| `resolved_request_run` | string \| null | — |
| `resolved_task` | string | `classification` or `regression` |
| `resolved_n_train` | int64 | Train row count |
| `resolved_n_test` | int64 | Test row count |
| `resolved_n_features` | int64 | Emitted feature count |
| `resolved_n_classes` | int64 \| null | — |
| `resolved_filter_mode` | string \| null | — |
| `resolved_filter_status` | string \| null | — |
| `resolved_filter_accepted` | bool \| null | — |
| `teacher_conditionals_available` | bool | Whether teacher conditionals were available for the dataset |
The semantic payload embedded in `record_json` carries the stable per-dataset
fields used by downstream consumers, including `dataset_id`, `task`,
`group_ids`, `intervention`, `target_derivation`, and `target_relevance`.
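The intended read pattern is two-step: pre-filter cheaply on the resolved scalar columns, then JSON-parse `record_json` only for the survivors. A sketch with an invented toy frame standing in for `pd.read_parquet("dataset_catalog.parquet")`:

```python
import json
import pandas as pd

# Toy stand-in for pd.read_parquet("dataset_catalog.parquet") (values invented).
catalog = pd.DataFrame({
    "dataset_index": [0, 1],
    "record_json": [
        json.dumps({"dataset_id": "a", "task": "classification"}),
        json.dumps({"dataset_id": "b", "task": "regression"}),
    ],
    "resolved_task": ["classification", "regression"],
    "resolved_n_features": [10, 7],
})

# Step 1: cheap scalar-column filter, no JSON parsing yet.
clf = catalog[catalog["resolved_task"] == "classification"]

# Step 2: parse the canonical payload only for the rows that survived.
records = [json.loads(r) for r in clf["record_json"]]
```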
## Generate Handoff Layout (`dagzoo generate --handoff-root`)

Generate handoff runs use the supplied handoff root as a stable downstream
entrypoint. This is the public layout consumed by `dagzoo publish hub`:
```text
handoff_root/
  handoff_manifest.json
  generated/
    shard_00000/
      train.parquet
      test.parquet
      dataset_catalog.parquet
  internal/
    effective_config.yaml
    effective_config_trace.yaml
    shard_00000/
      replay_catalog.parquet
      lineage/
        adjacency.bitpack.bin
        adjacency.index.json
  curated/
    ...   # optional, written later by dagzoo filter
```
`generated/` reuses the same public shard contract described above.
`internal/` remains dagzoo-only. `replay_catalog.parquet` stores the full
per-dataset metadata payload, including the same summary-only intervention
object when present.
When you publish a handoff root to Hugging Face Hub, dagzoo uploads only the
public handoff artifacts:

- `generated/`
- `curated/` when present
- `handoff_manifest.json`
- a generated root `README.md` dataset card

`internal/` stays local and is never uploaded.
### handoff_manifest.json
handoff_manifest.json uses this versioned top-level contract:
| Key | Type | Description |
|---|---|---|
| `schema_name` | str | Exact string `dagzoo_generate_handoff_manifest` |
| `schema_version` | int | Exact integer `5` |
| `identity` | object | Stable generate-run and corpus ids plus source-family tag |
| `artifacts_relative` | object | Manifest-relative artifact paths for portable downstream consumption |
| `summary` | object | Generated dataset count |
| `provenance` | object | Optional generated-corpus provenance summary derived from the public dataset catalog |
Current `identity` keys:

- `source_family`
- `generate_run_id`
- `generated_corpus_id`
Current `identity.source_family` values:

- `dagzoo.heterogeneous_scm`
- `dagzoo.fixed_layout_scm`
Current `artifacts_relative` keys:

- `generated_dir`
- `curated_dir` (optional; present only after a curated corpus exists)
Current `summary` keys:

- `generated_datasets`
Current `provenance` keys:

- `intervention` (optional)
- `target_derivation`
- `target_relevant_feature_count_range`
- `target_relevant_feature_fraction_range`
Current `provenance.intervention` keys:

- `mode`
- `signature`
`provenance.intervention` is omitted for observational generated corpora.
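The versioned top-level contract can be checked at load time. A hedged loader sketch (the function name is invented; only the key names and exact `schema_name`/`schema_version` values come from the contract above):

```python
import json
from pathlib import Path

def load_handoff_manifest(handoff_root):
    """Sketch: load handoff_manifest.json, verify the versioned contract,
    and resolve the manifest-relative generated-data directory."""
    root = Path(handoff_root)
    manifest = json.loads((root / "handoff_manifest.json").read_text())
    assert manifest["schema_name"] == "dagzoo_generate_handoff_manifest"
    assert manifest["schema_version"] == 5
    # artifacts_relative paths are relative to the manifest's directory.
    generated_dir = root / manifest["artifacts_relative"]["generated_dir"]
    return manifest, generated_dir
```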
## Lineage Schema

Schema name: `dagzoo.dag_lineage`
This page documents the current lineage contract:
### Version 1.4.0 (dense, in-memory)

Used in `DatasetBundle.metadata["lineage"]` during generation.
```json
{
  "schema_name": "dagzoo.dag_lineage",
  "schema_version": "1.4.0",
  "graph": {
    "n_nodes": 8,
    "adjacency": [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
  },
  "assignments": {
    "feature_to_node": [2, 3, 5, 7],
    "target_to_node": 6,
    "target_relevant_features": [0, 1, 3],
    "target_relevant_feature_count": 3,
    "target_relevant_feature_fraction": 0.75
  }
}
```
### Version 1.5.0 (compact, on-disk)

Used in persisted replay metadata when lineage artifacts are written to disk.
```json
{
  "schema_name": "dagzoo.dag_lineage",
  "schema_version": "1.5.0",
  "graph": {
    "n_nodes": 8,
    "edge_count": 12,
    "adjacency_ref": {
      "encoding": "upper_triangle_bitpack_v1",
      "blob_path": "lineage/adjacency.bitpack.bin",
      "index_path": "lineage/adjacency.index.json",
      "dataset_index": 0,
      "bit_offset": 0,
      "bit_length": 28,
      "sha256": "a1b2c3..."
    }
  },
  "assignments": {
    "feature_to_node": [2, 3, 5, 7],
    "target_to_node": 6,
    "target_relevant_features": [0, 1, 3],
    "target_relevant_feature_count": 3,
    "target_relevant_feature_fraction": 0.75
  }
}
```
### Adjacency encoding: `upper_triangle_bitpack_v1`
- Packs the `n_nodes * (n_nodes - 1) / 2` upper-triangle bits into bytes.
- Bit order is little-endian.
- `bit_offset` and `bit_length` locate one dataset's adjacency bits inside the shared shard-level blob.
- `sha256` is a hex-encoded SHA-256 checksum of the packed bytes for that dataset's adjacency data.
Each shard also contains `lineage/adjacency.index.json` with schema
identifiers, the encoding name, and the per-dataset offset/length/checksum
entries. Those artifacts live in the shard's `lineage/` directory.
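Under the rules above (and assuming a row-major fill order for the upper triangle, which this page does not pin down), decoding a dataset's adjacency matrix from the shared blob can be sketched as:

```python
import numpy as np

def decode_upper_triangle_bitpack(blob: bytes, bit_offset: int,
                                  n_nodes: int) -> np.ndarray:
    """Sketch of upper_triangle_bitpack_v1 decoding: little-endian bit order
    within each byte; row-major upper-triangle fill order is an ASSUMPTION."""
    n_bits = n_nodes * (n_nodes - 1) // 2
    bits = []
    for k in range(n_bits):
        pos = bit_offset + k
        bits.append((blob[pos // 8] >> (pos % 8)) & 1)  # little-endian
    adj = np.zeros((n_nodes, n_nodes), dtype=np.uint8)
    idx = 0
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):   # strict upper triangle only
            adj[i, j] = bits[idx]
            idx += 1
    return adj
```

For `n_nodes = 3` the three bits cover edges (0,1), (0,2), (1,2) under the assumed fill order, so a single byte `0b101` would decode to edges (0,1) and (1,2).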
## Diagnostics Coverage Summary Artifacts
When diagnostics are enabled, the run root also includes:
- `coverage_summary.json`
- `coverage_summary.md`
These artifacts summarize corpus-level coverage and do not alter the public
parquet or dataset_catalog.parquet contract.
The exhaustive field list for diagnostics summaries lives in export-contract-fields.md.
## Contract Guarantees
- **Determinism:** seed derivation is deterministic. For a fixed seed and configuration, runs are expected to reproduce metadata and numerical outputs within tolerance. Strict byte-identical tensors/files are not guaranteed across all backends.
- **Feature alignment:** `feature_types[i]` describes feature index `i` inside packed parquet row vectors and tensor column index `i` in `X_train` / `X_test`.
- **Target semantics:** the current public contract derives `y` from one selected latent DAG node and applies missingness afterward as an observation process over emitted features.
- **Lineage integrity:** each dataset's bitpacked adjacency data is protected by a SHA-256 checksum recorded in the compact lineage payload.
Postprocessing invariants:

- Default public generation may vary the emitted feature schema across one run.
- Stratified mode (`runtime.layout_mode: stratified`) still preserves heterogeneous semantics; constant-column removal and feature-column permutation remain dataset-local even when compatible strata are batched.
- Numeric features are clipped and standardized using statistics fit on the emitted training split, then applied unchanged to the test split.
- Regression targets are clipped and standardized using statistics fit on the emitted training split, then applied unchanged to the test split.
- Classification target classes are randomly permuted; label indices carry no ordinal meaning.
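The train-fit standardization invariant can be sketched as follows (synthetic data; the real pipeline also clips values, which is omitted here):

```python
import numpy as np

# Synthetic numeric column (values invented for the sketch).
rng = np.random.default_rng(0)
X_train = rng.normal(3.0, 2.0, size=(1000, 1))
X_test = rng.normal(3.0, 2.0, size=(200, 1))

# Fit statistics on the training split only...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# ...then apply the SAME transform to both splits.
X_train_std = (X_train - mu) / sigma   # ≈ zero mean, unit variance by design
X_test_std = (X_test - mu) / sigma    # train-fit transform reused unchanged
```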