Output Format

Artifact schema, metadata contract, and shard layout.

Consumer-facing specification for generated data. This is a contract document — downstream users can rely on the guarantees described here.


DatasetBundle (in-memory)

Each generated dataset is returned as a DatasetBundle with these fields:

| Field | Type | Shape |
| --- | --- | --- |
| X_train | torch.Tensor (float32 or float64) | (n_train, n_features) |
| y_train | torch.Tensor | (n_train,) |
| X_test | torch.Tensor (float32 or float64) | (n_test, n_features) |
| y_test | torch.Tensor | (n_test,) |
| feature_types | list[str] | length n_features |
| metadata | dict[str, Any] | |

Target dtype: int64 for classification, float for regression.

Feature dtype: matches the configured torch dtype (float32 or float64).


Feature type encoding

Each entry in feature_types is one of:

  • "num" — continuous feature. After postprocessing, values are clipped and standardized to approximately zero mean and unit variance.
  • "cat" — categorical feature. Observed values are integer indices in the range 0 .. cardinality - 1. When missingness is enabled, missing values are encoded as NaN.

feature_types[i] describes column index i in X_train and X_test.
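
As a rough illustration of the encoding above, the following sketch splits column indices by feature type and checks that categorical cells are either NaN or non-negative integer indices. The helper names (split_feature_columns, categorical_values_ok) are hypothetical and not part of dagzoo; rows are plain Python lists standing in for tensor rows.

```python
import math


def split_feature_columns(feature_types):
    """Return (numeric_column_indices, categorical_column_indices)."""
    num_cols = [i for i, t in enumerate(feature_types) if t == "num"]
    cat_cols = [i for i, t in enumerate(feature_types) if t == "cat"]
    if len(num_cols) + len(cat_cols) != len(feature_types):
        raise ValueError("feature_types may only contain 'num' or 'cat'")
    return num_cols, cat_cols


def categorical_values_ok(rows, cat_cols):
    """True if every categorical cell is NaN (missing) or a non-negative integer index."""
    for row in rows:
        for j in cat_cols:
            value = row[j]
            if math.isnan(value):
                continue  # missing value, allowed when missingness is enabled
            if value < 0 or value != int(value):
                return False
    return True


# Illustrative data only: one categorical column between two numeric columns.
feature_types = ["num", "cat", "num"]
rows = [[0.12, 2.0, -1.3], [1.05, float("nan"), 0.4]]
num_cols, cat_cols = split_feature_columns(feature_types)
```

The check does not verify the upper bound cardinality - 1, because the per-feature cardinality is not part of feature_types.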


On-disk directory structure

out_dir/
  shard_00000/
    train.parquet
    test.parquet
    metadata.ndjson
    lineage/
      adjacency.bitpack.bin
      adjacency.index.json
  shard_00001/
    ...

Shard naming: shard_{id:05d} — five-digit zero-padded shard ID. Default: 128 datasets per shard.

Shard ID calculation: dataset_index // shard_size.
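
The naming and ID rules above can be combined into a small helper. This is a minimal sketch; the function name shard_dir_name is hypothetical, and the default of 128 is taken from the shard-size default stated above.

```python
def shard_dir_name(dataset_index, shard_size=128):
    """Map a global dataset index to its shard directory name."""
    if dataset_index < 0 or shard_size <= 0:
        raise ValueError("dataset_index must be >= 0 and shard_size must be > 0")
    shard_id = dataset_index // shard_size  # integer division, as in the rule above
    return f"shard_{shard_id:05d}"          # five-digit zero-padded shard ID
```

With the default shard size, dataset indices 0 through 127 map to shard_00000 and index 128 is the first dataset in shard_00001.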


Request-run layout (dagzoo request)

Request-driven runs use output_root as a stable handoff root:

output_root/
  handoff_manifest.json
  generated/
    shard_00000/
    effective_config.yaml
    effective_config_trace.yaml
  filter/
    filter_manifest.ndjson
    filter_summary.json
  curated/
    shard_00000/

generated/, filter/, and curated/ reuse the same shard and summary contracts documented on this page. handoff_manifest.json is the downstream entrypoint for consumers such as tab-foundry.

Request handoff manifest JSON

handoff_manifest.json uses this versioned top-level contract:

| Key | Type | Description |
| --- | --- | --- |
| schema_name | str | Exact string dagzoo_request_handoff_manifest |
| schema_version | int | Exact integer 1 |
| request | object | Request-file echo with path and normalized public payload |
| artifacts | object | Absolute paths for the run root, generated output, filtered corpus, and summaries |
| summary | object | Generated / accepted / rejected counts plus acceptance rate |
| throughput | object | Generation-stage and filter-stage elapsed time plus datasets-per-minute context |
| hardware | object | Requested/resolved device context plus applied hardware policy |
| diversity_artifacts | object | Nullable paths for request-associated diversity report artifacts |

Current artifacts keys:

  • run_root
  • generated_dir
  • filter_dir
  • filtered_corpus_dir
  • effective_config_path
  • effective_config_trace_path
  • filter_manifest_path
  • filter_summary_path

Current diversity_artifacts keys:

  • summary_json_path
  • summary_md_path

throughput.generation_stage and throughput.filter_stage report request-run wall-clock stage timing from the dagzoo request workflow. filter_summary.json remains the underlying deferred-filter artifact and retains its own timing semantics.

dagzoo request does not run a diversity audit automatically, so the diversity_artifacts values are currently null unless a separate workflow persists request-associated diversity outputs alongside the run.
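
A consumer might guard against contract drift before reading any paths from the manifest. The sketch below checks only the top-level keys and the exact schema_name / schema_version values documented above; check_handoff_manifest is a hypothetical helper, and the example manifest uses empty placeholder objects rather than real run output.

```python
EXPECTED_SCHEMA_NAME = "dagzoo_request_handoff_manifest"
EXPECTED_SCHEMA_VERSION = 1
REQUIRED_KEYS = (
    "schema_name", "schema_version", "request", "artifacts",
    "summary", "throughput", "hardware", "diversity_artifacts",
)


def check_handoff_manifest(manifest):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    if manifest.get("schema_name") != EXPECTED_SCHEMA_NAME:
        problems.append("unexpected schema_name")
    if manifest.get("schema_version") != EXPECTED_SCHEMA_VERSION:
        problems.append("unexpected schema_version")
    return problems


# Placeholder manifest: sub-objects are left empty on purpose.
example_manifest = {
    "schema_name": "dagzoo_request_handoff_manifest",
    "schema_version": 1,
    "request": {}, "artifacts": {}, "summary": {},
    "throughput": {}, "hardware": {}, "diversity_artifacts": {},
}
problems = check_handoff_manifest(example_manifest)
```

In practice the dict would come from json.load on handoff_manifest.json. The nested artifacts keys are described above as "current", so this sketch deliberately does not enforce them.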


Parquet column schema

Shard-level train.parquet and test.parquet both use packed row-wise records:

| Column | Type | Description |
| --- | --- | --- |
| dataset_index | int64 | Global dataset index for this row |
| row_index | int64 | Row index within the dataset split |
| x | list[float32/float64] | Full feature vector for this row |
| y | int64 or float | Target value for this row (task-dependent) |

Compression: zstd (default).

Feature typing metadata remains per-dataset in metadata.ndjson records.
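
Because each Parquet row carries one sample of one dataset, consumers usually regroup rows by dataset_index before use. The sketch below works on a list of plain dicts with the four documented columns, which is the shape a Parquet reader would typically produce when converting a table to Python records; the reading step itself is assumed and not shown, and group_rows_by_dataset is a hypothetical helper.

```python
from collections import defaultdict


def group_rows_by_dataset(rows):
    """Regroup packed row records into {dataset_index: (X, y)} ordered by row_index."""
    grouped = defaultdict(list)
    for record in rows:
        grouped[record["dataset_index"]].append(record)
    result = {}
    for dataset_index, records in grouped.items():
        records.sort(key=lambda r: r["row_index"])  # restore within-split row order
        features = [r["x"] for r in records]
        targets = [r["y"] for r in records]
        result[dataset_index] = (features, targets)
    return result


# Illustrative records only, deliberately out of order.
rows = [
    {"dataset_index": 1, "row_index": 0, "x": [0.5, 1.0], "y": 1},
    {"dataset_index": 0, "row_index": 1, "x": [0.2, 0.0], "y": 0},
    {"dataset_index": 0, "row_index": 0, "x": [-1.1, 2.0], "y": 1},
]
datasets = group_rows_by_dataset(rows)
```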


Metadata NDJSON structure

Each shard writes one metadata.ndjson file with one JSON record per dataset. Each line contains:

| Key | Type | Description |
| --- | --- | --- |
| dataset_index | int | Global dataset index |
| n_train | int | Train row count for the dataset |
| n_test | int | Test row count for the dataset |
| n_features | int | Feature count for the dataset |
| feature_types | list[str] | Per-feature type annotations (num/cat) |
| metadata | object | The dataset metadata payload described below |

metadata contains the dataset-level generation metadata described below.
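
Reading a shard's metadata.ndjson amounts to parsing one JSON object per line. The following sketch indexes records by dataset_index and checks that feature_types has n_features entries, which follows from the record layout above; parse_metadata_ndjson is a hypothetical helper and the sample records are illustrative, with an empty metadata payload.

```python
import json


def parse_metadata_ndjson(text):
    """Parse NDJSON text into {dataset_index: record}, validating feature_types length."""
    records = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        record = json.loads(line)
        if len(record["feature_types"]) != record["n_features"]:
            raise ValueError(f"line {line_no}: feature_types length != n_features")
        records[record["dataset_index"]] = record
    return records


# Two illustrative records; real records carry a populated metadata object.
sample_text = "\n".join(
    json.dumps(
        {"dataset_index": i, "n_train": 100, "n_test": 20, "n_features": 2,
         "feature_types": ["num", "cat"], "metadata": {}}
    )
    for i in (0, 1)
) + "\n"
records = parse_metadata_ndjson(sample_text)
```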

Top-level keys

| Key | Type | Description |
| --- | --- | --- |
| backend | str | Always "torch" |
| device | str | Compute device (e.g., "cpu", "cuda") |
| requested_device | str | Requested runtime device after CLI/config normalization (for example auto, cpu, cuda, mps) |
| resolved_device | str | Runtime backend selected from the requested device for generation |
| device_fallback_reason | str or null | Reserved field retained for artifact-contract stability; currently always null |
| compute_backend | str | Implementation variant identifier |
| n_features | int | Number of features |
| n_categorical_features | int | Number of categorical features |
| n_classes | int or null | Realized class count in emitted labels (null for regression) |
| graph_nodes | int | Number of nodes in the DAG |
| graph_edges | int | Number of edges in the DAG |
| graph_depth_nodes | int | Longest path length in the DAG |
| graph_edge_density | float | Edge count / max possible edges |
| seed | int | Replay seed recorded by the emitting API. Canonical generation stores the shared run seed here. |
| dataset_seed | int | Optional canonical per-dataset child seed derived from seed; used for deferred replay/diagnostics |
| dataset_index | int | Optional canonical dataset position within the run (0-based) |
| run_num_datasets | int | Optional canonical run length used to replay the saved bundle |
| attempt_used | int | Generation attempt index (0-based) |
| lineage | object | DAG lineage record (see Lineage below) |
| shift | object | Resolved shift settings and realized observability signals |
| noise_distribution | object | Resolved noise-family selection and effective sampling params |
| config | object | Full serialized generator configuration |
| filter | object | Filter results (see below) |
| class_structure | object | Present only for classification (see below) |
| missingness | object | Present only when missingness is enabled |
| layout_mode | str | Optional canonical layout metadata ("fixed" for canonical generation outputs) |
| layout_plan_seed | int | Optional internal seed used to sample the shared per-run layout |
| layout_signature | str | Optional deterministic fingerprint for the shared sampled layout |
| layout_plan_signature | str | Optional deterministic fingerprint for the internal frozen node execution payload |
| layout_plan_schema_version | int | Optional internal metadata version for the canonical shared-layout payload |
| layout_execution_contract | str | Optional internal execution contract identifier for canonical determinism |
| keyed_replay | object | Optional exact keyed subtree replay paths for canonical layout, execution, and dataset roots |
| mechanism_families | object | Optional realized mechanism-family and mechanism-variant coverage for canonical fixed-layout runs |

For canonical generation (generate_one, generate_batch, generate_batch_iter, and dagzoo generate), replay later bundles with the shared seed, run_num_datasets, and dataset_index by regenerating the canonical batch and selecting that index. dataset_seed preserves the per-bundle child seed for deferred replay and diagnostics. Exact keyed subtree replay uses seed together with the keyed_replay paths.

keyed_replay sub-object

Present for canonical generation outputs. These paths are interpreted relative to KeyedRng(metadata["seed"]):

| Key | Type | Description |
| --- | --- | --- |
| layout_root_path | list[str \| int] | Exact keyed path for replaying the shared per-run layout |
| execution_plan_root_path | list[str \| int] | Exact keyed path for replaying the shared execution subtree |
| dataset_root_path | list[str \| int] | Exact keyed path for replaying one bundle’s dataset subtree |

mechanism_families sub-object

Present for canonical fixed-layout generation outputs.

| Key | Type | Description |
| --- | --- | --- |
| sampled_family_counts | object | Realized per-family function-plan counts for the sampled execution plan |
| families_present | list[str] | Sorted family labels with non-zero realized count |
| sampled_variant_counts | object | Realized per-variant counts for internal widened families such as gp.* |
| variants_present | list[str] | Sorted variant labels with non-zero realized count |
| total_function_plans | int | Total realized function-plan count across the execution plan |

Current variant labels are additive observability fields, not separate public family config names. Today they are emitted for widened gp execution as gp.standard, gp.periodic, and gp.multiscale.

Coverage, diversity-audit, and filter-calibration summary artifacts reuse these counts inside mechanism_family_summary and add dataset-level presence-rate fields:

  • dataset_presence_rate_by_family
  • dataset_presence_rate_by_variant

Shift sub-object

Present for all generated bundles. When shift is disabled, scales are 0.0 and multipliers are 1.0.

| Key | Type | Description |
| --- | --- | --- |
| enabled | bool | Whether shift controls were enabled |
| mode | str | Resolved shift mode (off, graph_drift, etc.) |
| graph_scale | float | Resolved graph drift scale |
| mechanism_scale | float | Resolved mechanism drift scale |
| variance_scale | float | Resolved noise drift scale |
| edge_logit_bias_shift | float | Additive shift applied to edge logits |
| mechanism_logit_tilt | float | Mechanism-family tilt applied at sampling |
| variance_sigma_multiplier | float | Sigma multiplier applied to stochastic noise |
| edge_odds_multiplier | float | Edge-odds multiplier (exp(edge_logit_bias_shift)) |
| noise_variance_multiplier | float | Noise-variance multiplier (variance_sigma_multiplier^2) |
| mechanism_nonlinear_mass | float | Probability mass on nonlinear mechanism families ([0, 1]) |
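
Two of the shift fields are documented as derived quantities: edge_odds_multiplier = exp(edge_logit_bias_shift) and noise_variance_multiplier = variance_sigma_multiplier^2. A consumer could sanity-check a record against those identities as sketched below; shift_multipliers_consistent is a hypothetical helper and the tolerance is an arbitrary choice, not a documented guarantee.

```python
import math


def shift_multipliers_consistent(shift, rel_tol=1e-6):
    """Check the two derived multipliers against the fields they are derived from."""
    edge_ok = math.isclose(
        shift["edge_odds_multiplier"],
        math.exp(shift["edge_logit_bias_shift"]),
        rel_tol=rel_tol,
    )
    noise_ok = math.isclose(
        shift["noise_variance_multiplier"],
        shift["variance_sigma_multiplier"] ** 2,
        rel_tol=rel_tol,
    )
    return edge_ok and noise_ok


# Illustrative record for a disabled shift: zero bias shift, unit multipliers.
disabled_shift = {
    "edge_logit_bias_shift": 0.0,
    "edge_odds_multiplier": 1.0,
    "variance_sigma_multiplier": 1.0,
    "noise_variance_multiplier": 1.0,
}
is_consistent = shift_multipliers_consistent(disabled_shift)
```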

Noise Distribution sub-object

Present for all generated bundles.

| Key | Type | Description |
| --- | --- | --- |
| family_requested | str | Configured noise family (gaussian, laplace, student_t, mixture) |
| family_sampled | str | Effective family used by the dataset generation runtime |
| sampling_strategy | str | Runtime selection strategy (dataset_level) |
| base_scale | float | Base noise scale from config |
| student_t_df | float | Student-t degrees of freedom parameter used by the runtime |
| mixture_weights | object or null | Effective normalized mixture weights when family_requested=mixture |

Filter sub-object

| Key | Type | Description |
| --- | --- | --- |
| mode | str | Filter execution mode. Current value is deferred. |
| status | str | not_run for freshly generated outputs; accepted/rejected after dagzoo filter. |
| enabled | bool | Present after deferred filter replay. Always true when replayed. |
| accepted | bool | Present after deferred filter replay. |
| wins_ratio | float | Bootstrap wins ratio (present after deferred replay). |
| n_valid_oob | int | OOB sample count (present after deferred replay). |
| backend | str | Filter implementation identifier (present after deferred replay). |
| threshold_requested | float | Requested filter threshold before class-aware adjustment (present after deferred replay). |
| threshold_effective | float | Effective threshold used in acceptance decision (present after deferred replay). |
| threshold_policy | str | Threshold policy identifier (class_aware_piecewise_v1) (present after deferred replay). |
| class_count | int or null | Realized class count used by filter (null for regression) (present after deferred replay). |
| class_bucket | str | Class-count bucket for policy lookup (present after deferred replay). |
| threshold_delta | float | Difference between requested and effective threshold (present after deferred replay). |
| reason | str | Present on rejected outputs when replay emits a specific rejection reason. |

Class Structure sub-object (classification only)

Present only for classification datasets.

| Key | Type | Description |
| --- | --- | --- |
| n_classes_sampled | int | Layout-sampled class count before postprocessing |
| n_classes_realized | int | Unique class count in emitted y_train + y_test |
| labels_contiguous | bool | Whether labels form contiguous range 0..K-1 |
| train_test_class_match | bool | Whether train and test class sets are identical |
| min_label | int or null | Minimum emitted class label |
| max_label | int or null | Maximum emitted class label |
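
Most of these fields can be recomputed from the emitted labels, which is useful when auditing a bundle. The sketch below assumes that K in "0..K-1" is the realized class count over the union of train and test labels; that reading, and the helper name summarize_class_structure, are assumptions. n_classes_sampled cannot be recovered from labels and is omitted.

```python
def summarize_class_structure(y_train, y_test):
    """Recompute label-derived class-structure fields from plain lists of int labels."""
    train_labels, test_labels = set(y_train), set(y_test)
    all_labels = train_labels | test_labels
    k = len(all_labels)
    return {
        "n_classes_realized": k,
        # Assumption: contiguity is judged on the union of train and test labels.
        "labels_contiguous": all_labels == set(range(k)),
        "train_test_class_match": train_labels == test_labels,
        "min_label": min(all_labels) if all_labels else None,
        "max_label": max(all_labels) if all_labels else None,
    }


# Illustrative labels: class 1 appears in train but not in test.
summary = summarize_class_structure([0, 1, 2, 1], [0, 2])
```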

Fixed-layout metadata

Present for all canonical generation outputs. These bundles share one sampled layout per run and preserve emitted column alignment (feature count, column order, and lineage feature-to-node mapping) within that run.

| Key | Type | Description |
| --- | --- | --- |
| layout_mode | str | "fixed" |
| layout_plan_seed | int | Internal seed used to sample the shared per-run layout |
| layout_signature | str | Stable fingerprint for the shared sampled layout |
| layout_plan_signature | str | Stable fingerprint for the frozen internal node payload |
| layout_plan_schema_version | int | Internal canonical layout metadata version |
| layout_execution_contract | str | Internal execution contract (chunk_batched_v1) |
| keyed_replay | object | Exact keyed subtree replay paths for layout/execution/dataset roots |

Under chunk_batched_v1, canonical fixed-layout outputs are deterministic for the same run seed and realized run shape. Internal plan metadata records the shared sampled layout and execution-plan fingerprint used for that run, while keyed_replay records the exact keyed subtree roots needed for internal replay.

Missingness sub-object (optional)

Present only when missingness is enabled.

| Key | Type | Description |
| --- | --- | --- |
| enabled | bool | Always true when present |
| mechanism | str | "mcar", "mar", or "mnar" |
| target_rate | float | Configured missing rate |
| realized_rate_train | float | Actual missing fraction in train split |
| realized_rate_test | float | Actual missing fraction in test split |
| realized_rate_overall | float | Actual missing fraction overall |
| missing_count_train | int | Number of missing cells in train |
| missing_count_test | int | Number of missing cells in test |
| missing_count_overall | int | Total missing cells |
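
The realized counts and rates can be cross-checked by counting NaN cells, since missing values are encoded as NaN. The sketch below assumes each rate is the number of missing cells divided by the total number of cells in that split, across all feature columns; the denominator is an assumption, as the table above does not define it, and missing_stats is a hypothetical helper.

```python
import math


def missing_stats(x_train, x_test):
    """Count NaN cells per split and overall; rows are plain lists of floats."""
    def count(rows):
        cells = sum(len(row) for row in rows)
        missing = sum(1 for row in rows for value in row if math.isnan(value))
        return missing, cells

    miss_train, cells_train = count(x_train)
    miss_test, cells_test = count(x_test)
    total_cells = cells_train + cells_test
    return {
        "missing_count_train": miss_train,
        "missing_count_test": miss_test,
        "missing_count_overall": miss_train + miss_test,
        # Assumed denominator: every cell in the split, regardless of feature type.
        "realized_rate_train": miss_train / cells_train if cells_train else 0.0,
        "realized_rate_test": miss_test / cells_test if cells_test else 0.0,
        "realized_rate_overall": (miss_train + miss_test) / total_cells if total_cells else 0.0,
    }


nan = float("nan")
stats = missing_stats([[1.0, nan], [nan, 2.0]], [[0.0, 1.0]])
```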

Lineage schema

Schema name: dagzoo.dag_lineage

Version 1.0.0 (dense, in-memory)

Used in the in-memory metadata lineage field during generation. When lineage is persisted to disk, payloads are rewritten to compact version 1.1.0.

{
  "schema_name": "dagzoo.dag_lineage",
  "schema_version": "1.0.0",
  "graph": {
    "n_nodes": 8,
    "adjacency": [[0, 1, 0, ...], ...]
  },
  "assignments": {
    "feature_to_node": [2, 3, 5, 7],
    "target_to_node": 7
  }
}
  • adjacency is an n_nodes x n_nodes list of lists. Entries are 0 or 1. Upper-triangular only; diagonal is always 0. Direction convention is adjacency[src][dst] (src -> dst), so parents of node j are found from column j.
  • feature_to_node[i] is the DAG node index that produces feature i.
  • target_to_node is the DAG node index that produces the target.
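
The direction convention can be exercised directly on the dense matrix: parents of node j are the rows with a 1 in column j, and children of node i are the columns with a 1 in row i. A minimal sketch with hypothetical helper names and an illustrative 4-node graph:

```python
def parents_of(adjacency, j):
    """Nodes src with an edge src -> j, read from column j."""
    return [src for src, row in enumerate(adjacency) if row[j] == 1]


def children_of(adjacency, i):
    """Nodes dst with an edge i -> dst, read from row i."""
    return [dst for dst, value in enumerate(adjacency[i]) if value == 1]


# Illustrative upper-triangular DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
target_parents = parents_of(adjacency, 3)
root_children = children_of(adjacency, 0)
```

To find the parents of feature i, look up feature_to_node[i] first and pass the resulting node index to parents_of.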

Version 1.1.0 (compact, on-disk)

Used in metadata.ndjson dataset records when lineage artifacts are written to disk. Replaces the dense adjacency matrix with a reference to bitpacked binary data.

{
  "schema_name": "dagzoo.dag_lineage",
  "schema_version": "1.1.0",
  "graph": {
    "n_nodes": 8,
    "edge_count": 12,
    "adjacency_ref": {
      "encoding": "upper_triangle_bitpack_v1",
      "blob_path": "lineage/adjacency.bitpack.bin",
      "index_path": "lineage/adjacency.index.json",
      "dataset_index": 0,
      "bit_offset": 0,
      "bit_length": 28,
      "sha256": "a1b2c3..."
    }
  },
  "assignments": {
    "feature_to_node": [2, 3, 5, 7],
    "target_to_node": 7
  }
}

Adjacency encoding: upper_triangle_bitpack_v1

  • Packs the n_nodes * (n_nodes - 1) / 2 upper-triangle bits into bytes.
  • Bit order: little-endian.
  • bit_offset and bit_length locate this dataset’s bits within the shared shard-level blob file.
  • sha256 is a hex-encoded SHA-256 checksum of the packed bytes for this dataset’s adjacency data.
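
The following sketch shows one way a reader could decode such a record. It rests on two assumptions that the bullets above do not spell out: upper-triangle entries are enumerated row-major ((0,1), (0,2), ..., (n-2,n-1)), and "little-endian" bit order means bit k of the stream sits at position k % 8 of byte k // 8, least significant bit first. Both function names are hypothetical, and checksum verification is omitted because the exact byte range covered by sha256 is not specified here.

```python
def pack_upper_triangle(adjacency):
    """Pack upper-triangle bits (assumed row-major, LSB-first) into bytes."""
    n = len(adjacency)
    bits = [adjacency[i][j] for i in range(n) for j in range(i + 1, n)]
    packed = bytearray((len(bits) + 7) // 8)
    for k, bit in enumerate(bits):
        if bit:
            packed[k // 8] |= 1 << (k % 8)
    return bytes(packed), len(bits)


def unpack_upper_triangle(blob, bit_offset, bit_length, n_nodes):
    """Rebuild a dense adjacency matrix from a bit range of the shard blob."""
    expected = n_nodes * (n_nodes - 1) // 2
    if bit_length != expected:
        raise ValueError(f"bit_length {bit_length} != expected {expected}")
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    k = bit_offset
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            adjacency[i][j] = (blob[k // 8] >> (k % 8)) & 1
            k += 1
    return adjacency


# Round trip on an illustrative 4-node DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
dense = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
packed, bit_length = pack_upper_triangle(dense)
restored = unpack_upper_triangle(packed, 0, bit_length, 4)
```

For the 8-node example record above, the expected bit_length is 8 * 7 / 2 = 28, which matches the value shown in adjacency_ref.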

Lineage index file

Each shard contains lineage/adjacency.index.json with:

  • schema_name and schema_version — echo the lineage schema identifiers.
  • encoding — always "upper_triangle_bitpack_v1".
  • records — array of per-dataset offset/length/checksum entries.

Contract guarantees

Determinism — seed derivation is deterministic. For fixed seed and configuration, runs are expected to reproduce metadata and numerical outputs within tolerance. Strict byte-identical tensors/files are not guaranteed across all backends.

Feature alignment: feature_types[i] in each metadata record describes feature index i inside packed x row vectors and tensor column index i in X_train / X_test.

Lineage integrity — each dataset’s bitpacked adjacency data is protected by a SHA-256 checksum recorded in the metadata.

Postprocessing invariants:

  • Canonical generation (generate_one, generate_batch, generate_batch_iter) is fixed-layout-backed and preserves emitted feature schema across the run: constant-column removal and feature-column permutation are disabled.
  • Numeric features are clipped and standardized (approximately zero mean, unit variance).
  • Classification target classes are randomly permuted (label indices carry no ordinal meaning).