Dagzoo Transform Reference
This document is the formal reference for the generation transforms used by
dagzoo. The equations are written to match the current runtime in
src/dagzoo.
Using This Reference
Use this page when you want the mathematical definitions behind the
generator. Each section covers one part of the runtime: DAG sampling, shift,
mechanism families, node execution, converters, and noise. Together they
describe how dagzoo maps a latent DAG into emitted tabular
features, targets, and metadata.
- Section 1 (DAG structure) — topology of causal dependencies
- Section 2 (Shift parameters) — train/test distribution drift
- Section 3 (Mechanism families) — functional complexity of variable relationships
- Section 4 (Node pipeline) — execution mechanics from latent to converter-ready state
- Section 5 (Mechanism definitions) — catalog of functional transform families
- Section 6 (Converters) — mapping from latent state to observed tabular features
- Section 7 (Noise) — stochastic variation character
The sections are organized so you can either read end to end or jump directly to the transform family you need. When you are checking a config or a runtime behavior, the practical question is: which transform changes, and how does that change show up in the emitted dataset?
Notation and Symbols
Primary Variable: Symbol map (this section's
notation table).
Dependency Map: all symbols used
in later equations -> definitions in this table.
Path to
Final Output: unambiguous symbols -> unambiguous equations
-> correct interpretation of how transforms produce X,
y, and metadata.
Rationale. This table fixes symbol meanings up front so equations stay unambiguous across sections. Symbol names mirror runtime naming in core modules, making the spec easier to trace back to code.
| Symbol | Meaning |
|---|---|
| \(N\) | Number of latent DAG nodes. |
| \(n\) | Number of rows sampled for one dataset attempt (\(n = n_{\text{train}} + n_{\text{test}}\)). |
| \(G \in \{0,1\}^{N \times N}\) | DAG adjacency matrix with convention G[src, dst]. |
| \(i, j\) | Node indices in \(\{0,\dots,N-1\}\). |
| \(\mathsf{pa}(j)\) | Parent index set for node \(j\): \(\{i : G_{ij}=1\}\). |
| \(\sigma(x)\) | Logistic sigmoid \((1 + e^{-x})^{-1}\). |
| \(A, B_i, C_j\) | DAG latent logits: \(A\) is global edge propensity, \(B_i\) is source-node effect, \(C_j\) is destination-node effect. |
| \(\beta_{\text{edge}}\) | Edge logit bias applied in DAG sampling. |
| \(\tau\) | Mechanism logit tilt (mechanism_logit_tilt). |
| \(\lambda_f\) | Raw mechanism family weight for family \(f\) from mechanism.function_family_mix. |
| \(\pi_f^{(0)}\) | Baseline mechanism-family probability before logit tilt. |
| \(\pi_f\) | Final sampling probability for mechanism family \(f\). |
| \(\ell_f\) | Family base logit for family \(f\) (from MECHANISM_FAMILY_BASE_LOGITS). |
| \(\tilde{\ell}_f\) | Centered family base logit: \(\ell_f - \frac{1}{\lvert\mathcal{F}\rvert}\sum_{g\in\mathcal{F}}\ell_g\). |
| \(\mathcal{F}\) | Ordered mechanism family set: (nn, tree, discretization, gp, linear, quadratic, em, product, piecewise). |
| \(Z_j \in \mathbb{R}^{n \times d}\) | Node-\(j\) latent tensor after node pipeline transforms. |
| \(d_{\text{req}}\) | Required latent width from converter specs: \(\sum_s \max(1, d_s)\). |
| \(d_{\text{extra}}\) | Additional sampled latent width: \(\max(1, \lfloor \mathsf{LogUniform}(1,32) \rfloor)\). |
| \(d\) | Node latent width used for mechanism output: \(d = d_{\text{req}} + d_{\text{extra}}\). |
| \(d_s\) | Width requested by converter spec \(s\) before max-with-1 clamp. |
| \(w \in \mathbb{R}^d\) | Positive random weight vector (normalized to sum 1, then permuted). |
| \(X\) | Generic mechanism input matrix. |
| \(Y\) | Generic mechanism output matrix. |
| \(p_{ij}\) | DAG edge probability for ordered pair \((i,j)\) with \(i < j\). |
| \(\gamma_\sigma\) | Noise standard-deviation multiplier from shift runtime (variance_sigma_multiplier). |
| \(\gamma_{\text{var}}\) | Noise variance multiplier: \(\gamma_{\text{var}} = \gamma_\sigma^2\). |
| \(\mathcal{D}_{\text{noise}}\) | Active runtime noise family (gaussian, laplace, or student_t after runtime selection). |
| \(C\) | Categorical cardinality for a converter target. |
| \(\mathsf{LSE}\) | LogSumExp along parent axis. |
| \(T\) | Number of trees in a sampled tree ensemble. |
| \(P\) | Number of random features in GP approximation (fixed at 256). |
| \(s_j\) | Final node-level multiplicative latent scale. |
| \(\mathcal{O}\) | Final attempt output tuple: \((X, y, \text{metadata})\). |
Operator Reference
| Operator | Definition | Input/Output Shape Semantics | Where Used | Output Impact |
|---|---|---|---|---|
| \(\mathsf{pa}(j)\) | Parent index set of node \(j\): \(\{i: G_{ij}=1\}\). | Input: scalar node index \(j\) and adjacency \(G\); Output: set of parent indices. | Section 1 and Section 4. | Determines which upstream node tensors are composed for node \(j\). |
| \(\mathsf{Agg}(Y_1,\dots,Y_k)\) | Multi-parent aggregation operator sampled from \(\{\sum,\prod,\max,\mathsf{LSE}\}\). | Input: parent-aligned tensors \((n \times d)\); Output: one \((n \times d)\) tensor. | Section 4.2. | Changes how parent effects combine before downstream mechanism/converter steps. |
| \(\mathsf{LSE}(Y_1,\dots,Y_k)\) | LogSumExp over parent axis: \(\log\sum_r e^{Y_r}\). | Input: stacked parent outputs; Output: aggregated tensor \((n \times d)\). | Section 4.2. | Smoothly emphasizes larger parent responses while retaining contributions from others. |
| \(\arg\min_k\) | Index of minimum value over candidate index \(k\). | Input: candidate scores/distances; Output: integer index per row. | Sections 5.5 and 6.2. | Produces center assignments that drive discrete or categorical structure. |
| \(\sum,\prod,\max\) | Sum/product/max reductions over specified axis. | Input: tensors with reduction axis; Output: reduced tensor. | Sections 4.2, 4.3, 5.2, 5.4, 5.7. | Controls interaction form (additive, multiplicative, extremal) in latent construction. |
| \(\mathsf{Converter}_s(V_s)\) | Converter function for spec \(s\), returning transformed slice and extracted value. | Input: latent view \(V_s\); Output: \((X'_s, v_s)\). | Section 4.3 and Section 6. | Directly emits observed feature values that become the final complete-data X; the target is then generated by the conditional complete-data head, and optional missingness may later mask the emitted features. |
| \(\mathsf{Linear}(\cdot)\) | Random linear map to the configured output width. | Input: matrix \((n \times d_{in})\); Output: matrix \((n \times d_{out})\). | Sections 5.5 and 5.7. | Projects intermediate representations into node latent channels used by converters. |
| \(\mathsf{leaf}(X)\) | Oblivious-tree leaf index determined by sampled splits/thresholds. | Input: row vector(s) and tree splits; Output: leaf id(s). | Section 5.4. | Selects leaf values that define piecewise latent behavior. |
| \(\mathsf{logit}_{ik}\) | Pre-softmax score for row \(i\), component/class \(k\). | Input: row-to-center distances and scale terms; Output: score matrix. | Section 5.7. | Determines assignment probabilities that shape EM-style latent outputs. |
| \(\mathsf{softmax}_k(\cdot)\) | Normalize exponentiated logits over index \(k\) to probabilities. | Input: logits per row; Output: probabilities summing to 1 across \(k\). | Sections 5.7 and 6.2. | Converts scores into stochastic assignments that drive categorical/mixture behavior. |
| \(\mathsf{sample\_noise}(\cdot)\) | Noise sampler with configurable family and scale. | Input: shape + family params; Output: random tensor with requested shape. | Section 7.1. | Injects stochastic variation into points, matrices, weights, and node outputs. |
| \(\mathsf{clip}(x,a,b)\) | Clamp each entry of \(x\) into \([a,b]\). | Input: tensor; Output: tensor with same shape. | Section 4.3. | Prevents extreme values from destabilizing downstream standardization/conversion. |
| \(\mathsf{nan\_to\_num}(x)\) | Replace NaN/Inf entries with finite values. | Input: tensor; Output: finite tensor with same shape. | Section 4.3. | Stabilizes latent values before weighting and conversion. |
| \(\mathsf{standardize}(x)\) | Column-wise centering and scaling to unit variance (with epsilon guard). | Input: matrix \((n \times d)\); Output: matrix \((n \times d)\). | Sections 4.3 and 6.2. | Controls scale so mechanism outputs/converters are numerically comparable. |
| \(\mathsf{Cauchy}(0,1)\) | Standard Cauchy distribution. | Input: location/scale; Output: scalar or tensor draws. | Section 1. | Provides heavy-tailed latent logits for heterogeneous graph connectivity. |
| \(\mathsf{Bernoulli}(p)\) | Bernoulli draw with success probability \(p\). | Input: probability \(p \in [0,1]\); Output: binary draw. | Section 1. | Converts edge probabilities into realized adjacency \(G\). |
| \(\mathsf{Uniform}(a,b)\) | Continuous uniform distribution on \([a,b]\). | Input: bounds \((a,b)\); Output: scalar/tensor draws. | Sections 4.1 and 7.1. | Drives root sampling and inverse-CDF noise construction. |
| \(\mathsf{UniformInt}\{m,\dots,n\}\) | Discrete uniform integer sampler on inclusive range. | Input: integer bounds; Output: integer draws. | Section 5.4. | Samples tree depths and related discrete structure choices. |
| \(\mathsf{LogUniform}(a,b)\) | Distribution with \(\log x\) uniform on \([\log a,\log b]\). | Input: positive bounds \((a,b)\); Output: positive scalar draws. | Sections 4.3, 5.5, 6.1, 6.2. | Controls scales/exponents over multiple orders of magnitude. |
| \(\mathsf{Categorical}(p)\) | Discrete draw from class probabilities \(p\). | Input: probability vector per row; Output: class/label index. | Section 6.2. | Emits categorical feature labels that populate final observed features. |
Conventions:
- Unless noted otherwise, operations are row-wise over batch dimension \(n\) and preserve row count.
- In \(\mathsf{Agg}\), the aggregation kind is sampled once per multi-parent composition call and then applied elementwise across parents.
- In noise sampling, the mixture family in Section 7.1 samples component assignments per element, while Section 7.2 resolves one family per dataset attempt before downstream draws.
End-to-End Primary Variable Map
Primary Variable: \(\mathcal{O} = (X, y,
\text{metadata})\).
Dependency Map: \(s_{\text{shift}}, A, B_i, C_j, \lambda_f,
\tilde{\ell}_f, \text{noise\_spec}, \epsilon, G, \pi_f, Z_{\text{base}},
Z_{\text{comp}}, Y, Z_{\text{post}}, (X'_s, v_s) \rightarrow
\mathcal{O}\).
Path to Final Output: shift
runtime values define graph, mechanism, and noise controls; these drive
node latents and converter emissions that become \((X,y)\), while runtime selections and
graph/mechanism/noise summaries become metadata in \(\mathcal{O}\).
Rationale. This section is the single whole-picture view that ties section-level primary variables into one output contract. It complements the per-section formal details by making global dataflow explicit.
Pipeline skeleton:
\[s_{\text{shift}} \rightarrow (\beta_{\text{edge}}, \tau, \gamma_{\sigma}, \gamma_{\text{var}})\]
\[\left(\beta_{\text{edge}}, A, B_i, C_j\right) \rightarrow G\]
\[\left(\lambda_f, \tau, \tilde{\ell}_f\right) \rightarrow \pi_f\]
\[\text{noise\_spec} \rightarrow \epsilon_{\text{family}} \rightarrow \epsilon\]
\[\left(G, \pi_f, \epsilon, Z_{\text{base}}, Z_{\text{comp}}, Y, Z_{\text{post}}\right) \rightarrow \{Z_j\}_{j=0}^{N-1}\]
\[\left(\{Z_j\}, \mathsf{Converter}_s\right) \rightarrow \{(X'_s, v_s)\}_s \rightarrow (X, y)\]
\[\left(G, \pi_f, \text{noise\_spec}, s_{\text{shift}}\right) \rightarrow \text{metadata}\]
\[\mathcal{O} = (X, y, \text{metadata})\]
Operator role notes:
- \(\mathsf{Agg}\) is the bridge from multiple parent tensors to \(Z_{\text{comp}}\) (Section 4.2).
- \(\mathsf{Converter}_s\) is the bridge from latent state to emitted observable values (Section 4.3 and Section 6).
Primary variable crosswalk:
| Section | Primary Variable | Immediate Inputs | Immediate Output | Contribution to \(\mathcal{O}\) |
|---|---|---|---|---|
| 1. DAG Structure Sampling | \(G\) | \(A, B_i, C_j, \beta_{\text{edge}}\) | Parent sets \(\mathsf{pa}(j)\) | Controls execution order and parent flow that shapes \((X,y)\) and graph metadata. |
| 2. Shift Runtime Parameters | \(s_{\text{shift}}\) | Resolved shift config | \((\beta_{\text{edge}}, \tau, \gamma_{\sigma}, \gamma_{\text{var}})\) | Couples drift controls into graph density, mechanism usage, and noise magnitude in outputs/metadata. |
| 3. Mechanism Family Sampling | \(\pi_f\) | \(\lambda_f, \tau, \tilde{\ell}_f, \mathcal{F}\) | Family draw distribution per mechanism call | Determines which transform family generates latent structure before conversion to \((X,y)\). |
| 4. Node Generation Pipeline | \(Z_j\) | Parent/root inputs, width terms, sampled mechanisms, weights, scale | Node-level latent tensor | Shared latent state from which converter slices emit observed features/targets. |
| 4.1 Root-node base sampling | \(Z_{\text{base}}\) | Base-kind selection, distribution params, \(\mathcal{D}_{\text{noise}}\) | Initial latent input for root mechanisms | Seeds root latent variability that propagates to downstream nodes and final observables. |
| 4.2 Parent composition | \(Z_{\text{comp}}\) | Parent tensors, concat/aggregation path, \(\mathsf{Agg}\) | Composed latent input for node transform | Defines multi-parent interaction pattern that feeds mechanism outputs used in \((X,y)\). |
| 4.3 Post-processing and slicing | \(Z_{\text{post}}\) | Raw latent, sanitize/standardize, weighting, padding, slice writeback | Converter-ready latent state | Directly feeds converter extraction and latent feedback that determines emitted values. |
| 5. Mechanism Definitions | \(Y = f(X)\) | Mechanism parameters and input \(X\) | Mechanism output tensor | Family-level nonlinear map that populates latent channels later converted to outputs. |
| 5.1 Linear | \(Y_{\text{linear}}\) | \(X, M\) | Linear latent channels | Baseline low-complexity latent effects converted into numeric/categorical outputs. |
| 5.2 Quadratic | \(Y_{\text{quad}}\) | \(X_{\text{sub}}, X_{\text{aug}}, M_t\) | Interaction-heavy latent channels | Adds pairwise interaction structure reflected in emitted features/targets. |
| 5.3 Neural network | \(Y_{\text{nn}}\) | \(X, \{M_{\ell}\}, \{\phi_{\ell}\}\) | Deep nonlinear latent channels | Produces high-capacity nonlinear signals passed through converters into \((X,y)\). |
| 5.4 Tree ensemble | \(Y_{\text{tree}}\) | Splits, thresholds, leaf values | Piecewise latent channels | Introduces discontinuous region-based structure visible in output distributions. |
| 5.5 Discretization | \(Y_{\text{disc}}\) | Centers, distance exponent, nearest-center assignment | Clustered latent channels | Creates quantized/cluster-like effects in converted observables. |
| 5.6 GP/RFF | \(Y_{\text{gp}}\) | \(X_{\text{proj}}, \Omega, b, V\) | Smooth kernel-like latent channels | Adds smooth nonlinear components that converters emit as continuous structure. |
| 5.7 EM assignment | \(Y_{\text{em}}\) | Centers, scales, exponents, assignment probabilities | Mixture-like latent channels | Encodes local mixture behavior that propagates to final emitted values. |
| 5.8 Product family | \(Y_{\text{prod}}\) | Two component family outputs \(f(X), g(X)\) | Multiplicative latent channels | Adds higher-order multiplicative effects before converter extraction. |
| 6. Converter Definitions | \((X'_s, v_s)\) | Latent slice, converter kind/params | Updated latent slice plus extracted value | Defines the emission boundary from latent state to observed feature values. |
| 6.1 Numeric converter | \((X'_{\text{num}}, v_{\text{num}})\) | Numeric slice, warp params, branch choice | Numeric extracted value and feedback slice | Produces numeric entries in \((X,y)\) and latent feedback for downstream coupling. |
| 6.2 Categorical converter | \((X'_{\text{cat}}, \ell)\) | Method/variant, cardinality, logits/centers | Categorical label and feedback slice | Produces categorical feature/class outputs and associated latent feedback. |
| 7. Noise Sampling and Selection | \(\epsilon\) | Family, scale, family-specific params | Runtime noise draws | Injects stochastic variation throughout latent generation and converter inputs. |
| 7.1 Primitive family samplers | \(\epsilon_{\text{family}}\) | Primitive family choice and parameters | Family-specific noise samples | Supplies concrete perturbation draws used by samplers/mechanisms. |
| 7.2 Dataset-level family resolution | noise_spec | Requested family, mixture weights, seeded RNG | One attempt-level family spec | Determines dataset-level noise identity reflected in outputs and metadata. |
1. DAG Structure Sampling
Motivation. The graph structure is the causal skeleton that determines which variables influence which. Consider three corpora:
Corpus A (Erdős–Rényi, p=0.3): every DAG looks like a medium-density mesh
→ all datasets have ~similar dependency depth
Corpus B (Erdős–Rényi, varied p): sparser and denser meshes, but still homogeneous
per-node — no hub nodes, no isolated chains
Corpus C (Cauchy latents): some DAGs are star-shaped (one hub drives everything),
some are sparse chains (A→B→C→D), some are dense webs
→ the prior covers qualitatively different causal regimes
Corpus C is what this section produces. The Cauchy distribution's heavy tails
mean that occasionally one node's latent \(B_i\)
is extreme, creating a natural hub; another node's \(C_j\)
may be deeply negative, making it nearly parentless. This heterogeneity is
deliberate — real tabular data comes from causal structures that range from
near-independent features to deeply entangled causal chains, and the prior must
cover that full range. The edge_logit_bias
(\(\beta_{\text{edge}}\)) shifts the overall
density surface up or down, acting as the "how connected" knob for the graph
axis of effective diversity.
Primary Variable: \(G\) (DAG adjacency matrix).
Dependency Map: \(A, B_i,
C_j, \beta_{\text{edge}} \rightarrow p_{ij} \rightarrow
G_{ij}\).
Path to Final Output: \(G \rightarrow \mathsf{pa}(j)\) for each
node -> node execution order and parent inputs -> latent tensors
-> emitted X, y, and graph
metadata.
Rationale. The latent-logit Cauchy construction yields heterogeneous edge probabilities while strict upper-triangular masking guarantees acyclicity. Writing \(p_{ij}\) and \(\beta_{\text{edge}}\) explicitly keeps graph-drift behavior inspectable and testable.
sample_dag draws a strict upper-triangular adjacency
matrix.
\[A \sim \mathsf{Cauchy}(0,1), \quad B_i \sim \mathsf{Cauchy}(0,1), \quad C_j \sim \mathsf{Cauchy}(0,1)\]
Interpretation:
- \(A\): global intercept-like term shifting all edge logits together.
- \(B_i\): source-node effect for node \(i\) (how strongly node \(i\) tends to emit edges).
- \(C_j\): destination-node effect for node \(j\) (how strongly node \(j\) tends to receive edges).
For each \(i < j\):
\[p_{ij} = \sigma(A + B_i + C_j + \beta_{\text{edge}}), \quad G_{ij} \sim \mathsf{Bernoulli}(p_{ij})\]
and for \(i \ge j\), \(G_{ij}=0\).
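The sampling above can be sketched as follows. This is an illustrative reimplementation, not the runtime's sample_dag; the function name and numpy RNG usage are assumptions.

```python
import numpy as np

def sample_dag(n_nodes: int, edge_logit_bias: float, rng: np.random.Generator) -> np.ndarray:
    """Sketch of Section 1: strict upper-triangular DAG from Cauchy latent logits."""
    A = rng.standard_cauchy()            # global edge-propensity intercept
    B = rng.standard_cauchy(n_nodes)     # source-node effects B_i
    C = rng.standard_cauchy(n_nodes)     # destination-node effects C_j
    # Edge logit for every ordered pair (i, j), then sigmoid -> p_ij.
    logits = A + B[:, None] + C[None, :] + edge_logit_bias
    with np.errstate(over="ignore"):     # Cauchy tails can overflow exp harmlessly
        p = 1.0 / (1.0 + np.exp(-logits))
    G = (rng.random((n_nodes, n_nodes)) < p).astype(np.int8)
    # Acyclicity: keep only i < j (strict upper triangle).
    return np.triu(G, k=1)

rng = np.random.default_rng(0)
G = sample_dag(6, edge_logit_bias=0.0, rng=rng)
parents_of_3 = np.flatnonzero(G[:, 3])   # pa(3) = {i : G[i, 3] = 1}
```

Because the Cauchy latents are heavy-tailed, a single extreme \(B_i\) or \(C_j\) draw is enough to produce the hub or near-parentless nodes described in the motivation.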
2. Shift Runtime Parameters
Motivation. A model trained on i.i.d. synthetic data has never seen train/test mismatch during pretraining. At inference, real tasks routinely exhibit shift — and the model has no prior experience to fall back on. These three shift axes map to distinct real-world drift patterns:
graph_scale → causal structure itself changes between train and test (e.g., new
regulations alter which variables are causally relevant)
mechanism_scale → functional relationships change while the graph stays the same
(e.g., customer behavior shifts while the same features remain relevant)
variance_scale → stochastic variation changes in magnitude (e.g., sensor
degradation increases measurement noise over time)
Because the three axes are independent, a researcher can ablate each in isolation: hold graph and mechanism fixed while sweeping noise drift to measure noise-robustness, or hold noise fixed while tilting mechanism complexity. Shift-aware constructions in the prior have been shown to improve robustness to temporal distribution shifts. The deterministic mappings from config scales to runtime coefficients (\(\beta_{\text{edge}}\), \(\tau\), \(\gamma_\sigma\)) keep each axis interpretable and independently testable.
Primary Variable: \(s_{\text{shift}} = (\text{graph\_scale},
\text{mechanism\_scale}, \text{variance\_scale})\).
Dependency Map: \(s_{\text{shift}} \rightarrow (\beta_{\text{edge}},
\tau, \gamma_{\sigma}, \gamma_{\text{var}})\) via deterministic
runtime mappings.
Path to Final Output: \(\beta_{\text{edge}} \rightarrow G\), \(\tau \rightarrow \pi_f\), and \(\gamma_{\sigma} \rightarrow\) noise
amplitude; these three paths converge in node generation and sampling to
shape final X, y, and shift metadata.
Rationale. Shift controls are defined as deterministic mappings from config scales to runtime coefficients so drift remains interpretable. Separating graph, mechanism, and variance mappings keeps each drift axis analyzable in isolation.
With resolved shift scales
(graph_scale, mechanism_scale, variance_scale):
\[\beta_{\text{edge}} = \ln(2) \cdot \text{graph\_scale}\] \[\tau = \text{mechanism\_scale}\] \[\gamma_\sigma = \exp\left(\frac{\ln(2)}{2} \cdot \text{variance\_scale}\right), \quad \gamma_{\text{var}} = \gamma_\sigma^2 = 2^{\text{variance\_scale}}\]
So positive variance_scale increases global noise
variance multiplicatively.
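The three deterministic mappings can be sketched in a few lines; the function name is illustrative, not the runtime's:

```python
import math

def resolve_shift_runtime(graph_scale: float, mechanism_scale: float,
                          variance_scale: float):
    """Sketch of the Section 2 config-scale -> runtime-coefficient mappings."""
    beta_edge = math.log(2) * graph_scale                        # edge logit bias
    tau = mechanism_scale                                        # mechanism logit tilt
    gamma_sigma = math.exp(0.5 * math.log(2) * variance_scale)   # std-dev multiplier
    gamma_var = gamma_sigma ** 2                                 # equals 2 ** variance_scale
    return beta_edge, tau, gamma_sigma, gamma_var

beta_edge, tau, gamma_sigma, gamma_var = resolve_shift_runtime(1.0, 0.5, 2.0)
```

Each unit of variance_scale doubles the noise variance, and graph_scale = 1 adds ln(2) to every edge logit, which roughly doubles the odds of each edge.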
3. Mechanism Family Sampling with Mix + Tilt
Motivation. Real tabular relationships span a wide functional spectrum. A prior restricted to, say, linear + neural-network mechanisms will never produce piecewise decision boundaries (tree-like), cluster-assignment patterns (discretization), or smooth periodic surfaces (GP) — and the foundation model trained on it will struggle when encountering those patterns at inference time.
linear → y ≈ Wx + b (low complexity, smooth)
quadratic → y ≈ x'Ax + Wx + b (pairwise interactions)
neural network → y = σ(W₂ σ(W₁x + b₁) + b₂) (high-capacity smooth nonlinearity)
tree ensemble → y = Σ leaf_values[split_path] (piecewise constant, discontinuous)
discretization → y = f(nearest_center(x)) (clustered, Voronoi-like)
GP / RFF → y = V cos(Ωx + b) (smooth kernel, possibly periodic)
EM assignment → y = Σ p(k|x) · μ_k (soft mixture, local structure)
product → y = f(x) · g(x) (multiplicative interaction)
piecewise → y = staged_branches(x) (explicit control path)
The function_family_mix controls which families are
present (support) and their baseline probabilities. The
mechanism_logit_tilt (\(\tau\))
controls how much the distribution shifts toward nonlinear families
within that support. Together they form the most direct lever for changing the
functional complexity profile of the prior. If your model underperforms on
tasks with tree-like decision
boundaries, increasing tree weight in the mix and measuring the downstream
effect is the first experiment to try.
Primary Variable: \(\pi_f\) (mechanism-family sampling
distribution).
Dependency Map: \(\lambda_f, \tau, \tilde{\ell}_f, \mathcal{F}
\rightarrow \pi_f\).
Path to Final Output:
\(\pi_f \rightarrow\) sampled mechanism
family at each function draw -> latent transform type and
nonlinearity -> converted observable values in
X/y and nonlinear-mass-related
metadata.
Rationale. This parameterization separates family
support control (function_family_mix) from directional
drift (mechanism_logit_tilt). Centered logits make tilt a
relative reweighting mechanism rather than an arbitrary global scaling
change.
Let raw family weights be:
\[\lambda_f = \begin{cases} 1, & \text{if no function\_family\_mix} \\ \text{configured positive weight}, & \text{otherwise} \end{cases}\]
Normalize over positive-weight families:
\[\pi_f^{(0)} = \frac{\lambda_f}{\sum_{g \in \mathcal{F}} \lambda_g}\]
\(\pi_f^{(0)}\) is the baseline
family distribution after applying function_family_mix
support/weights and before any tilt.
If \(\tau \le 0\), then \(\pi_f = \pi_f^{(0)}\). If \(\tau > 0\), logits are tilted using centered family base logits:
\[\pi_f \propto \exp\left(\ln \pi_f^{(0)} + \tau \tilde{\ell}_f\right)\]
then normalized to sum to 1.
Configuration intuition:
- mechanism.function_family_mix controls support and baseline weights through \(\lambda_f\).
- shift.mechanism_scale sets \(\tau\) (Section 2), which reweights within that support.
- Increasing \(\tau\) shifts mass toward families with higher \(\tilde{\ell}_f\) and away from families with lower \(\tilde{\ell}_f\), changing mechanism usage frequencies that shape final X/y.
- Example: with mix {nn: 0.7, linear: 0.3}, \(\tau=0\) gives \(\pi^{(0)}_{nn}=0.7,\ \pi^{(0)}_{linear}=0.3\); with \(\tau>0\), nn gains additional mass because its centered base logit is higher than linear's.
Base logits used in \(\tilde{\ell}_f\) (ONLY IF \(\tau > 0\)):
| Family \(f\) | \(\ell_f\) |
|---|---|
| nn | 0.7 |
| tree | 0.7 |
| discretization | 0.5 |
| gp | 0.5 |
| linear | -0.8 |
| quadratic | -0.6 |
| em | -0.3 |
| product | 0.9 |
Rationale for Non-uniform Base Logits
The base logits define the direction of mechanism drift, not the default family distribution.
- When mechanism_logit_tilt <= 0, family probabilities are just normalized family weights (uniform when no function_family_mix is provided), so base logits have no effect.
- When mechanism_logit_tilt > 0, centered base logits reweight probabilities toward families with larger centered logits and away from families with smaller centered logits.
- In the current defaults, this shifts mass toward nonlinear families (nn, tree, discretization, gp, product) and away from simpler families (linear, quadratic), increasing nonlinear mechanism mass under positive tilt.
- function_family_mix still controls support and baseline weights; base logits only reweight within enabled families.
Implementation note: product is only valid when at least
one of its component families is enabled (tree,
discretization, gp, linear,
quadratic).
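The mix-plus-tilt computation can be sketched as below. This is an illustrative reimplementation; the base logits are copied from the table above (which omits piecewise, so it is omitted here too), and the function name is an assumption.

```python
import numpy as np

# Base logits from the Section 3 table (piecewise not listed there).
BASE_LOGITS = {"nn": 0.7, "tree": 0.7, "discretization": 0.5, "gp": 0.5,
               "linear": -0.8, "quadratic": -0.6, "em": -0.3, "product": 0.9}

def family_probs(mix, tau):
    """Sketch of Section 3: normalize lambda_f, then tilt by centered base logits."""
    families = list(BASE_LOGITS)
    lam = np.array([1.0 if mix is None else mix.get(f, 0.0) for f in families])
    pi0 = lam / lam.sum()                     # baseline distribution pi_f^(0)
    if tau <= 0:
        return dict(zip(families, pi0))
    ell = np.array([BASE_LOGITS[f] for f in families])
    ell_tilde = ell - ell.mean()              # centered base logits
    with np.errstate(divide="ignore"):        # log(0) -> -inf for disabled families
        logits = np.log(pi0) + tau * ell_tilde
    pi = np.exp(logits - logits.max())        # stable softmax
    pi /= pi.sum()
    return dict(zip(families, pi))

pi = family_probs({"nn": 0.7, "linear": 0.3}, tau=1.0)
```

Note that disabled families (zero mix weight) stay at probability zero after tilting: the tilt only reweights within the configured support, exactly as the rationale above states.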
4. Node Generation Pipeline
Motivation. The node pipeline is where latent causal structure becomes concrete tensors. Each step — root sampling, parent composition, sanitization, weighting, converter slicing — adds a controlled layer of variation. The design is ordered so that:
root nodes → fresh latent geometry (normal, uniform, unit-ball, correlated)
child nodes → inherit + transform parent signals via multi-parent composition
all nodes → sanitize → standardize → weight → slice for converters → scale
Multiple root-base geometries (Section 4.1) ensure that even before any mechanism fires, latent inputs already vary in shape — uniform grids, spherical point clouds, correlated ellipsoids. Multi-parent composition (Section 4.2) controls how upstream causal effects combine: additive (sum), multiplicative (product), extremal (max), or smooth-extremal (logsumexp). These choices determine the interaction patterns visible in the final tabular data — a sum aggregation produces additive parent effects, while a product aggregation creates multiplicative interactions that no single-parent mechanism can mimic.
Primary Variable: \(Z_j\) (node-\(j\) latent tensor after pipeline
transforms).
Dependency Map: root/parent inputs,
mechanism draws, width terms \((d_{\text{req}}, d_{\text{extra}})\),
random weights \(w\), and scale \(s_j \rightarrow Z_j\).
Path to
Final Output: \(Z_j
\rightarrow\) converter slices, extracted feature values, and one
selected target-node conversion -> final
X/y.
Rationale. Node generation is documented as an ordered contract because converter extraction and metadata semantics depend on stable execution order. The split into root sampling, parent composition, and latent refinement mirrors the actual runtime path.
4.1 Root-node base sampling
Primary Variable: \(Z_{\text{base}}\).
Dependency
Map: base-kind choice + distribution parameters + active noise
family \(\mathcal{D}_{\text{noise}}
\rightarrow Z_{\text{base}}\).
Path to Final
Output: \(Z_{\text{base}} \rightarrow
f(Z_{\text{base}})\) (initial node latent) -> downstream node
pipeline -> feature converter extraction ->
X/y.
Rationale. Multiple root base geometries increase latent diversity before any parent effects exist. Passing every sampled base through \(f(\cdot)\) keeps root and non-root nodes aligned under the same mechanism-family controls.
For root nodes (\(\mathsf{pa}(j) = \varnothing\)), one base kind is chosen uniformly from:
- normal: \(Z_{\text{base}} \sim \mathcal{D}_{\text{noise}}^{n \times d}\).
- uniform: elementwise \(\mathsf{Uniform}(-1,1)\).
- unit_ball: for each row, \[v \sim \mathcal{N}(0, I_d), \quad u \sim \mathsf{Uniform}(0,1), \quad z = \frac{v}{\lVert v\rVert_2} u^{1/d}\]
- normal_cov: \[E \sim \mathcal{D}_{\text{noise}}^{n \times d}, \quad A \sim \mathcal{D}_{\text{noise}}^{d \times d}, \quad Z_{\text{base}} = (E \odot w^\top) A^\top\] where \(w\) is sampled by sample_random_weights (a positive, sum-to-one, randomly permuted weight vector following a noisy power-law decay).
Then a random mechanism is applied:
\[Z_j = f(Z_{\text{base}})\]
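The four base kinds can be sketched as below. Assumptions: Gaussian draws stand in for the active noise family \(\mathcal{D}_{\text{noise}}\), a Dirichlet draw stands in for sample_random_weights, and the function name is illustrative.

```python
import numpy as np

def sample_root_base(n: int, d: int, kind: str, rng: np.random.Generator) -> np.ndarray:
    """Sketch of the Section 4.1 root base geometries."""
    if kind == "normal":
        return rng.standard_normal((n, d))
    if kind == "uniform":
        return rng.uniform(-1.0, 1.0, size=(n, d))
    if kind == "unit_ball":
        v = rng.standard_normal((n, d))
        u = rng.uniform(0.0, 1.0, size=(n, 1))
        # Project to the unit sphere, then pull the radius in by u^(1/d),
        # which gives a uniform draw inside the unit ball.
        return v / np.linalg.norm(v, axis=1, keepdims=True) * u ** (1.0 / d)
    if kind == "normal_cov":
        E = rng.standard_normal((n, d))
        A = rng.standard_normal((d, d))
        w = rng.dirichlet(np.ones(d))   # stand-in for sample_random_weights
        return (E * w) @ A.T            # (E ⊙ w^T) A^T: correlated ellipsoid
    raise ValueError(kind)

Z = sample_root_base(1000, 5, "unit_ball", rng=np.random.default_rng(0))
```

The unit_ball branch illustrates why the \(u^{1/d}\) factor matters: without it, points would pile up near the center instead of filling the ball uniformly.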
4.2 Parent composition (non-root nodes)
Primary Variable: \(Z_{\text{comp}}\) (composed parent latent
input to node transform).
Dependency Map: parent
tensors \(\{Z_{p_r}\}\) + composition
path choice + aggregation operator \(\mathsf{Agg} \rightarrow
Z_{\text{comp}}\).
Path to Final Output:
\(Z_{\text{comp}} \rightarrow\) node
mechanism output -> updated \(Z_j\)
-> feature converter extraction plus target-node conversion ->
emitted X/y.
Rationale. The 50/50 concat-versus-aggregation branch injects both feature-mixing and commutative interaction patterns. This broadens functional regimes without changing the node interface contract.
If \(|\mathsf{pa}(j)|=1\), apply one random mechanism to that parent tensor.
If \(|\mathsf{pa}(j)| \ge 2\), choose one path with equal probability:
- Concatenation path: \[Z_{\text{in}} = [Z_{p_1} | Z_{p_2} | \dots | Z_{p_k}], \quad Z_j = f(Z_{\text{in}})\]
- Aggregation path: \[Y_r = f_r(Z_{p_r}), \quad Z_j = \mathsf{Agg}(Y_1, \dots, Y_k)\] with \(\mathsf{Agg}\) sampled uniformly from: \[\sum_r Y_r, \quad \prod_r Y_r, \quad \max_r Y_r, \quad \mathsf{LSE}(Y_1, \dots, Y_k)\]
Concatenation semantics (implementation-faithful):
- If parent tensors are \(Z_{p_r} \in \mathbb{R}^{n \times d_r}\), concatenation is along feature axis: \[Z_{\text{in}} \in \mathbb{R}^{n \times \sum_{r=1}^k d_r}.\]
- The concatenated tensor is passed through one random mechanism \(f(\cdot)\) with node output width target
\(d\) (the node
total_dim from Section 4.3), so: \[f: \mathbb{R}^{n \times \sum_r d_r} \rightarrow \mathbb{R}^{n \times d}.\]
- In this branch, aggregation kind is not used; Agg only applies in the non-concatenation branch.
Example:
- Three parents with shapes \((n,4)\), \((n,6)\), \((n,3)\) concatenate to \((n,13)\).
- If node target width is \(d=9\), the mechanism maps \((n,13)\rightarrow(n,9)\) before post-processing and converter slicing.
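The aggregation path can be sketched as a single reduction over a stacked parent axis; the function name and string kind labels are illustrative, not the runtime's.

```python
import numpy as np

def aggregate(parent_outputs, kind: str) -> np.ndarray:
    """Sketch of Section 4.2: each Y_r = f_r(Z_{p_r}) is (n, d)-aligned,
    and one sampled reduction is applied across the parent axis."""
    Y = np.stack(parent_outputs, axis=0)           # shape (k, n, d)
    if kind == "sum":
        return Y.sum(axis=0)
    if kind == "prod":
        return Y.prod(axis=0)
    if kind == "max":
        return Y.max(axis=0)
    if kind == "logsumexp":
        m = Y.max(axis=0)                          # stabilized LSE over parents
        return m + np.log(np.exp(Y - m).sum(axis=0))
    raise ValueError(kind)

rng = np.random.default_rng(0)
Y1, Y2, Y3 = (rng.standard_normal((4, 2)) for _ in range(3))
Z = aggregate([Y1, Y2, Y3], "logsumexp")
```

The stabilized form subtracts the elementwise max before exponentiating, which is why LSE "smoothly emphasizes larger parent responses" without overflowing.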
4.3 Latent post-processing and converter slicing
Primary Variable: \(Z_{\text{post}}\) (post-processed latent
tensor used for conversion).
Dependency Map: raw
node latent + sanitize/standardize operations + random weights \(w\) + normalization + padding noise +
slice/writeback updates \(\rightarrow
Z_{\text{post}}\).
Path to Final Output:
\(Z_{\text{post}} \rightarrow\)
per-spec extracted values \((v_s)\) and
rewritten latent slices -> assigned feature outputs plus one selected
target-node conversion -> final X/y.
Rationale. Sanitization, standardization, weighting, and norm scaling are applied before conversion to keep numerics stable across mechanism families. The slice-and-writeback loop formalizes how observable values are extracted while preserving latent continuity.
Current default semantic note. In the shipped default
prior, both feature converters and the target converter are attached to
sampled latent nodes. One selected latent node emits y
directly, and optional missingness acts later as an observation process
over emitted features only.
For each node, converter specs define required width:
\[d_{\text{req}} = \sum_s \max(1, d_s), \quad d = d_{\text{req}} + d_{\text{extra}}, \quad d_{\text{extra}} = \max(1, \lfloor \mathsf{LogUniform}(1, 32) \rfloor)\]
After mechanism output:
\[Z_j \leftarrow \mathsf{clip}(\mathsf{nan\_to\_num}(Z_j), -10^6, 10^6)\] \[Z_j \leftarrow \mathsf{standardize}(Z_j)\] \[Z_j \leftarrow Z_j \odot w^\top\] \[Z_j \leftarrow \frac{Z_j}{\max\left(\frac{1}{n} \sum_{r=1}^n |Z_{j,r:}|_2, 10^{-6}\right)}\]
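A minimal numpy sketch of this post-processing chain, assuming given column weights `w`; the small epsilon in the standardization step is an added safeguard not stated in the formulas, and the runtime operates on torch tensors rather than numpy arrays.

```python
import numpy as np

def postprocess(z, w):
    """Sketch of the post-mechanism chain: sanitize, standardize,
    reweight columns, then rescale by the mean row L2 norm."""
    z = np.clip(np.nan_to_num(z), -1e6, 1e6)              # sanitize
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-6)     # standardize
    z = z * w                                             # column weights
    row_norms = np.linalg.norm(z, axis=1)
    return z / max(row_norms.mean(), 1e-6)                # norm scaling
```

After the final division, the mean row L2 norm is exactly 1, which is the invariant the norm-scaling step establishes.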
Converter loop semantics:
- Slice per-spec view \(V_s\) from current cursor over columns.
- Pad missing columns with sampled noise if needed.
- Apply converter: \((X'_s, v_s) = \mathsf{Converter}_s(V_s)\).
- Write \(X'_s\) back into the same slice (shape-padded/truncated as needed).
- Store \(v_s\) as extracted observable value for the feature key.
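The loop above can be sketched as a cursor over columns. This is a simplified stand-in: identity converters replace the numeric/categorical converters of Section 6, the noise-padding step is omitted, and the `specs` structure (feature key to width/converter pair) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_converters(z, specs):
    """Sketch of the slice-and-writeback loop: each spec takes a view of
    `width` columns from a moving cursor, produces (x_out, value), and
    writes x_out back into the same slice."""
    z = z.copy()
    values, cursor = {}, 0
    for key, (width, convert) in specs.items():
        view = z[:, cursor:cursor + width]
        x_out, v = convert(view)
        z[:, cursor:cursor + width] = x_out   # writeback into same slice
        values[key] = v                        # extracted observable value
        cursor += width
    return z, values

identity = lambda view: (view, view[:, 0])
z = rng.normal(size=(10, 6))
z_post, vals = run_converters(z, {"f0": (2, identity), "f1": (3, identity)})
```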
Final node scaling:
\[Z_j \leftarrow s_j Z_j, \quad s_j \sim \mathsf{LogUniform}(0.1, 10)\]
5. Mechanism Family Definitions
Motivation This catalog is intentionally broad because each family occupies a qualitatively distinct region of function space. Narrowing the catalog narrows the prior:
- Remove trees → prior cannot produce sharp, axis-aligned decision boundaries
- Remove GP → prior cannot produce smooth periodic or multi-scale surfaces
- Remove quadratic → prior loses explicit pairwise feature interactions
- Remove product → prior loses multiplicative cross-family effects
The breadth of this catalog is what makes mechanism diversity a meaningful axis of effective diversity. Each family's parameterization (random weights, depths, kernel frequencies, etc.) adds within-family variation on top of the between-family structural differences. For a researcher evaluating whether to add or remove a family, the question is: does this family produce latent structure that no other family can approximate? If yes, removing it creates a coverage gap; if a proposed new family largely duplicates an existing one, adding it increases code surface without improving diversity.
Primary Variable: \(Y =
f(X)\) (family-specific mechanism output).
Dependency
Map: mechanism-family parameters + input \(X \rightarrow Y\).
Path to
Final Output: \(Y
\rightarrow\) node latent updates and feedback slices ->
converter-extracted observables -> final
X/y.
Rationale. The mechanism catalog intentionally spans
smooth, piecewise, assignment-based, and compositional transforms to
vary functional complexity under drift and mix controls. Each family
maps to a concrete sampler implementation in
random_functions.py.
Inputs to mechanism families are first sanitized and standardized.
5.1 Linear
Primary Variable: \(Y_{\text{linear}}\).
Dependency
Map: \(X, M \rightarrow
Y_{\text{linear}} = X M^\top\).
Path to Final
Output: \(Y_{\text{linear}}
\rightarrow\) latent columns for the node -> downstream
converter extraction -> numeric/categorical outputs in X
or y.
Rationale. Linear maps act as a low-complexity baseline and calibration point for drift against nonlinear families. Row-normalized random matrices preserve directional mixing while limiting runaway scale.
\[Y = X M^\top\]
where \(M\) is a sampled random matrix.
5.2 Quadratic
Primary Variable: \(Y_{\text{quad}}\).
Dependency
Map: \(X_{\text{sub}},
X_{\text{aug}}, M_t \rightarrow Y_{\text{quad}}\).
Path to Final Output: \(Y_{\text{quad}} \rightarrow\)
interaction-heavy latent signal -> converter extraction ->
nonlinear structure in emitted features and in any downstream target-node
ancestry.
Rationale. Quadratic forms add pairwise interactions not available in pure linear maps. Subspace capping and bias augmentation balance expressivity with numerical stability and compute control.
Let \(X_{\text{sub}}\) be either all columns or a random 20-column subset if input width exceeds 20. Augment with bias column: \(X_{\text{aug}} = [X_{\text{sub}}, \mathbf{1}]\). For each output channel \(t\):
\[Y_t = \mathsf{diag}\!\left(X_{\text{aug}} M_t X_{\text{aug}}^\top\right)\]
or equivalently per row: \(y_i = x_{\text{aug},i}^\top M_t\, x_{\text{aug},i}\) — a general (not necessarily symmetric) quadratic form. The bias augmentation means this captures quadratic + linear + constant terms.
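The per-row quadratic form can be evaluated for all rows at once with an `einsum` contraction; this sketch assumes numpy in place of the torch runtime and illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in = 64, 5

x = rng.normal(size=(n, d_in))
x_aug = np.concatenate([x, np.ones((n, 1))], axis=1)   # bias column

# One output channel: a general (not necessarily symmetric) quadratic
# form y_i = x_aug_i^T M x_aug_i, computed for all rows at once.
m = rng.normal(size=(d_in + 1, d_in + 1))
y = np.einsum("ni,ij,nj->n", x_aug, m, x_aug)

# Equivalent explicit evaluation for the first row.
y0 = x_aug[0] @ m @ x_aug[0]
```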
5.3 Neural network
Primary Variable: \(Y_{\text{nn}}\).
Dependency
Map: \(X\), sampled layer matrices \(\{M_\ell\}\), and sampled activations \(\{\phi_\ell\}\) \(\rightarrow Y_{\text{nn}}\).
Path to Final Output: \(Y_{\text{nn}} \rightarrow\) high-capacity
latent states -> converter extraction -> complex nonlinear
observables in X/y.
Rationale. Random-depth MLPs provide high-capacity nonlinear transforms with stochastic activation structure. Optional pre/post activation steps increase shape diversity while retaining a simple sampled-layer pipeline.
A random depth/width MLP with number of weight matrices sampled uniformly from \(\{1, 2, 3\}\) (0–2 hidden layers) and hidden width from \(\lfloor \mathsf{LogUniform}(1, 127) \rfloor\):
\[Y^{(0)} = X \text{ (optional pre-activation with prob 0.5)}\] \[Y^{(\ell+1)} = \phi_\ell(Y^{(\ell)} M_\ell^\top)\]
with activation after hidden layers, optional final activation (prob 0.5).
Activation sampler includes fixed and parametric families.
- Fixed activations: `tanh`, `leaky_relu`, `elu`, `identity`, `silu`, `swiglu`, `relu`, `relu_sq`, `softplus`, `sign`, `gauss`, `exp`, `sin`, `square`, `abs`, `softmax`, `logsigmoid`, `logabs`, `sigmoid`, `round`, `mod1`, `selu`, `relu6`, `hardtanh`, `indicator_01`, `onehot_argmax`, `argsort`, `rank`.
- Parametric families: `relu_pow`, `signed_pow`, `inv_pow`, `poly`, `gumbel_softmax`.
`gumbel_softmax` applies row-wise Gumbel perturbations followed by
temperature-scaled softmax. When activation standardization is enabled later in
the pipeline, those simplex-like row sums are no longer guaranteed to remain
exactly 1 in the emitted latent state.
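A reduced sketch of the sampled-layer pipeline: the pre/post-activation coin flips and the full activation catalog are omitted, `tanh` stands in for a sampled activation, the output width is an illustrative constant, and numpy replaces the torch runtime.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp(x, n_layers, hidden, act=np.tanh):
    """n_layers weight matrices (so n_layers - 1 hidden layers),
    with the activation applied between successive matrices."""
    d_out = 3
    widths = [x.shape[1]] + [hidden] * (n_layers - 1) + [d_out]
    y = x
    for ell in range(n_layers):
        m = rng.normal(size=(widths[ell + 1], widths[ell]))
        y = y @ m.T
        if ell < n_layers - 1:   # activation after hidden layers only
            y = act(y)
    return y

y = random_mlp(rng.normal(size=(32, 4)), n_layers=3, hidden=8)
```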
5.4 Tree ensemble (oblivious decision trees)
Primary Variable: \(Y_{\text{tree}}\).
Dependency
Map: split features, thresholds, leaf indices, and leaf values
\(\rightarrow Y_{\text{tree}}\).
Path to Final Output: \(Y_{\text{tree}} \rightarrow\) piecewise
latent regions -> converter extraction -> discontinuous/tabular
decision patterns in X/y.
Rationale. Oblivious trees introduce thresholded piecewise behavior that complements smooth families. Variance-weighted split sampling biases random partitions toward informative dimensions without requiring supervised fitting.
Number of trees: \(T = \max(1, \lfloor \mathsf{LogUniform}(1, 32) \rfloor)\).
For each tree \(t\):
- Depth \(d_t \sim \mathsf{UniformInt}\{1, \dots, 7\}\).
- Sample split feature indices (variance-weighted probabilities when available).
- Sample thresholds by indexing sampled training rows.
- Compute leaf index via packed split bits.
- Sample leaf values \(V_t \in \mathbb{R}^{2^{d_t} \times d_{\text{out}}}\) from noise sampler.
Prediction is averaged:
\[Y = \frac{1}{T} \sum_{t=1}^T V_t[\mathsf{leaf}(X)]\]
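A single oblivious tree can be sketched as follows; uniform split-feature sampling stands in for the variance-weighted probabilities, one tree stands in for the averaged ensemble, and a plain Gaussian draw stands in for the noise sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, depth = 200, 6, 2, 3

x = rng.normal(size=(n, d_in))

# Oblivious structure: each depth level applies one (feature, threshold)
# split to every row; the per-level split bits pack into a leaf index.
feats = rng.integers(0, d_in, size=depth)
thresholds = x[rng.integers(0, n, size=depth), feats]  # index sampled rows
bits = (x[:, feats] > thresholds).astype(int)          # (n, depth)
leaf = bits @ (2 ** np.arange(depth))                  # packed bit index

leaf_values = rng.normal(size=(2 ** depth, d_out))     # sampled leaf values
y = leaf_values[leaf]                                  # (n, d_out) lookup
```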
5.5 Discretization
Primary Variable: \(Y_{\text{disc}}\).
Dependency
Map: sampled centers + distance exponent \(p\) + nearest-center index + linear map
\(\rightarrow Y_{\text{disc}}\).
Path to Final Output: \(Y_{\text{disc}} \rightarrow\) clustered
latent representation -> converter extraction ->
quantized/cluster-like observables in
X/y.
Rationale. Nearest-center assignment creates clustered discontinuities and quantization effects that smooth mechanisms do not capture. A final linear map preserves shared output-shape contracts across families.
Number of centers: \(K = \mathsf{clamp}(\lfloor \mathsf{LogUniform}(2, 128) \rfloor, 2, n)\).
Sample centers from rows, compute \(L_p\)-style distances (with sampled \(p \sim \mathsf{LogUniform}(0.5, 4)\)), assign nearest center, then apply linear map:
\[k^{\ast}(i) = \arg\min_k \sum_q |X_{iq} - c_{kq}|^p, \quad Y = \mathsf{Linear}(c_{k^{\ast}(i)})\]
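A numpy sketch of the assignment step, with illustrative values for \(K\), \(p\), and the output width, and a plain random matrix standing in for the final \(\mathsf{Linear}\) map.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, k, p = 100, 4, 8, 1.5

x = rng.normal(size=(n, d_in))
centers = x[rng.choice(n, size=k, replace=False)]   # centers drawn from rows

# L_p-style distances to each center, then nearest-center assignment.
dists = (np.abs(x[:, None, :] - centers[None, :, :]) ** p).sum(axis=2)
assign = dists.argmin(axis=1)                       # (n,) center indices

# Final linear map applied to each row's assigned center coordinates.
m = rng.normal(size=(3, d_in))
y = centers[assign] @ m.T                           # (n, 3)
```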
5.6 GP/RFF approximation
Primary Variable: \(Y_{\text{gp}}\).
Dependency
Map: \(X_{\text{proj}}, \Omega, b, V
\rightarrow \Phi \rightarrow Y_{\text{gp}}\).
Path to
Final Output: \(Y_{\text{gp}}
\rightarrow\) smooth kernel-like latent behavior -> converter
extraction -> continuous nonlinear structure in
X/y.
Rationale. Random Fourier features approximate
kernel-style smooth nonlinearities at fixed computational cost. The widened
gp family now keeps one output interface while varying both the
frequency-sampling branch and the internal spectral variant, which broadens
the kinds of smooth or oscillatory behavior that can appear without adding a
new public config surface.
Uses \(P=256\) random features and two branches for sampling frequency matrix \(\Omega\). Core output:
\[\Phi = \cos(X_{\text{proj}} \Omega^\top + b), \quad Y = \frac{1}{\sqrt{P}} \Phi V^\top\]
where \(b\) is random phase and \(V\) is sampled from noise.
Variant-specific widening keeps the same readout contract:
- `standard`: uses the base branch-specific frequency draw with the same cosine readout shown above.
- `multiscale`: rescales one half of the random features toward a lower spectral band and the other half toward a higher spectral band before the cosine readout, increasing multi-resolution behavior.
- `periodic`: applies sampled integer harmonics \(h_p \in \{1,\dots,5\}\) to the phase logits and uses a mixed cosine/sine readout \[\Phi_p = \frac{\cos(h_p \theta_p + b_p) + \sin(h_p \theta_p - b_p)}{\sqrt{2}}\] to bias the latent toward more explicitly oscillatory structure.
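The standard variant's core readout follows directly from the formula. In this sketch, numpy stands in for the runtime and a plain Gaussian draw replaces the branch-specific frequency sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, P = 128, 3, 2, 256

x = rng.normal(size=(n, d_in))

# Random Fourier features: cosine of random projections plus random
# phase, then a linear readout scaled by 1/sqrt(P).
omega = rng.normal(size=(P, d_in))         # frequency matrix
b = rng.uniform(0, 2 * np.pi, size=P)      # random phase
phi = np.cos(x @ omega.T + b)              # (n, P) bounded features

v = rng.normal(size=(d_out, P))            # readout sampled from noise
y = (phi @ v.T) / np.sqrt(P)               # (n, d_out)
```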
5.7 EM-style assignment
Primary Variable: \(Y_{\text{em}}\).
Dependency
Map: centers \(c_k\), scales
\(\sigma_k\), exponents \((p,q)\), and assignment probabilities \(P_{ik} \rightarrow Y_{\text{em}}\).
Path to Final Output: \(Y_{\text{em}} \rightarrow\) mixture-like
local latent effects -> converter extraction -> locally structured
observables in X/y.
Rationale. Soft center assignments add mixture-like local behavior and scale-sensitive responses. The linear readout over assignment probabilities keeps compatibility with common downstream handling.
Sample centers \(c_k\), scales \(\sigma_k\), and exponents \(p, q\):
- Centers: \(K = \max(2, \lfloor \mathsf{LogUniform}(2, \max(16, 2 d_{\text{out}})) \rfloor)\)
- Scales: \(\sigma_k = \exp(\text{noise} \cdot 0.1)\) (log-normal, concentrated near 1)
- Exponents: \(p \sim \mathsf{LogUniform}(1, 4),\; q \sim \mathsf{LogUniform}(1, 2)\)
\[d_{ik} = \left(\sum_q |X_{iq} - c_{kq}|^p\right)^{1/p}\] \[\mathsf{logit}_{ik} = -\frac{1}{2} \log(2\pi\sigma_k^2) - \left(\frac{d_{ik}}{\max(\sigma_k, 10^{-6})}\right)^q\] \[P_{ik} = \mathsf{softmax}_k(\mathsf{logit}_{ik}), \quad Y = \mathsf{Linear}(P)\]
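The three equations above can be sketched as follows, with illustrative shapes, a max-shifted softmax for numerical stability, and numpy in place of the torch runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, K, p, q = 80, 3, 5, 2.0, 1.5

x = rng.normal(size=(n, d_in))
centers = rng.normal(size=(K, d_in))
sigma = np.exp(rng.normal(size=K) * 0.1)   # scales concentrated near 1

# L_p distances, Gaussian-like logits, then a row-wise softmax.
d_ik = (np.abs(x[:, None, :] - centers[None]) ** p).sum(axis=2) ** (1 / p)
logits = -0.5 * np.log(2 * np.pi * sigma ** 2) \
         - (d_ik / np.maximum(sigma, 1e-6)) ** q
logits -= logits.max(axis=1, keepdims=True)   # stable softmax shift
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

m = rng.normal(size=(2, K))                   # linear readout over P
y = probs @ m.T
```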
5.8 Product family
Primary Variable: \(Y_{\text{prod}}\).
Dependency
Map: component family draws \(f(X)\) and \(g(X)\) from allowed support \(\rightarrow Y_{\text{prod}} = f(X)\odot
g(X)\).
Path to Final Output: \(Y_{\text{prod}} \rightarrow\)
multiplicative latent interactions -> converter extraction ->
higher-order effects in final X/y.
Rationale. Multiplying two component mechanisms introduces multiplicative interactions and sharper nonlinear responses than additive composition alone. Restricting component candidates to vetted families maintains stability under family-mix masking.
Pick two component families \(f, g\) from the allowed set (tree, discretization, gp, linear, quadratic), filtered by the enabled family mix, then:
\[Y = f(X) \odot g(X)\]
6. Converter Definitions
Motivation Converters are the emission boundary — they translate latent causal state into the tabular features a downstream model actually sees. Even if the latent mechanisms are diverse, monotonous converters produce monotonous observed distributions:
```
latent Z ──→ numeric converter ──→ observed feature column
             ├─ identity pass-through  → Gaussian-looking marginals
             ├─ Kumaraswamy warp       → skewed / beta-like marginals
             └─ power / log transform  → heavy-tailed or compressed marginals

latent Z ──→ categorical converter ──→ observed class labels
             ├─ argmax on logits       → hard cluster boundaries
             ├─ softmax + Categorical  → soft probabilistic assignment
             └─ k-means style centers  → Voronoi-like class regions
```
The numeric converter's optional Kumaraswamy warp varies marginal distributions beyond the standard Gaussian shape; the categorical converter's multiple method/variant combinations produce diverse class-assignment patterns. Without converter diversity, even mechanistically diverse latent signals could produce tabular data where every numeric column looks Gaussian and every categorical column uses the same assignment logic — a subtle effective-diversity gap that is easy to overlook.
Current default semantic note. The default target path
attaches a target converter to one selected latent DAG node rather than
sampling y from a separate post-feature head.
Primary Variable: \((X'_s, v_s)\) (converter output state
and extracted observable value for spec \(s\)).
Dependency Map:
latent slice, converter kind, and converter parameters \(\rightarrow (X'_s, v_s)\).
Path to Final Output: extracted \(v_s\) values are assigned to feature
slots and become emitted feature values; one selected latent node emits
y through its target converter; later missingness may censor
emitted X; \(X'_s\) is written back to latent for
downstream coupling.
Rationale. Converters are separated from mechanism families because observable typing is an output-contract concern, not a latent-mechanism concern. This decoupling lets one latent pipeline serve multiple feature schemas.
6.1 Numeric converter
Primary Variable: \((X'_{\text{num}},
v_{\text{num}})\).
Dependency Map: input
slice \(X\), warp parameters \((a,b)\), and identity/warp branch choice
\(\rightarrow (X'_{\text{num}},
v_{\text{num}})\).
Path to Final Output:
\(v_{\text{num}}\) is emitted as
numeric feature value in X; \(X'_{\text{num}}\) feeds back into node
latent state.
Rationale. The identity branch preserves raw latent structure when extra warping is unnecessary. The optional bounded Kumaraswamy-like warp provides controllable marginal shaping while staying numerically stable.
Given input slice \(X\) and raw extracted scalar \(v = X_{:,0}\):
- with probability 0.5: identity output (no warp),
- otherwise sample \(a,b \sim \mathsf{LogUniform}(0.2,5)\) and apply columnwise min-max + warp:
\[\hat{X} = \frac{X - X_{\min}}{\max(X_{\max} - X_{\min}, 10^{-6})}\] \[X' = 1 - \left(1 - \mathsf{clip}(\hat{X}, 0, 1)^a\right)^b\]
Converter returns \((X', v)\).
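A sketch of the two branches, assuming numpy and a `Generator`-based coin flip in place of the runtime's samplers; the \(\mathsf{LogUniform}\) draw is realized as the exponential of a uniform draw in log space.

```python
import numpy as np

rng = np.random.default_rng(0)

def numeric_converter(x, rng):
    """Sketch of the numeric converter: extract v from column 0, then
    with probability 0.5 return the slice unchanged, otherwise apply a
    columnwise min-max followed by a Kumaraswamy-like warp."""
    v = x[:, 0].copy()
    if rng.random() < 0.5:
        return x, v                                  # identity branch
    a, b = np.exp(rng.uniform(np.log(0.2), np.log(5), size=2))
    x_hat = (x - x.min(axis=0)) / np.maximum(x.max(axis=0) - x.min(axis=0), 1e-6)
    x_out = 1 - (1 - np.clip(x_hat, 0, 1) ** a) ** b  # bounded warp in [0, 1]
    return x_out, v

x_out, v = numeric_converter(rng.normal(size=(50, 3)), rng)
```

Note that `v` is extracted before any warping, so the emitted observable reflects the raw latent column while the written-back slice may be warped.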
6.2 Categorical converter
Primary Variable: \((X'_{\text{cat}}, \ell)\) where \(\ell\) is categorical label output.
Dependency Map: method/variant choice, cardinality
\(C\), centers or logits/probabilities
\(\rightarrow (X'_{\text{cat}},
\ell)\).
Path to Final Output: \(\ell\) is emitted as categorical
feature/class target in X/y; \(X'_{\text{cat}}\) is fed back into
latent columns.
Rationale. Joint method-variant sampling separates
class assignment logic from latent representation feedback into node
state. This increases categorical behavior diversity while preserving a
consistent (out, labels) converter contract.
A joint (method, variant) pair is sampled from:
- `neighbor`: `input`, `index_repeat`, `center`, `center_random_fn`
- `softmax`: `input`, `index_repeat`, `softmax_points`
Let target cardinality be \(C = \max(2, n_{\text{categories}})\).
Neighbor method:
\[\text{label}_i = \arg\min_k \sum_q |X_{iq} - c_{kq}|^p, \quad p \sim \mathsf{LogUniform}(0.5, 4)\]
Softmax method:
\[L = a \mathsf{standardize}(X_{\text{proj}}) + b, \quad a \sim \mathsf{LogUniform}(0.1, 10), \quad b_k = \ln(u_k + 10^{-4})\] \[\text{label}_i \sim \mathsf{Categorical}(\mathsf{softmax}(L_i))\]
Output representation \(X'\) is variant-dependent (input copy, repeated index, center lookup, center transformed by random function, or random softmax points). Labels are finally reduced modulo \(C\).
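The softmax method can be sketched as follows; the projection `x[:, :C]` is an illustrative stand-in for however the runtime forms \(X_{\text{proj}}\), and numpy replaces the torch samplers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, C = 60, 4, 3

x = rng.normal(size=(n, d_in))

# Softmax method: scaled standardized projection plus a log-weight
# offset, then one categorical draw per row, reduced modulo C.
x_proj = x[:, :C]                                    # illustrative projection
z = (x_proj - x_proj.mean(axis=0)) / (x_proj.std(axis=0) + 1e-6)
a = np.exp(rng.uniform(np.log(0.1), np.log(10)))     # LogUniform(0.1, 10)
b = np.log(rng.uniform(size=C) + 1e-4)               # b_k = ln(u_k + 1e-4)
logits = a * z + b

probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(C, p=p_i) for p_i in probs]) % C
```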
7. Noise Sampling and Runtime Selection
Motivation Noise is the stochastic axis of the prior, independent of mechanism complexity. Restricting noise to Gaussian produces datasets whose residuals always have light, symmetric tails — but real tabular residuals are often heavy-tailed (financial returns), asymmetric (truncated sensor readings), or heteroscedastic. The three primitive families span a tail-weight gradient:
```
Gaussian  → light tails,    exp(-x²/2)           → most benign residual regime
Laplace   → moderate tails, exp(-|x|)            → double-exponential, sharper peak
Student-t → heavy tails,    (1+x²/ν)^(-(ν+1)/2)  → occasional extreme outliers (ν > 2)
mixture   → per-dataset draw from the above      → corpus spans all three regimes
```
The mixture mode resolves one family per dataset attempt
(Section 7.2), so a corpus generated with mixture naturally
contains Gaussian-noise datasets, Laplace-noise datasets, and Student-t-noise
datasets — broadening the noise axis of effective diversity without requiring
separate generate runs. Noise diversity in the prior has been shown to improve
performance on real datasets with non-Gaussian residual structure.
Primary Variable: \(\epsilon\) (active noise draw process for
the attempt).
Dependency Map: noise family, scale,
and family-specific parameters \(\rightarrow
\epsilon\).
Path to Final Output: \(\epsilon\) perturbs points, matrices,
weights, and explicit noise draws -> latent variability and converter
inputs -> final X/y dispersion and noise
metadata.
Rationale. Noise is modeled as its own axis so distributional perturbation can be controlled independently from mechanism complexity. Distinguishing primitive samplers from runtime selection clarifies where stochasticity enters.
7.1 Primitive family samplers
Primary Variable: \(\epsilon_{\text{family}}\) (draws from one
resolved primitive family).
Dependency Map:
selected primitive family + shape + scale (+ df or mixture weights when
applicable) \(\rightarrow
\epsilon_{\text{family}}\).
Path to Final
Output: \(\epsilon_{\text{family}}\) is used by
samplers/mechanisms across generation -> affects latent tensors and
emitted X/y.
Rationale. Multiple primitive families cover light-tail, double-exponential, and heavy-tail regimes under one API surface. Explicit formulas keep distributional assumptions auditable across backends.
For \(\mathsf{sample\_noise}(\text{shape, family, scale, }\dots)\):
- Gaussian: i.i.d. \(\mathcal{N}(0, 1)\)
- Laplace (inverse CDF): \[U \sim \mathsf{Uniform}(\epsilon, 1-\epsilon), \quad X = \begin{cases} \ln(2U), & U < 0.5 \\ -\ln(2(1-U)), & U \ge 0.5 \end{cases}\]
- Student-t: \[X = \frac{Z}{\sqrt{\chi^2_\nu / \nu}}, \quad Z \sim \mathcal{N}(0, 1), \quad \nu > 2\]
- Mixture: sample a component assignment per element from normalized weights over `{gaussian, laplace, student_t}` and draw from that component.

All outputs are scaled by `scale`.
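The three primitive samplers can be sketched from the formulas above, with numpy in place of the runtime backends; `eps = 1e-10` is an illustrative choice for the Laplace bound \(\epsilon\), and `nu` is an illustrative default for the Student-t degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(shape, family, scale=1.0, nu=5.0, rng=rng):
    """Sketch of the primitive noise samplers."""
    if family == "gaussian":
        x = rng.normal(size=shape)
    elif family == "laplace":
        # Inverse-CDF construction matching the piecewise formula.
        eps = 1e-10
        u = rng.uniform(eps, 1 - eps, size=shape)
        x = np.where(u < 0.5, np.log(2 * u), -np.log(2 * (1 - u)))
    elif family == "student_t":
        # Ratio construction: Z / sqrt(chi^2_nu / nu).
        z = rng.normal(size=shape)
        chi2 = rng.chisquare(nu, size=shape)
        x = z / np.sqrt(chi2 / nu)
    else:
        raise ValueError(family)
    return x * scale
```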
7.2 Dataset-level noise-family resolution
Primary Variable: noise_spec (resolved
attempt-level noise sampling specification).
Dependency
Map: requested family, mixture weights, and deterministic seed
stream \(\rightarrow\) runtime
selection \(\rightarrow\)
noise_spec.
Path to Final Output:
noise_spec parameterizes all downstream noise draws in the
attempt -> dataset-level distributional character in final
X/y plus noise-distribution metadata.
Rationale. In mixture mode, sampling one family per dataset attempt preserves coherent dataset-level identity instead of per-element family switching. Seeded selection guarantees reproducibility and metadata traceability.
Generation runtime resolves one NoiseRuntimeSelection
per dataset attempt:
- If the requested family is not `mixture`: the sampled family equals the requested family.
- If the requested family is `mixture`:
  - Normalize the configured mixture weights.
  - Use the deterministic keyed RNG stream `KeyedRng(run_seed).keyed("dataset", i, "noise_runtime", "family_selection").torch_rng(...)`.
  - Sample exactly one component family for the dataset.
The resulting `noise_spec` uses that sampled family and base scale for all downstream draws in the attempt. Shift noise drift applies through the multiplicative scale factor \(\gamma_\sigma\).
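The per-attempt resolution can be sketched as follows; a seeded numpy `Generator` stands in for the `KeyedRng` torch stream, and the weight-dictionary shape is illustrative.

```python
import numpy as np

def resolve_noise_family(requested, weights, seed):
    """Sketch of per-attempt family resolution: non-mixture requests
    pass through unchanged; mixture requests draw exactly one component
    family from normalized weights using a deterministic seeded stream."""
    if requested != "mixture":
        return requested
    families = ["gaussian", "laplace", "student_t"]
    w = np.asarray([weights[f] for f in families], dtype=float)
    w /= w.sum()                        # normalize configured weights
    rng = np.random.default_rng(seed)   # keyed-stream stand-in
    return rng.choice(families, p=w)

fam = resolve_noise_family(
    "mixture", {"gaussian": 2, "laplace": 1, "student_t": 1}, seed=7
)
```

Because the stream is seeded per attempt, the same run seed always resolves the same family, which is what makes the noise-distribution metadata traceable.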