Dagzoo Transform Reference
This document is the formal reference for the generation transforms used by
dagzoo. The equations are written to match the current runtime in
src/dagzoo.
Using This Reference
Use this page when you want the mathematical definitions behind the
generator. Each section covers one part of the runtime: DAG sampling, shift,
mechanism families, node execution, converters, and noise. Together they
describe how dagzoo maps a latent DAG into emitted tabular
features, targets, and metadata.
- Section 1 (DAG structure) — topology of causal dependencies
- Section 2 (Shift parameters) — train/test distribution drift
- Section 3 (Mechanism families) — functional complexity of variable relationships
- Section 4 (Node pipeline) — execution mechanics from latent to converter-ready state
- Section 5 (Mechanism definitions) — catalog of functional transform families
- Section 6 (Converters) — mapping from latent state to observed tabular features
- Section 7 (Noise) — stochastic variation character
The sections are organized so you can either read end to end or jump directly to the transform family you need. When you are checking a config or a runtime behavior, the practical question is: which transform changes, and how does that change show up in the emitted dataset?
Notation and Symbols
Primary Variable: Symbol map (this section's
notation table).
Dependency Map: all symbols used
in later equations -> definitions in this table.
Path to
Final Output: unambiguous symbols -> unambiguous equations
-> correct interpretation of how transforms produce X,
y, and metadata.
Rationale. This table fixes symbol meanings up front so equations stay unambiguous across sections. Symbol names mirror runtime naming in core modules, making the spec easier to trace back to code.
| Symbol | Meaning |
|---|---|
| \(N\) | Number of latent DAG nodes. |
| \(n\) | Number of rows sampled for one dataset attempt (\(n = n_{\text{train}} + n_{\text{test}}\)). |
| \(G \in \{0,1\}^{N \times N}\) | DAG adjacency matrix with convention G[src, dst]. |
| \(i, j\) | Node indices in \(\{0,\dots,N-1\}\). |
| \(\mathsf{pa}(j)\) | Parent index set for node \(j\): \(\{i : G_{ij}=1\}\). |
| \(\sigma(x)\) | Logistic sigmoid \((1 + e^{-x})^{-1}\). |
| \(A, B_i, C_j\) | DAG latent logits: \(A\) is global edge propensity, \(B_i\) is source-node effect, \(C_j\) is destination-node effect. |
| \(\beta_{\text{edge}}\) | Edge logit bias applied in DAG sampling. |
| \(\tau\) | Mechanism logit tilt (mechanism_logit_tilt). |
| \(\lambda_f\) | Raw mechanism family weight for family \(f\) from mechanism.function_family_mix. |
| \(\pi_f^{(0)}\) | Baseline mechanism-family probability before logit tilt. |
| \(\pi_f\) | Final sampling probability for mechanism family \(f\). |
| \(\ell_f\) | Family base logit for family \(f\) (from MECHANISM_FAMILY_BASE_LOGITS). |
| \(\tilde{\ell}_f\) | Centered family base logit: \(\ell_f - \frac{1}{\lvert\mathcal{F}\rvert}\sum_{g\in\mathcal{F}}\ell_g\). |
| \(\mathcal{F}\) | Ordered mechanism family set: (nn, tree, discretization, gp, linear, quadratic, em, product, piecewise). |
| \(Z_j \in \mathbb{R}^{n \times d}\) | Node-\(j\) latent tensor after node pipeline transforms. |
| \(d_{\text{req}}\) | Required latent width from converter specs: \(\sum_s \max(1, d_s)\). |
| \(d_{\text{extra}}\) | Additional sampled latent width: \(\max(1, \lfloor \mathsf{LogUniform}(1,32) \rfloor)\). |
| \(d\) | Node latent width used for mechanism output: \(d = d_{\text{req}} + d_{\text{extra}}\). |
| \(d_s\) | Width requested by converter spec \(s\) before max-with-1 clamp. |
| \(w \in \mathbb{R}^d\) | Positive random weight vector (normalized to sum 1, then permuted). |
| \(X\) | Generic mechanism input matrix. |
| \(Y\) | Generic mechanism output matrix. |
| \(p_{ij}\) | DAG edge probability for ordered pair \((i,j)\) with \(i < j\). |
| \(\gamma_\sigma\) | Noise standard-deviation multiplier from shift runtime (variance_sigma_multiplier). |
| \(\gamma_{\text{var}}\) | Noise variance multiplier: \(\gamma_{\text{var}} = \gamma_\sigma^2\). |
| \(\mathcal{D}_{\text{noise}}\) | Active runtime noise family (gaussian, laplace, or student_t after runtime selection). |
| \(C\) | Categorical cardinality for a converter target. |
| \(\mathsf{LSE}\) | LogSumExp along parent axis. |
| \(T\) | Number of trees in a sampled tree ensemble. |
| \(P\) | Number of random features in GP approximation (fixed at 256). |
| \(s_j\) | Final node-level multiplicative latent scale. |
| \(\mathcal{O}\) | Final attempt output tuple: \((X, y, \text{metadata})\). |
Operator Reference
| Operator | Definition | Input/Output Shape Semantics | Where Used | Output Impact |
|---|---|---|---|---|
| \(\mathsf{pa}(j)\) | Parent index set of node \(j\): \(\{i: G_{ij}=1\}\). | Input: scalar node index \(j\) and adjacency \(G\); Output: set of parent indices. | Section 1 and Section 4. | Determines which upstream node tensors are composed for node \(j\). |
| \(\mathsf{Agg}(Y_1,\dots,Y_k)\) | Multi-parent aggregation operator sampled from \(\{\sum,\prod,\max,\mathsf{LSE}\}\). | Input: parent-aligned tensors \((n \times d)\); Output: one \((n \times d)\) tensor. | Section 4.2. | Changes how parent effects combine before downstream mechanism/converter steps. |
| \(\mathsf{LSE}(Y_1,\dots,Y_k)\) | LogSumExp over parent axis: \(\log\sum_r e^{Y_r}\). | Input: stacked parent outputs; Output: aggregated tensor \((n \times d)\). | Section 4.2. | Smoothly emphasizes larger parent responses while retaining contributions from others. |
| \(\arg\min_k\) | Index of minimum value over candidate index \(k\). | Input: candidate scores/distances; Output: integer index per row. | Sections 5.5 and 6.2. | Produces center assignments that drive discrete or categorical structure. |
| \(\sum,\prod,\max\) | Sum/product/max reductions over specified axis. | Input: tensors with reduction axis; Output: reduced tensor. | Sections 4.2, 4.3, 5.2, 5.4, 5.7. | Controls interaction form (additive, multiplicative, extremal) in latent construction. |
| \(\mathsf{Converter}_s(V_s)\) | Converter function for spec \(s\), returning transformed slice and extracted value. | Input: latent view \(V_s\); Output: \((X'_s, v_s)\). | Section 4.3 and Section 6. | Directly emits observed feature values that become the final complete-data X; the target is then generated by the conditional complete-data head, and optional missingness may later mask the emitted features. |
| \(\mathsf{Linear}(\cdot)\) | Random linear map to the configured output width. | Input: matrix \((n \times d_{in})\); Output: matrix \((n \times d_{out})\). | Sections 5.5 and 5.7. | Projects intermediate representations into node latent channels used by converters. |
| \(\mathsf{leaf}(X)\) | Oblivious-tree leaf index determined by sampled splits/thresholds. | Input: row vector(s) and tree splits; Output: leaf id(s). | Section 5.4. | Selects leaf values that define piecewise latent behavior. |
| \(\mathsf{logit}_{ik}\) | Pre-softmax score for row \(i\), component/class \(k\). | Input: row-to-center distances and scale terms; Output: score matrix. | Section 5.7. | Determines assignment probabilities that shape EM-style latent outputs. |
| \(\mathsf{softmax}_k(\cdot)\) | Normalize exponentiated logits over index \(k\) to probabilities. | Input: logits per row; Output: probabilities summing to 1 across \(k\). | Sections 5.7 and 6.2. | Converts scores into stochastic assignments that drive categorical/mixture behavior. |
| \(\mathsf{sample\_noise}(\cdot)\) | Noise sampler with configurable family and scale. | Input: shape + family params; Output: random tensor with requested shape. | Section 7.1. | Injects stochastic variation into points, matrices, weights, and node outputs. |
| \(\mathsf{clip}(x,a,b)\) | Clamp each entry of \(x\) into \([a,b]\). | Input: tensor; Output: tensor with same shape. | Section 4.3. | Prevents extreme values from destabilizing downstream standardization/conversion. |
| \(\mathsf{nan\_to\_num}(x)\) | Replace NaN/Inf entries with finite values. | Input: tensor; Output: finite tensor with same shape. | Section 4.3. | Stabilizes latent values before weighting and conversion. |
| \(\mathsf{standardize}(x)\) | Column-wise centering and scaling to unit variance (with epsilon guard). | Input: matrix \((n \times d)\); Output: matrix \((n \times d)\). | Sections 4.3 and 6.2. | Controls scale so mechanism outputs/converters are numerically comparable. |
| \(\mathsf{Cauchy}(0,1)\) | Standard Cauchy distribution. | Input: location/scale; Output: scalar or tensor draws. | Section 1. | Provides heavy-tailed latent logits for heterogeneous graph connectivity. |
| \(\mathsf{Bernoulli}(p)\) | Bernoulli draw with success probability \(p\). | Input: probability \(p \in [0,1]\); Output: binary draw. | Section 1. | Converts edge probabilities into realized adjacency \(G\). |
| \(\mathsf{Uniform}(a,b)\) | Continuous uniform distribution on \([a,b]\). | Input: bounds \((a,b)\); Output: scalar/tensor draws. | Sections 4.1 and 7.1. | Drives root sampling and inverse-CDF noise construction. |
| \(\mathsf{UniformInt}\{m,\dots,n\}\) | Discrete uniform integer sampler on inclusive range. | Input: integer bounds; Output: integer draws. | Section 5.4. | Samples tree depths and related discrete structure choices. |
| \(\mathsf{LogUniform}(a,b)\) | Distribution with \(\log x\) uniform on \([\log a,\log b]\). | Input: positive bounds \((a,b)\); Output: positive scalar draws. | Sections 4.3, 5.5, 6.1, 6.2. | Controls scales/exponents over multiple orders of magnitude. |
| \(\mathsf{Categorical}(p)\) | Discrete draw from class probabilities \(p\). | Input: probability vector per row; Output: class/label index. | Section 6.2. | Emits categorical feature labels that populate final observed features. |
Conventions:
- Unless noted otherwise, operations are row-wise over batch dimension \(n\) and preserve row count.
- In \(\mathsf{Agg}\), the aggregation kind is sampled once per multi-parent composition call and then applied elementwise across parents.
- In noise sampling, the mixture family in Section 7.1 samples component assignments per element, while Section 7.2 resolves one family per dataset attempt before downstream draws.
End-to-End Primary Variable Map
Primary Variable: \(\mathcal{O} = (X, y,
\text{metadata})\).
Dependency Map: \(s_{\text{shift}}, A, B_i, C_j, \lambda_f,
\tilde{\ell}_f, \text{noise\_spec}, \epsilon, G, \pi_f, Z_{\text{base}},
Z_{\text{comp}}, Y, Z_{\text{post}}, (X'_s, v_s) \rightarrow
\mathcal{O}\).
Path to Final Output: shift
runtime values define graph, mechanism, and noise controls; these drive
node latents and converter emissions that become \((X,y)\), while runtime selections and
graph/mechanism/noise summaries become metadata in \(\mathcal{O}\).
Rationale. This section is the single whole-picture view that ties section-level primary variables into one output contract. It complements the per-section formal details by making global dataflow explicit.
Pipeline skeleton:
\[s_{\text{shift}} \rightarrow (\beta_{\text{edge}}, \tau, \gamma_{\sigma}, \gamma_{\text{var}})\]
\[\left(\beta_{\text{edge}}, A, B_i, C_j\right) \rightarrow G\]
\[\left(\lambda_f, \tau, \tilde{\ell}_f\right) \rightarrow \pi_f\]
\[\text{noise\_spec} \rightarrow \epsilon_{\text{family}} \rightarrow \epsilon\]
\[\left(G, \pi_f, \epsilon, Z_{\text{base}}, Z_{\text{comp}}, Y, Z_{\text{post}}\right) \rightarrow \{Z_j\}_{j=0}^{N-1}\]
\[\left(\{Z_j\}, \mathsf{Converter}_s\right) \rightarrow \{(X'_s, v_s)\}_s \rightarrow (X, y)\]
\[\left(G, \pi_f, \text{noise\_spec}, s_{\text{shift}}\right) \rightarrow \text{metadata}\]
\[\mathcal{O} = (X, y, \text{metadata})\]
Operator role notes:
- \(\mathsf{Agg}\) is the bridge from multiple parent tensors to \(Z_{\text{comp}}\) (Section 4.2).
- \(\mathsf{Converter}_s\) is the bridge from latent state to emitted observable values (Section 4.3 and Section 6).
Primary variable crosswalk:
| Section | Primary Variable | Immediate Inputs | Immediate Output | Contribution to \(\mathcal{O}\) |
|---|---|---|---|---|
| 1. DAG Structure Sampling | \(G\) | \(A, B_i, C_j, \beta_{\text{edge}}\) | Parent sets \(\mathsf{pa}(j)\) | Controls execution order and parent flow that shapes \((X,y)\) and graph metadata. |
| 2. Shift Runtime Parameters | \(s_{\text{shift}}\) | Resolved shift config | \((\beta_{\text{edge}}, \tau, \gamma_{\sigma}, \gamma_{\text{var}})\) | Couples drift controls into graph density, mechanism usage, and noise magnitude in outputs/metadata. |
| 3. Mechanism Family Sampling | \(\pi_f\) | \(\lambda_f, \tau, \tilde{\ell}_f, \mathcal{F}\) | Family draw distribution per mechanism call | Determines which transform family generates latent structure before conversion to \((X,y)\). |
| 4. Node Generation Pipeline | \(Z_j\) | Parent/root inputs, width terms, sampled mechanisms, weights, scale | Node-level latent tensor | Shared latent state from which converter slices emit observed features/targets. |
| 4.1 Root-node base sampling | \(Z_{\text{base}}\) | Base-kind selection, distribution params, \(\mathcal{D}_{\text{noise}}\) | Initial latent input for root mechanisms | Seeds root latent variability that propagates to downstream nodes and final observables. |
| 4.2 Parent composition | \(Z_{\text{comp}}\) | Parent tensors, concat/aggregation path, \(\mathsf{Agg}\) | Composed latent input for node transform | Defines multi-parent interaction pattern that feeds mechanism outputs used in \((X,y)\). |
| 4.3 Post-processing and slicing | \(Z_{\text{post}}\) | Raw latent, sanitize/standardize, weighting, padding, slice writeback | Converter-ready latent state | Directly feeds converter extraction and latent feedback that determines emitted values. |
| 5. Mechanism Definitions | \(Y = f(X)\) | Mechanism parameters and input \(X\) | Mechanism output tensor | Family-level nonlinear map that populates latent channels later converted to outputs. |
| 5.1 Linear | \(Y_{\text{linear}}\) | \(X, M\) | Linear latent channels | Baseline low-complexity latent effects converted into numeric/categorical outputs. |
| 5.2 Quadratic | \(Y_{\text{quad}}\) | \(X_{\text{sub}}, X_{\text{aug}}, M_t\) | Interaction-heavy latent channels | Adds pairwise interaction structure reflected in emitted features/targets. |
| 5.3 Neural network | \(Y_{\text{nn}}\) | \(X, \{M_{\ell}\}, \{\phi_{\ell}\}\) | Deep nonlinear latent channels | Produces high-capacity nonlinear signals passed through converters into \((X,y)\). |
| 5.4 Tree ensemble | \(Y_{\text{tree}}\) | Splits, thresholds, leaf values | Piecewise latent channels | Introduces discontinuous region-based structure visible in output distributions. |
| 5.5 Discretization | \(Y_{\text{disc}}\) | Centers, distance exponent, nearest-center assignment | Clustered latent channels | Creates quantized/cluster-like effects in converted observables. |
| 5.6 GP/RFF | \(Y_{\text{gp}}\) | \(X_{\text{proj}}, \Omega, b, V\) | Smooth kernel-like latent channels | Adds smooth nonlinear components that converters emit as continuous structure. |
| 5.7 EM assignment | \(Y_{\text{em}}\) | Centers, scales, exponents, assignment probabilities | Mixture-like latent channels | Encodes local mixture behavior that propagates to final emitted values. |
| 5.8 Product family | \(Y_{\text{prod}}\) | Two component family outputs \(f(X), g(X)\) | Multiplicative latent channels | Adds higher-order multiplicative effects before converter extraction. |
| 6. Converter Definitions | \((X'_s, v_s)\) | Latent slice, converter kind/params | Updated latent slice plus extracted value | Defines the emission boundary from latent state to observed feature values. |
| 6.1 Numeric converter | \((X'_{\text{num}}, v_{\text{num}})\) | Numeric slice, warp params, branch choice | Numeric extracted value and feedback slice | Produces numeric entries in \((X,y)\) and latent feedback for downstream coupling. |
| 6.2 Categorical converter | \((X'_{\text{cat}}, \ell)\) | Method/variant, cardinality, logits/centers | Categorical label and feedback slice | Produces categorical feature/class outputs and associated latent feedback. |
| 7. Noise Sampling and Selection | \(\epsilon\) | Family, scale, family-specific params | Runtime noise draws | Injects stochastic variation throughout latent generation and converter inputs. |
| 7.1 Primitive family samplers | \(\epsilon_{\text{family}}\) | Primitive family choice and parameters | Family-specific noise samples | Supplies concrete perturbation draws used by samplers/mechanisms. |
| 7.2 Dataset-level family resolution | noise_spec | Requested family, mixture weights, seeded RNG | One attempt-level family spec | Determines dataset-level noise identity reflected in outputs and metadata. |
1. DAG Structure Sampling
Motivation. The graph structure is the causal skeleton that determines which variables influence which. Consider three corpora:
Corpus A (Erdős–Rényi, p=0.3): every DAG looks like a medium-density mesh
→ all datasets have ~similar dependency depth
Corpus B (Erdős–Rényi, varied p): sparser and denser meshes, but still homogeneous
per-node — no hub nodes, no isolated chains
Corpus C (Cauchy latents): some DAGs are star-shaped (one hub drives everything),
some are sparse chains (A→B→C→D), some are dense webs
→ the prior covers qualitatively different causal regimes
Corpus C is what this section produces. The Cauchy distribution's heavy tails
mean that occasionally one node's latent \(B_i\)
is extreme, creating a natural hub; another node's \(C_j\)
may be deeply negative, making it nearly parentless. This heterogeneity is
deliberate — real tabular data comes from causal structures that range from
near-independent features to deeply entangled causal chains, and the prior must
cover that full range. The edge_logit_bias
(\(\beta_{\text{edge}}\)) shifts the overall
density surface up or down, acting as the "how connected" knob for the graph
axis of effective diversity.
Primary Variable: \(G\) (DAG adjacency matrix).
Dependency Map: \(A, B_i,
C_j, \beta_{\text{edge}} \rightarrow p_{ij} \rightarrow
G_{ij}\).
Path to Final Output: \(G \rightarrow \mathsf{pa}(j)\) for each
node -> node execution order and parent inputs -> latent tensors
-> emitted X, y, and graph
metadata.
Rationale. The latent-logit Cauchy construction yields heterogeneous edge probabilities while strict upper-triangular masking guarantees acyclicity. Writing \(p_{ij}\) and \(\beta_{\text{edge}}\) explicitly keeps graph-drift behavior inspectable and testable.
sample_dag draws a strict upper-triangular adjacency
matrix.
\[A \sim \mathsf{Cauchy}(0,1), \quad B_i \sim \mathsf{Cauchy}(0,1), \quad C_j \sim \mathsf{Cauchy}(0,1)\]
Interpretation:
- \(A\): global intercept-like term shifting all edge logits together.
- \(B_i\): source-node effect for node \(i\) (how strongly node \(i\) tends to emit edges).
- \(C_j\): destination-node effect for node \(j\) (how strongly node \(j\) tends to receive edges).
For each \(i < j\):
\[p_{ij} = \sigma(A + B_i + C_j + \beta_{\text{edge}}), \quad G_{ij} \sim \mathsf{Bernoulli}(p_{ij})\]
and for \(i \ge j\), \(G_{ij}=0\).
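The sampling above can be sketched as follows. This is an illustrative reimplementation, not the runtime's sample_dag; the function name and numpy RNG usage are assumptions.

```python
import numpy as np

def sample_dag(n_nodes: int, edge_logit_bias: float, rng: np.random.Generator) -> np.ndarray:
    """Sketch of Section 1: strict upper-triangular DAG from Cauchy latent logits."""
    A = rng.standard_cauchy()            # global edge-propensity intercept
    B = rng.standard_cauchy(n_nodes)     # source-node effects B_i
    C = rng.standard_cauchy(n_nodes)     # destination-node effects C_j
    # Edge logit for every ordered pair (i, j), then sigmoid -> p_ij.
    logits = A + B[:, None] + C[None, :] + edge_logit_bias
    with np.errstate(over="ignore"):     # Cauchy tails can overflow exp harmlessly
        p = 1.0 / (1.0 + np.exp(-logits))
    G = (rng.random((n_nodes, n_nodes)) < p).astype(np.int8)
    # Acyclicity: keep only i < j (strict upper triangle).
    return np.triu(G, k=1)

rng = np.random.default_rng(0)
G = sample_dag(6, edge_logit_bias=0.0, rng=rng)
parents_of_3 = np.flatnonzero(G[:, 3])   # pa(3) = {i : G[i, 3] = 1}
```

Because the Cauchy latents are heavy-tailed, a single extreme \(B_i\) or \(C_j\) draw is enough to produce the hub or near-parentless nodes described in the motivation.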
2. Shift Runtime Parameters
Motivation. A model trained on i.i.d. synthetic data has never seen train/test mismatch during pretraining. At inference, real tasks routinely exhibit shift — and the model has no prior experience to fall back on. These three shift axes map to distinct real-world drift patterns:
graph_scale → causal structure itself changes between train and test (e.g., new
regulations alter which variables are causally relevant)
mechanism_scale → functional relationships change while the graph stays the same
(e.g., customer behavior shifts while the same features remain relevant)
variance_scale → stochastic variation changes in magnitude (e.g., sensor
degradation increases measurement noise over time)
Because the three axes are independent, a researcher can ablate each in isolation: hold graph and mechanism fixed while sweeping noise drift to measure noise-robustness, or hold noise fixed while tilting mechanism complexity. Shift-aware constructions in the prior have been shown to improve robustness to temporal distribution shifts. The deterministic mappings from config scales to runtime coefficients (\(\beta_{\text{edge}}\), \(\tau\), \(\gamma_\sigma\)) keep each axis interpretable and independently testable.
Primary Variable: \(s_{\text{shift}} = (\text{graph\_scale},
\text{mechanism\_scale}, \text{variance\_scale})\).
Dependency Map: \(s_{\text{shift}} \rightarrow (\beta_{\text{edge}},
\tau, \gamma_{\sigma}, \gamma_{\text{var}})\) via deterministic
runtime mappings.
Path to Final Output: \(\beta_{\text{edge}} \rightarrow G\), \(\tau \rightarrow \pi_f\), and \(\gamma_{\sigma} \rightarrow\) noise
amplitude; these three paths converge in node generation and sampling to
shape final X, y, and shift metadata.
Rationale. Shift controls are defined as deterministic mappings from config scales to runtime coefficients so drift remains interpretable. Separating graph, mechanism, and variance mappings keeps each drift axis analyzable in isolation.
With resolved shift scales
(graph_scale, mechanism_scale, variance_scale):
\[\beta_{\text{edge}} = \ln(2) \cdot \text{graph\_scale}\] \[\tau = \text{mechanism\_scale}\] \[\gamma_\sigma = \exp\left(\frac{\ln(2)}{2} \cdot \text{variance\_scale}\right), \quad \gamma_{\text{var}} = \gamma_\sigma^2 = 2^{\text{variance\_scale}}\]
So positive variance_scale increases global noise
variance multiplicatively.
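The three deterministic mappings can be sketched in a few lines; the function name is illustrative, not the runtime's:

```python
import math

def resolve_shift_runtime(graph_scale: float, mechanism_scale: float,
                          variance_scale: float):
    """Sketch of the Section 2 config-scale -> runtime-coefficient mappings."""
    beta_edge = math.log(2) * graph_scale                        # edge logit bias
    tau = mechanism_scale                                        # mechanism logit tilt
    gamma_sigma = math.exp(0.5 * math.log(2) * variance_scale)   # std-dev multiplier
    gamma_var = gamma_sigma ** 2                                 # equals 2 ** variance_scale
    return beta_edge, tau, gamma_sigma, gamma_var

beta_edge, tau, gamma_sigma, gamma_var = resolve_shift_runtime(1.0, 0.5, 2.0)
```

Each unit of variance_scale doubles the noise variance, and graph_scale = 1 adds ln(2) to every edge logit, which roughly doubles the odds of each edge.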
3. Mechanism Family Sampling with Mix + Tilt
Motivation. Real tabular relationships span a wide functional spectrum. A prior restricted to, say, linear + neural-network mechanisms will never produce piecewise decision boundaries (tree-like), cluster-assignment patterns (discretization), or smooth periodic surfaces (GP) — and the foundation model trained on it will struggle when encountering those patterns at inference time.
linear → y ≈ Wx + b (low complexity, smooth)
quadratic → y ≈ x'Ax + Wx + b (pairwise interactions)
neural network → y = σ(W₂ σ(W₁x + b₁) + b₂) (high-capacity smooth nonlinearity)
tree ensemble → y = Σ leaf_values[split_path] (piecewise constant, discontinuous)
discretization → y = f(nearest_center(x)) (clustered, Voronoi-like)
GP / RFF → y = V cos(Ωx + b) (smooth kernel, possibly periodic)
EM assignment → y = Σ p(k|x) · μ_k (soft mixture, local structure)
product → y = f(x) · g(x) (multiplicative interaction)
piecewise → y = staged_branches(x) (explicit control path)
The function_family_mix controls which families are
present (support) and their baseline probabilities. The
mechanism_logit_tilt (\(\tau\))
controls how much the distribution shifts toward nonlinear families
within that support. Together they form the most direct lever for changing the
functional complexity profile of the prior. If your model underperforms on
tasks with tree-like decision
boundaries, increasing tree weight in the mix and measuring the downstream
effect is the first experiment to try.
Primary Variable: \(\pi_f\) (mechanism-family sampling
distribution).
Dependency Map: \(\lambda_f, \tau, \tilde{\ell}_f, \mathcal{F}
\rightarrow \pi_f\).
Path to Final Output:
\(\pi_f \rightarrow\) sampled mechanism
family at each function draw -> latent transform type and
nonlinearity -> converted observable values in
X/y and nonlinear-mass-related
metadata.
Rationale. This parameterization separates family
support control (function_family_mix) from directional
drift (mechanism_logit_tilt). Centered logits make tilt a
relative reweighting mechanism rather than an arbitrary global scaling
change.
Let raw family weights be:
\[\lambda_f = \begin{cases} 1, & \text{if no function\_family\_mix} \\ \text{configured positive weight}, & \text{otherwise} \end{cases}\]
Normalize over positive-weight families:
\[\pi_f^{(0)} = \frac{\lambda_f}{\sum_{g \in \mathcal{F}} \lambda_g}\]
\(\pi_f^{(0)}\) is the baseline
family distribution after applying function_family_mix
support/weights and before any tilt.
If \(\tau \le 0\), then \(\pi_f = \pi_f^{(0)}\). If \(\tau > 0\), logits are tilted using centered family base logits:
\[\pi_f \propto \exp\left(\ln \pi_f^{(0)} + \tau \tilde{\ell}_f\right)\]
then normalized to sum to 1.
Configuration intuition:
- mechanism.function_family_mix controls support and baseline weights through \(\lambda_f\).
- shift.mechanism_scale sets \(\tau\) (Section 2), which reweights within that support.
- Increasing \(\tau\) shifts mass toward families with higher \(\tilde{\ell}_f\) and away from families with lower \(\tilde{\ell}_f\), changing mechanism usage frequencies that shape final X/y.
- Example: with mix {nn: 0.7, linear: 0.3}, \(\tau=0\) gives \(\pi^{(0)}_{nn}=0.7,\ \pi^{(0)}_{linear}=0.3\); with \(\tau>0\), nn gains additional mass because its centered base logit is higher than linear's.
Base logits used in \(\tilde{\ell}_f\) (ONLY IF \(\tau > 0\)):
| Family \(f\) | \(\ell_f\) |
|---|---|
| nn | 0.7 |
| tree | 0.7 |
| discretization | 0.5 |
| gp | 0.5 |
| linear | -0.8 |
| quadratic | -0.6 |
| em | -0.3 |
| product | 0.9 |
Rationale for Non-uniform Base Logits
The base logits define the direction of mechanism drift, not the default family distribution.
- When mechanism_logit_tilt <= 0, family probabilities are just normalized family weights (uniform when no function_family_mix is provided), so base logits have no effect.
- When mechanism_logit_tilt > 0, centered base logits reweight probabilities toward families with larger centered logits and away from families with smaller centered logits.
- In the current defaults, this shifts mass toward nonlinear families (nn, tree, discretization, gp, product) and away from simpler families (linear, quadratic), increasing nonlinear mechanism mass under positive tilt.
- function_family_mix still controls support and baseline weights; base logits only reweight within enabled families.
Implementation note: product is only valid when at least
one of its component families is enabled (tree,
discretization, gp, linear,
quadratic).
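The mix-plus-tilt computation can be sketched as below. This is an illustrative reimplementation; the base logits are copied from the table above (which omits piecewise, so it is omitted here too), and the function name is an assumption.

```python
import numpy as np

# Base logits from the Section 3 table (piecewise not listed there).
BASE_LOGITS = {"nn": 0.7, "tree": 0.7, "discretization": 0.5, "gp": 0.5,
               "linear": -0.8, "quadratic": -0.6, "em": -0.3, "product": 0.9}

def family_probs(mix, tau):
    """Sketch of Section 3: normalize lambda_f, then tilt by centered base logits."""
    families = list(BASE_LOGITS)
    lam = np.array([1.0 if mix is None else mix.get(f, 0.0) for f in families])
    pi0 = lam / lam.sum()                     # baseline distribution pi_f^(0)
    if tau <= 0:
        return dict(zip(families, pi0))
    ell = np.array([BASE_LOGITS[f] for f in families])
    ell_tilde = ell - ell.mean()              # centered base logits
    with np.errstate(divide="ignore"):        # log(0) -> -inf for disabled families
        logits = np.log(pi0) + tau * ell_tilde
    pi = np.exp(logits - logits.max())        # stable softmax
    pi /= pi.sum()
    return dict(zip(families, pi))

pi = family_probs({"nn": 0.7, "linear": 0.3}, tau=1.0)
```

Note that disabled families (zero mix weight) stay at probability zero after tilting: the tilt only reweights within the configured support, exactly as the rationale above states.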
4. Node Generation Pipeline
Motivation. The node pipeline is where latent causal structure becomes concrete tensors. Each step — root sampling, parent composition, sanitization, weighting, converter slicing — adds a controlled layer of variation. The design is ordered so that:
root nodes → fresh latent geometry (normal, uniform, unit-ball, correlated)
child nodes → inherit + transform parent signals via multi-parent composition
all nodes → sanitize → standardize → weight → slice for converters → scale
Multiple root-base geometries (Section 4.1) ensure that even before any mechanism fires, latent inputs already vary in shape — uniform grids, spherical point clouds, correlated ellipsoids. Multi-parent composition (Section 4.2) controls how upstream causal effects combine: additive (sum), multiplicative (product), extremal (max), or smooth-extremal (logsumexp). These choices determine the interaction patterns visible in the final tabular data — a sum aggregation produces additive parent effects, while a product aggregation creates multiplicative interactions that no single-parent mechanism can mimic.
Primary Variable: \(Z_j\) (node-\(j\) latent tensor after pipeline
transforms).
Dependency Map: root/parent inputs,
mechanism draws, width terms \((d_{\text{req}}, d_{\text{extra}})\),
random weights \(w\), and scale \(s_j \rightarrow Z_j\).
Path to
Final Output: \(Z_j
\rightarrow\) converter slices, extracted feature values, and one
selected target-node conversion -> final
X/y.
Rationale. Node generation is documented as an ordered contract because converter extraction and metadata semantics depend on stable execution order. The split into root sampling, parent composition, and latent refinement mirrors the actual runtime path.
4.1 Root-node base sampling
Primary Variable: \(Z_{\text{base}}\).
Dependency
Map: base-kind choice + distribution parameters + active noise
family \(\mathcal{D}_{\text{noise}}
\rightarrow Z_{\text{base}}\).
Path to Final
Output: \(Z_{\text{base}} \rightarrow
f(Z_{\text{base}})\) (initial node latent) -> downstream node
pipeline -> feature converter extraction ->
X/y.
Rationale. Multiple root base geometries increase latent diversity before any parent effects exist. Passing every sampled base through \(f(\cdot)\) keeps root and non-root nodes aligned under the same mechanism-family controls.
For root nodes (\(\mathsf{pa}(j) = \varnothing\)), one base kind is chosen uniformly from:
- normal: \(Z_{\text{base}} \sim \mathcal{D}_{\text{noise}}^{n \times d}\).
- uniform: elementwise \(\mathsf{Uniform}(-1,1)\).
- unit_ball: for each row, \[v \sim \mathcal{N}(0, I_d), \quad u \sim \mathsf{Uniform}(0,1), \quad z = \frac{v}{\lVert v\rVert_2} u^{1/d}\]
- normal_cov: \[E \sim \mathcal{D}_{\text{noise}}^{n \times d}, \quad A \sim \mathcal{D}_{\text{noise}}^{d \times d}, \quad Z_{\text{base}} = (E \odot w^\top) A^\top\] where \(w\) is sampled by sample_random_weights (a positive, sum-to-one, randomly permuted weight vector following a noisy power-law decay).
Then a random mechanism is applied:
\[Z_j = f(Z_{\text{base}})\]
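The four base kinds can be sketched as below. Assumptions: Gaussian draws stand in for the active noise family \(\mathcal{D}_{\text{noise}}\), a Dirichlet draw stands in for sample_random_weights, and the function name is illustrative.

```python
import numpy as np

def sample_root_base(n: int, d: int, kind: str, rng: np.random.Generator) -> np.ndarray:
    """Sketch of the Section 4.1 root base geometries."""
    if kind == "normal":
        return rng.standard_normal((n, d))
    if kind == "uniform":
        return rng.uniform(-1.0, 1.0, size=(n, d))
    if kind == "unit_ball":
        v = rng.standard_normal((n, d))
        u = rng.uniform(0.0, 1.0, size=(n, 1))
        # Project to the unit sphere, then pull the radius in by u^(1/d),
        # which gives a uniform draw inside the unit ball.
        return v / np.linalg.norm(v, axis=1, keepdims=True) * u ** (1.0 / d)
    if kind == "normal_cov":
        E = rng.standard_normal((n, d))
        A = rng.standard_normal((d, d))
        w = rng.dirichlet(np.ones(d))   # stand-in for sample_random_weights
        return (E * w) @ A.T            # (E ⊙ w^T) A^T: correlated ellipsoid
    raise ValueError(kind)

Z = sample_root_base(1000, 5, "unit_ball", rng=np.random.default_rng(0))
```

The unit_ball branch illustrates why the \(u^{1/d}\) factor matters: without it, points would pile up near the center instead of filling the ball uniformly.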
4.2 Parent composition (non-root nodes)
Primary Variable: \(Z_{\text{comp}}\) (composed parent latent
input to node transform).
Dependency Map: parent
tensors \(\{Z_{p_r}\}\) + composition
path choice + aggregation operator \(\mathsf{Agg} \rightarrow
Z_{\text{comp}}\).
Path to Final Output:
\(Z_{\text{comp}} \rightarrow\) node
mechanism output -> updated \(Z_j\)
-> feature converter extraction plus target-node conversion ->
emitted X/y.
Rationale. The 50/50 concat-versus-aggregation branch injects both feature-mixing and commutative interaction patterns. This broadens functional regimes without changing the node interface contract.
If \(|\mathsf{pa}(j)|=1\), apply one random mechanism to that parent tensor.
If \(|\mathsf{pa}(j)| \ge 2\), choose one path with equal probability:
- Concatenation path: \[Z_{\text{in}} = [Z_{p_1} | Z_{p_2} | \dots | Z_{p_k}], \quad Z_j = f(Z_{\text{in}})\]
- Aggregation path: \[Y_r = f_r(Z_{p_r}), \quad Z_j = \mathsf{Agg}(Y_1, \dots, Y_k)\] with \(\mathsf{Agg}\) sampled uniformly from: \[\sum_r Y_r, \quad \prod_r Y_r, \quad \max_r Y_r, \quad \mathsf{LSE}(Y_1, \dots, Y_k)\]
Concatenation semantics (implementation-faithful):
- If parent tensors are \(Z_{p_r} \in \mathbb{R}^{n \times d_r}\), concatenation is along feature axis: \[Z_{\text{in}} \in \mathbb{R}^{n \times \sum_{r=1}^k d_r}.\]
- The concatenated tensor is passed through one random mechanism \(f(\cdot)\) with node output width target
\(d\) (the node
total_dim from Section 4.3), so: \[f: \mathbb{R}^{n \times \sum_r d_r} \rightarrow \mathbb{R}^{n \times d}.\]
- In this branch, aggregation kind is not used; Agg only applies in the non-concatenation branch.
Example:
- Three parents with shapes \((n,4)\), \((n,6)\), \((n,3)\) concatenate to \((n,13)\).
- If node target width is \(d=9\), the mechanism maps \((n,13)\rightarrow(n,9)\) before post-processing and converter slicing.
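The aggregation path can be sketched as a single reduction over a stacked parent axis; the function name and string kind labels are illustrative, not the runtime's.

```python
import numpy as np

def aggregate(parent_outputs, kind: str) -> np.ndarray:
    """Sketch of Section 4.2: each Y_r = f_r(Z_{p_r}) is (n, d)-aligned,
    and one sampled reduction is applied across the parent axis."""
    Y = np.stack(parent_outputs, axis=0)           # shape (k, n, d)
    if kind == "sum":
        return Y.sum(axis=0)
    if kind == "prod":
        return Y.prod(axis=0)
    if kind == "max":
        return Y.max(axis=0)
    if kind == "logsumexp":
        m = Y.max(axis=0)                          # stabilized LSE over parents
        return m + np.log(np.exp(Y - m).sum(axis=0))
    raise ValueError(kind)

rng = np.random.default_rng(0)
Y1, Y2, Y3 = (rng.standard_normal((4, 2)) for _ in range(3))
Z = aggregate([Y1, Y2, Y3], "logsumexp")
```

The stabilized form subtracts the elementwise max before exponentiating, which is why LSE "smoothly emphasizes larger parent responses" without overflowing.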
4.3 Latent post-processing and converter slicing
Primary Variable: \(Z_{\text{post}}\) (post-processed latent
tensor used for conversion).
Dependency Map: raw
node latent + sanitize/standardize operations + random weights \(w\) + normalization + padding noise +
slice/writeback updates \(\rightarrow
Z_{\text{post}}\).
Path to Final Output:
\(Z_{\text{post}} \rightarrow\)
per-spec extracted values \((v_s)\) and
rewritten latent slices -> assigned feature outputs plus one selected
target-node conversion -> final X/y.
Rationale. Sanitization, standardization, weighting, and norm scaling are applied before conversion to keep numerics stable across mechanism families. The slice-and-writeback loop formalizes how observable values are extracted while preserving latent continuity.
Current default semantic note. In the shipped default
prior, both feature converters and the target converter are attached to
sampled latent nodes. One selected latent node emits y
directly, and optional missingness acts later as an observation process
over emitted features only.
For each node, converter specs define required width:
\[d_{\text{req}} = \sum_s \max(1, d_s), \quad d = d_{\text{req}} + d_{\text{extra}}, \quad d_{\text{extra}} = \max(1, \lfloor \mathsf{LogUniform}(1, 32) \rfloor)\]
After mechanism output:
\[Z_j \leftarrow \mathsf{clip}(\mathsf{nan\_to\_num}(Z_j), -10^6, 10^6)\] \[Z_j \leftarrow \mathsf{standardize}(Z_j)\] \[Z_j \leftarrow Z_j \odot w^\top\] \[Z_j \leftarrow \frac{Z_j}{\max\left(\frac{1}{n} \sum_{r=1}^n |Z_{j,r:}|_2, 10^{-6}\right)}\]
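A minimal numpy sketch of this post-processing chain, assuming given column weights `w`; the small epsilon in the standardization step is an added safeguard not stated in the formulas, and the runtime operates on torch tensors rather than numpy arrays.

```python
import numpy as np

def postprocess(z, w):
    """Sketch of the post-mechanism chain: sanitize, standardize,
    reweight columns, then rescale by the mean row L2 norm."""
    z = np.clip(np.nan_to_num(z), -1e6, 1e6)              # sanitize
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-6)     # standardize
    z = z * w                                             # column weights
    row_norms = np.linalg.norm(z, axis=1)
    return z / max(row_norms.mean(), 1e-6)                # norm scaling
```

After the final division, the mean row L2 norm is exactly 1, which is the invariant the norm-scaling step establishes.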
Converter loop semantics:
- Slice per-spec view \(V_s\) from current cursor over columns.
- Pad missing columns with sampled noise if needed.
- Apply converter: \((X'_s, v_s) = \mathsf{Converter}_s(V_s)\).
- Write \(X'_s\) back into the same slice (shape-padded/truncated as needed).
- Store \(v_s\) as extracted observable value for the feature key.
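The loop above can be sketched as a cursor over columns. This is a simplified stand-in: identity converters replace the numeric/categorical converters of Section 6, the noise-padding step is omitted, and the `specs` structure (feature key to width/converter pair) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_converters(z, specs):
    """Sketch of the slice-and-writeback loop: each spec takes a view of
    `width` columns from a moving cursor, produces (x_out, value), and
    writes x_out back into the same slice."""
    z = z.copy()
    values, cursor = {}, 0
    for key, (width, convert) in specs.items():
        view = z[:, cursor:cursor + width]
        x_out, v = convert(view)
        z[:, cursor:cursor + width] = x_out   # writeback into same slice
        values[key] = v                        # extracted observable value
        cursor += width
    return z, values

identity = lambda view: (view, view[:, 0])
z = rng.normal(size=(10, 6))
z_post, vals = run_converters(z, {"f0": (2, identity), "f1": (3, identity)})
```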
Final node scaling:
\[Z_j \leftarrow s_j Z_j, \quad s_j \sim \mathsf{LogUniform}(0.1, 10)\]
5. Mechanism Family Definitions
Motivation This catalog is intentionally broad because each family occupies a qualitatively distinct region of function space. Narrowing the catalog narrows the prior:
- Remove trees → prior cannot produce sharp, axis-aligned decision boundaries
- Remove GP → prior cannot produce smooth periodic or multi-scale surfaces
- Remove quadratic → prior loses explicit pairwise feature interactions
- Remove product → prior loses multiplicative cross-family effects
The breadth of this catalog is what makes mechanism diversity a meaningful axis of effective diversity. Each family's parameterization (random weights, depths, kernel frequencies, etc.) adds within-family variation on top of the between-family structural differences. For a researcher evaluating whether to add or remove a family, the question is: does this family produce latent structure that no other family can approximate? If yes, removing it creates a coverage gap; if a proposed new family largely duplicates an existing one, adding it increases code surface without improving diversity.
Primary Variable: \(Y =
f(X)\) (family-specific mechanism output).
Dependency
Map: mechanism-family parameters + input \(X \rightarrow Y\).
Path to
Final Output: \(Y
\rightarrow\) node latent updates and feedback slices ->
converter-extracted observables -> final
X/y.
Rationale. The mechanism catalog intentionally spans
smooth, piecewise, assignment-based, and compositional transforms to
vary functional complexity under drift and mix controls. Each family
maps to a concrete sampler implementation in
random_functions.py.
Inputs to mechanism families are first sanitized and standardized.
5.1 Linear
Primary Variable: \(Y_{\text{linear}}\).
Dependency
Map: \(X, M \rightarrow
Y_{\text{linear}} = X M^\top\).
Path to Final
Output: \(Y_{\text{linear}}
\rightarrow\) latent columns for the node -> downstream
converter extraction -> numeric/categorical outputs in X
or y.
Rationale. Linear maps act as a low-complexity baseline and calibration point for drift against nonlinear families. Row-normalized random matrices preserve directional mixing while limiting runaway scale.
\[Y = X M^\top\]
where \(M\) is a sampled random matrix.
5.2 Quadratic
Primary Variable: \(Y_{\text{quad}}\).
Dependency
Map: \(X_{\text{sub}},
X_{\text{aug}}, M_t \rightarrow Y_{\text{quad}}\).
Path to Final Output: \(Y_{\text{quad}} \rightarrow\)
interaction-heavy latent signal -> converter extraction ->
nonlinear structure in emitted features and in any downstream target-node
ancestry.
Rationale. Quadratic forms add pairwise interactions not available in pure linear maps. Subspace capping and bias augmentation balance expressivity with numerical stability and compute control.
Let \(X_{\text{sub}}\) be either all columns or a random 20-column subset if input width exceeds 20. Augment with bias column: \(X_{\text{aug}} = [X_{\text{sub}}, \mathbf{1}]\). For each output channel \(t\):
\[Y_t = \mathsf{diag}\!\left(X_{\text{aug}} M_t X_{\text{aug}}^\top\right)\]
or equivalently per row: \(y_i = x_{\text{aug},i}^\top M_t\, x_{\text{aug},i}\) — a general (not necessarily symmetric) quadratic form. The bias augmentation means this captures quadratic + linear + constant terms.
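The per-row quadratic form can be evaluated for all rows at once with an `einsum` contraction; this sketch assumes numpy in place of the torch runtime and illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in = 64, 5

x = rng.normal(size=(n, d_in))
x_aug = np.concatenate([x, np.ones((n, 1))], axis=1)   # bias column

# One output channel: a general (not necessarily symmetric) quadratic
# form y_i = x_aug_i^T M x_aug_i, computed for all rows at once.
m = rng.normal(size=(d_in + 1, d_in + 1))
y = np.einsum("ni,ij,nj->n", x_aug, m, x_aug)

# Equivalent explicit evaluation for the first row.
y0 = x_aug[0] @ m @ x_aug[0]
```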
5.3 Neural network
Primary Variable: \(Y_{\text{nn}}\).
Dependency
Map: \(X\), sampled layer matrices \(\{M_\ell\}\), and sampled activations \(\{\phi_\ell\}\) \(\rightarrow Y_{\text{nn}}\).
Path to Final Output: \(Y_{\text{nn}} \rightarrow\) high-capacity
latent states -> converter extraction -> complex nonlinear
observables in X/y.
Rationale. Random-depth MLPs provide high-capacity nonlinear transforms with stochastic activation structure. Optional pre/post activation steps increase shape diversity while retaining a simple sampled-layer pipeline.
A random depth/width MLP with number of weight matrices sampled uniformly from \(\{1, 2, 3\}\) (0–2 hidden layers) and hidden width from \(\lfloor \mathsf{LogUniform}(1, 127) \rfloor\):
\[Y^{(0)} = X \text{ (optional pre-activation with prob 0.5)}\] \[Y^{(\ell+1)} = \phi_\ell(Y^{(\ell)} M_\ell^\top)\]
with activation after hidden layers, optional final activation (prob 0.5).
Activation sampler includes fixed and parametric families.
- Fixed activations: `tanh`, `leaky_relu`, `elu`, `identity`, `silu`, `swiglu`, `relu`, `relu_sq`, `softplus`, `sign`, `gauss`, `exp`, `sin`, `square`, `abs`, `softmax`, `logsigmoid`, `logabs`, `sigmoid`, `round`, `mod1`, `selu`, `relu6`, `hardtanh`, `indicator_01`, `onehot_argmax`, `argsort`, `rank`.
- Parametric families: `relu_pow`, `signed_pow`, `inv_pow`, `poly`, `gumbel_softmax`.
`gumbel_softmax` applies row-wise Gumbel perturbations followed by
temperature-scaled softmax. When activation standardization is enabled later in
the pipeline, those simplex-like row sums are no longer guaranteed to remain
exactly 1 in the emitted latent state.
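A reduced sketch of the sampled-layer pipeline: the pre/post-activation coin flips and the full activation catalog are omitted, `tanh` stands in for a sampled activation, the output width is an illustrative constant, and numpy replaces the torch runtime.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp(x, n_layers, hidden, act=np.tanh):
    """n_layers weight matrices (so n_layers - 1 hidden layers),
    with the activation applied between successive matrices."""
    d_out = 3
    widths = [x.shape[1]] + [hidden] * (n_layers - 1) + [d_out]
    y = x
    for ell in range(n_layers):
        m = rng.normal(size=(widths[ell + 1], widths[ell]))
        y = y @ m.T
        if ell < n_layers - 1:   # activation after hidden layers only
            y = act(y)
    return y

y = random_mlp(rng.normal(size=(32, 4)), n_layers=3, hidden=8)
```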
5.4 Tree ensemble (oblivious decision trees)
Primary Variable: \(Y_{\text{tree}}\).
Dependency
Map: split features, thresholds, leaf indices, and leaf values
\(\rightarrow Y_{\text{tree}}\).
Path to Final Output: \(Y_{\text{tree}} \rightarrow\) piecewise
latent regions -> converter extraction -> discontinuous/tabular
decision patterns in X/y.
Rationale. Oblivious trees introduce thresholded piecewise behavior that complements smooth families. Variance-weighted split sampling biases random partitions toward informative dimensions without requiring supervised fitting.
Number of trees: \(T = \max(1, \lfloor \mathsf{LogUniform}(1, 32) \rfloor)\).
For each tree \(t\):
- Depth \(d_t \sim \mathsf{UniformInt}\{1, \dots, 7\}\).
- Sample split feature indices (variance-weighted probabilities when available).
- Sample thresholds by indexing sampled training rows.
- Compute leaf index via packed split bits.
- Sample leaf values \(V_t \in \mathbb{R}^{2^{d_t} \times d_{\text{out}}}\) from noise sampler.
Prediction is averaged:
\[Y = \frac{1}{T} \sum_{t=1}^T V_t[\mathsf{leaf}(X)]\]
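A single oblivious tree can be sketched as follows; uniform split-feature sampling stands in for the variance-weighted probabilities, one tree stands in for the averaged ensemble, and a plain Gaussian draw stands in for the noise sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, depth = 200, 6, 2, 3

x = rng.normal(size=(n, d_in))

# Oblivious structure: each depth level applies one (feature, threshold)
# split to every row; the per-level split bits pack into a leaf index.
feats = rng.integers(0, d_in, size=depth)
thresholds = x[rng.integers(0, n, size=depth), feats]  # index sampled rows
bits = (x[:, feats] > thresholds).astype(int)          # (n, depth)
leaf = bits @ (2 ** np.arange(depth))                  # packed bit index

leaf_values = rng.normal(size=(2 ** depth, d_out))     # sampled leaf values
y = leaf_values[leaf]                                  # (n, d_out) lookup
```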
5.5 Discretization
Primary Variable: \(Y_{\text{disc}}\).
Dependency
Map: sampled centers + distance exponent \(p\) + nearest-center index + linear map
\(\rightarrow Y_{\text{disc}}\).
Path to Final Output: \(Y_{\text{disc}} \rightarrow\) clustered
latent representation -> converter extraction ->
quantized/cluster-like observables in
X/y.
Rationale. Nearest-center assignment creates clustered discontinuities and quantization effects that smooth mechanisms do not capture. A final linear map preserves shared output-shape contracts across families.
Number of centers: \(K = \mathsf{clamp}(\lfloor \mathsf{LogUniform}(2, 128) \rfloor, 2, n)\).
Sample centers from rows, compute \(L_p\)-style distances (with sampled \(p \sim \mathsf{LogUniform}(0.5, 4)\)), assign nearest center, then apply linear map:
\[k^{\ast}(i) = \arg\min_k \sum_q |X_{iq} - c_{kq}|^p, \quad Y = \mathsf{Linear}(c_{k^{\ast}(i)})\]
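A numpy sketch of the assignment step, with illustrative values for \(K\), \(p\), and the output width, and a plain random matrix standing in for the final \(\mathsf{Linear}\) map.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, k, p = 100, 4, 8, 1.5

x = rng.normal(size=(n, d_in))
centers = x[rng.choice(n, size=k, replace=False)]   # centers drawn from rows

# L_p-style distances to each center, then nearest-center assignment.
dists = (np.abs(x[:, None, :] - centers[None, :, :]) ** p).sum(axis=2)
assign = dists.argmin(axis=1)                       # (n,) center indices

# Final linear map applied to each row's assigned center coordinates.
m = rng.normal(size=(3, d_in))
y = centers[assign] @ m.T                           # (n, 3)
```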
5.6 GP/RFF approximation
Primary Variable: \(Y_{\text{gp}}\).
Dependency
Map: \(X_{\text{proj}}, \Omega, b, V
\rightarrow \Phi \rightarrow Y_{\text{gp}}\).
Path to
Final Output: \(Y_{\text{gp}}
\rightarrow\) smooth kernel-like latent behavior -> converter
extraction -> continuous nonlinear structure in
X/y.
Rationale. Random Fourier features approximate
kernel-style smooth nonlinearities at fixed computational cost. The widened
gp family now keeps one output interface while varying both the
frequency-sampling branch and the internal spectral variant, which broadens
the kinds of smooth or oscillatory behavior that can appear without adding a
new public config surface.
Uses \(P=256\) random features and two branches for sampling frequency matrix \(\Omega\). Core output:
\[\Phi = \cos(X_{\text{proj}} \Omega^\top + b), \quad Y = \frac{1}{\sqrt{P}} \Phi V^\top\]
where \(b\) is random phase and \(V\) is sampled from noise.
Variant-specific widening keeps the same readout contract:
- `standard`: uses the base branch-specific frequency draw with the same cosine readout shown above.
- `multiscale`: rescales one half of the random features toward a lower spectral band and the other half toward a higher spectral band before the cosine readout, increasing multi-resolution behavior.
- `periodic`: applies sampled integer harmonics \(h_p \in \{1,\dots,5\}\) to the phase logits and uses a mixed cosine/sine readout \[\Phi_p = \frac{\cos(h_p \theta_p + b_p) + \sin(h_p \theta_p - b_p)}{\sqrt{2}}\] to bias the latent toward more explicitly oscillatory structure.
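The standard variant's core readout follows directly from the formula. In this sketch, numpy stands in for the runtime and a plain Gaussian draw replaces the branch-specific frequency sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, P = 128, 3, 2, 256

x = rng.normal(size=(n, d_in))

# Random Fourier features: cosine of random projections plus random
# phase, then a linear readout scaled by 1/sqrt(P).
omega = rng.normal(size=(P, d_in))         # frequency matrix
b = rng.uniform(0, 2 * np.pi, size=P)      # random phase
phi = np.cos(x @ omega.T + b)              # (n, P) bounded features

v = rng.normal(size=(d_out, P))            # readout sampled from noise
y = (phi @ v.T) / np.sqrt(P)               # (n, d_out)
```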
5.7 EM-style assignment
Primary Variable: \(Y_{\text{em}}\).
Dependency
Map: centers \(c_k\), scales
\(\sigma_k\), exponents \((p,q)\), and assignment probabilities \(P_{ik} \rightarrow Y_{\text{em}}\).
Path to Final Output: \(Y_{\text{em}} \rightarrow\) mixture-like
local latent effects -> converter extraction -> locally structured
observables in X/y.
Rationale. Soft center assignments add mixture-like local behavior and scale-sensitive responses. The linear readout over assignment probabilities keeps compatibility with common downstream handling.
Sample centers \(c_k\), scales \(\sigma_k\), and exponents \(p, q\):
- Centers: \(K = \max(2, \lfloor \mathsf{LogUniform}(2, \max(16, 2 d_{\text{out}})) \rfloor)\)
- Scales: \(\sigma_k = \exp(\text{noise} \cdot 0.1)\) (log-normal, concentrated near 1)
- Exponents: \(p \sim \mathsf{LogUniform}(1, 4),\; q \sim \mathsf{LogUniform}(1, 2)\)
\[d_{ik} = \left(\sum_q |X_{iq} - c_{kq}|^p\right)^{1/p}\] \[\mathsf{logit}_{ik} = -\frac{1}{2} \log(2\pi\sigma_k^2) - \left(\frac{d_{ik}}{\max(\sigma_k, 10^{-6})}\right)^q\] \[P_{ik} = \mathsf{softmax}_k(\mathsf{logit}_{ik}), \quad Y = \mathsf{Linear}(P)\]
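The three equations above can be sketched as follows, with illustrative shapes, a max-shifted softmax for numerical stability, and numpy in place of the torch runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, K, p, q = 80, 3, 5, 2.0, 1.5

x = rng.normal(size=(n, d_in))
centers = rng.normal(size=(K, d_in))
sigma = np.exp(rng.normal(size=K) * 0.1)   # scales concentrated near 1

# L_p distances, Gaussian-like logits, then a row-wise softmax.
d_ik = (np.abs(x[:, None, :] - centers[None]) ** p).sum(axis=2) ** (1 / p)
logits = -0.5 * np.log(2 * np.pi * sigma ** 2) \
         - (d_ik / np.maximum(sigma, 1e-6)) ** q
logits -= logits.max(axis=1, keepdims=True)   # stable softmax shift
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

m = rng.normal(size=(2, K))                   # linear readout over P
y = probs @ m.T
```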
5.8 Product family
Primary Variable: \(Y_{\text{prod}}\).
Dependency
Map: component family draws \(f(X)\) and \(g(X)\) from allowed support \(\rightarrow Y_{\text{prod}} = f(X)\odot
g(X)\).
Path to Final Output: \(Y_{\text{prod}} \rightarrow\)
multiplicative latent interactions -> converter extraction ->
higher-order effects in final X/y.
Rationale. Multiplying two component mechanisms introduces multiplicative interactions and sharper nonlinear responses than additive composition alone. Restricting component candidates to vetted families maintains stability under family-mix masking.
Pick two component families \(f, g\) from the allowed set (tree, discretization, gp, linear, quadratic), filtered by the enabled family mix, then:
\[Y = f(X) \odot g(X)\]
6. Converter Definitions
Motivation Converters are the emission boundary — they translate latent causal state into the tabular features a downstream model actually sees. Even if the latent mechanisms are diverse, monotonous converters produce monotonous observed distributions:
```
latent Z ──→ numeric converter ──→ observed feature column
             ├─ identity pass-through  → Gaussian-looking marginals
             ├─ Kumaraswamy warp       → skewed / beta-like marginals
             └─ power / log transform  → heavy-tailed or compressed marginals

latent Z ──→ categorical converter ──→ observed class labels
             ├─ argmax on logits       → hard cluster boundaries
             ├─ softmax + Categorical  → soft probabilistic assignment
             └─ k-means style centers  → Voronoi-like class regions
```
The numeric converter's optional Kumaraswamy warp varies marginal distributions beyond the standard Gaussian shape; the categorical converter's multiple method/variant combinations produce diverse class-assignment patterns. Without converter diversity, even mechanistically diverse latent signals could produce tabular data where every numeric column looks Gaussian and every categorical column uses the same assignment logic — a subtle effective-diversity gap that is easy to overlook.
Current default semantic note. The default target path
attaches a target converter to one selected latent DAG node rather than
sampling y from a separate post-feature head.
Primary Variable: \((X'_s, v_s)\) (converter output state
and extracted observable value for spec \(s\)).
Dependency Map:
latent slice, converter kind, and converter parameters \(\rightarrow (X'_s, v_s)\).
Path to Final Output: extracted \(v_s\) values are assigned to feature
slots and become emitted feature values; one selected latent node emits
y through its target converter; later missingness may censor
emitted X; \(X'_s\) is written back to latent for
downstream coupling.
Rationale. Converters are separated from mechanism families because observable typing is an output-contract concern, not a latent-mechanism concern. This decoupling lets one latent pipeline serve multiple feature schemas.
6.1 Numeric converter
Primary Variable: \((X'_{\text{num}},
v_{\text{num}})\).
Dependency Map: input
slice \(X\), warp parameters \((a,b)\), and identity/warp branch choice
\(\rightarrow (X'_{\text{num}},
v_{\text{num}})\).
Path to Final Output:
\(v_{\text{num}}\) is emitted as
numeric feature value in X; \(X'_{\text{num}}\) feeds back into node
latent state.
Rationale. The identity branch preserves raw latent structure when extra warping is unnecessary. The optional bounded Kumaraswamy-like warp provides controllable marginal shaping while staying numerically stable.
Given input slice \(X\) and raw extracted scalar \(v = X_{:,0}\):
- with probability 0.5: identity output (no warp),
- otherwise sample \(a,b \sim \mathsf{LogUniform}(0.2,5)\) and apply columnwise min-max + warp:
\[\hat{X} = \frac{X - X_{\min}}{\max(X_{\max} - X_{\min}, 10^{-6})}\] \[X' = 1 - \left(1 - \mathsf{clip}(\hat{X}, 0, 1)^a\right)^b\]
Converter returns \((X', v)\).
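A sketch of the two branches, assuming numpy and a `Generator`-based coin flip in place of the runtime's samplers; the \(\mathsf{LogUniform}\) draw is realized as the exponential of a uniform draw in log space.

```python
import numpy as np

rng = np.random.default_rng(0)

def numeric_converter(x, rng):
    """Sketch of the numeric converter: extract v from column 0, then
    with probability 0.5 return the slice unchanged, otherwise apply a
    columnwise min-max followed by a Kumaraswamy-like warp."""
    v = x[:, 0].copy()
    if rng.random() < 0.5:
        return x, v                                  # identity branch
    a, b = np.exp(rng.uniform(np.log(0.2), np.log(5), size=2))
    x_hat = (x - x.min(axis=0)) / np.maximum(x.max(axis=0) - x.min(axis=0), 1e-6)
    x_out = 1 - (1 - np.clip(x_hat, 0, 1) ** a) ** b  # bounded warp in [0, 1]
    return x_out, v

x_out, v = numeric_converter(rng.normal(size=(50, 3)), rng)
```

Note that `v` is extracted before any warping, so the emitted observable reflects the raw latent column while the written-back slice may be warped.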
6.2 Categorical converter
Primary Variable: \((X'_{\text{cat}}, \ell)\) where \(\ell\) is categorical label output.
Dependency Map: method/variant choice, cardinality
\(C\), centers or logits/probabilities
\(\rightarrow (X'_{\text{cat}},
\ell)\).
Path to Final Output: \(\ell\) is emitted as categorical
feature/class target in X/y; \(X'_{\text{cat}}\) is fed back into
latent columns.
Rationale. Joint method-variant sampling separates
class assignment logic from latent representation feedback into node
state. This increases categorical behavior diversity while preserving a
consistent (out, labels) converter contract.
A joint (method, variant) pair is sampled from:
- `neighbor`: `input`, `index_repeat`, `center`, `center_random_fn`
- `softmax`: `input`, `index_repeat`, `softmax_points`
Let target cardinality be \(C = \max(2, n_{\text{categories}})\).
Neighbor method:
\[\text{label}_i = \arg\min_k \sum_q |X_{iq} - c_{kq}|^p, \quad p \sim \mathsf{LogUniform}(0.5, 4)\]
Softmax method:
\[L = a \mathsf{standardize}(X_{\text{proj}}) + b, \quad a \sim \mathsf{LogUniform}(0.1, 10), \quad b_k = \ln(u_k + 10^{-4})\] \[\text{label}_i \sim \mathsf{Categorical}(\mathsf{softmax}(L_i))\]
Output representation \(X'\) is variant-dependent (input copy, repeated index, center lookup, center transformed by random function, or random softmax points). Labels are finally reduced modulo \(C\).
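The softmax method can be sketched as follows; the projection `x[:, :C]` is an illustrative stand-in for however the runtime forms \(X_{\text{proj}}\), and numpy replaces the torch samplers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, C = 60, 4, 3

x = rng.normal(size=(n, d_in))

# Softmax method: scaled standardized projection plus a log-weight
# offset, then one categorical draw per row, reduced modulo C.
x_proj = x[:, :C]                                    # illustrative projection
z = (x_proj - x_proj.mean(axis=0)) / (x_proj.std(axis=0) + 1e-6)
a = np.exp(rng.uniform(np.log(0.1), np.log(10)))     # LogUniform(0.1, 10)
b = np.log(rng.uniform(size=C) + 1e-4)               # b_k = ln(u_k + 1e-4)
logits = a * z + b

probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(C, p=p_i) for p_i in probs]) % C
```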
7. Noise Sampling and Runtime Selection
Motivation Noise is the stochastic axis of the prior, independent of mechanism complexity. Restricting noise to Gaussian produces datasets whose residuals always have light, symmetric tails — but real tabular residuals are often heavy-tailed (financial returns), asymmetric (truncated sensor readings), or heteroscedastic. The three primitive families span a tail-weight gradient:
```
Gaussian  → light tails,    exp(-x²/2)           → most benign residual regime
Laplace   → moderate tails, exp(-|x|)            → double-exponential, sharper peak
Student-t → heavy tails,    (1+x²/ν)^(-(ν+1)/2)  → occasional extreme outliers (ν > 2)
mixture   → per-dataset draw from the above      → corpus spans all three regimes
```
The mixture mode resolves one family per dataset attempt
(Section 7.2), so a corpus generated with mixture naturally
contains Gaussian-noise datasets, Laplace-noise datasets, and Student-t-noise
datasets — broadening the noise axis of effective diversity without requiring
separate generate runs. Noise diversity in the prior has been shown to improve
performance on real datasets with non-Gaussian residual structure.
Primary Variable: \(\epsilon\) (active noise draw process for
the attempt).
Dependency Map: noise family, scale,
and family-specific parameters \(\rightarrow
\epsilon\).
Path to Final Output: \(\epsilon\) perturbs points, matrices,
weights, and explicit noise draws -> latent variability and converter
inputs -> final X/y dispersion and noise
metadata.
Rationale. Noise is modeled as its own axis so distributional perturbation can be controlled independently from mechanism complexity. Distinguishing primitive samplers from runtime selection clarifies where stochasticity enters.
7.1 Primitive family samplers
Primary Variable: \(\epsilon_{\text{family}}\) (draws from one
resolved primitive family).
Dependency Map:
selected primitive family + shape + scale (+ df or mixture weights when
applicable) \(\rightarrow
\epsilon_{\text{family}}\).
Path to Final
Output: \(\epsilon_{\text{family}}\) is used by
samplers/mechanisms across generation -> affects latent tensors and
emitted X/y.
Rationale. Multiple primitive families cover light-tail, double-exponential, and heavy-tail regimes under one API surface. Explicit formulas keep distributional assumptions auditable across backends.
For \(\mathsf{sample\_noise}(\text{shape, family, scale, }\dots)\):
- Gaussian: i.i.d. \(\mathcal{N}(0, 1)\)
- Laplace (inverse CDF): \[U \sim \mathsf{Uniform}(\epsilon, 1-\epsilon), \quad X = \begin{cases} \ln(2U), & U < 0.5 \\ -\ln(2(1-U)), & U \ge 0.5 \end{cases}\]
- Student-t: \[X = \frac{Z}{\sqrt{\chi^2_\nu / \nu}}, \quad Z \sim \mathcal{N}(0, 1), \quad \nu > 2\]
- Mixture: sample a component assignment per element from normalized weights over `{gaussian, laplace, student_t}` and draw from that component.

All outputs are scaled by `scale`.
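The three primitive samplers can be sketched from the formulas above, with numpy in place of the runtime backends; `eps = 1e-10` is an illustrative choice for the Laplace bound \(\epsilon\), and `nu` is an illustrative default for the Student-t degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(shape, family, scale=1.0, nu=5.0, rng=rng):
    """Sketch of the primitive noise samplers."""
    if family == "gaussian":
        x = rng.normal(size=shape)
    elif family == "laplace":
        # Inverse-CDF construction matching the piecewise formula.
        eps = 1e-10
        u = rng.uniform(eps, 1 - eps, size=shape)
        x = np.where(u < 0.5, np.log(2 * u), -np.log(2 * (1 - u)))
    elif family == "student_t":
        # Ratio construction: Z / sqrt(chi^2_nu / nu).
        z = rng.normal(size=shape)
        chi2 = rng.chisquare(nu, size=shape)
        x = z / np.sqrt(chi2 / nu)
    else:
        raise ValueError(family)
    return x * scale
```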
7.2 Dataset-level noise-family resolution
Primary Variable: noise_spec (resolved
attempt-level noise sampling specification).
Dependency
Map: requested family, mixture weights, and deterministic seed
stream \(\rightarrow\) runtime
selection \(\rightarrow\)
noise_spec.
Path to Final Output:
noise_spec parameterizes all downstream noise draws in the
attempt -> dataset-level distributional character in final
X/y plus noise-distribution metadata.
Rationale. In mixture mode, sampling one family per dataset attempt preserves coherent dataset-level identity instead of per-element family switching. Seeded selection guarantees reproducibility and metadata traceability.
Generation runtime resolves one NoiseRuntimeSelection
per dataset attempt:
- If the requested family is not `mixture`: the sampled family equals the requested family.
- If the requested family is `mixture`:
  - Normalize the configured mixture weights.
  - Use the deterministic keyed RNG stream `KeyedRng(run_seed).keyed("dataset", i, "noise_runtime", "family_selection").torch_rng(...)`.
  - Sample exactly one component family for the dataset.
The resulting `noise_spec` uses that sampled family and base scale for all downstream draws in the attempt. Shift noise drift applies through the multiplicative scale factor \(\gamma_\sigma\).
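The per-attempt resolution can be sketched as follows; a seeded numpy `Generator` stands in for the `KeyedRng` torch stream, and the weight-dictionary shape is illustrative.

```python
import numpy as np

def resolve_noise_family(requested, weights, seed):
    """Sketch of per-attempt family resolution: non-mixture requests
    pass through unchanged; mixture requests draw exactly one component
    family from normalized weights using a deterministic seeded stream."""
    if requested != "mixture":
        return requested
    families = ["gaussian", "laplace", "student_t"]
    w = np.asarray([weights[f] for f in families], dtype=float)
    w /= w.sum()                        # normalize configured weights
    rng = np.random.default_rng(seed)   # keyed-stream stand-in
    return rng.choice(families, p=w)

fam = resolve_noise_family(
    "mixture", {"gaussian": 2, "laplace": 1, "student_t": 1}, seed=7
)
```

Because the stream is seeded per attempt, the same run seed always resolves the same family, which is what makes the noise-distribution metadata traceable.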