Noise Diversification

Noise family selection, mixture modes, and per-dataset resolution.

Real-world tabular data rarely has Gaussian residuals. Financial returns exhibit fat tails; sensor readings show double-exponential error profiles; biological measurements often have heavy-tailed, heteroscedastic variation. A synthetic prior that restricts noise to Gaussian systematically undercovers these regimes — the foundation model never sees heavy-tailed stochasticity during pretraining, so its in-context learning on tasks with non-Gaussian noise must generalize from zero prior exposure.

Prior design realism, including noise diversity, materially affects tabular foundation model performance. dagzoo exposes three noise families plus a mixture mode so researchers can control and measure the noise axis of their prior independently from mechanism complexity:

Gaussian    →  light tails, symmetric            →  the "easy" baseline
Laplace     →  sharper peak, heavier tails       →  double-exponential residuals
Student-t   →  heavy tails, occasional outliers  →  requires df > 2 for finite variance
mixture     →  per-dataset draw from the above   →  corpus spans all three

Use noise-family workflows when you want non-Gaussian stochastic regimes while retaining deterministic seed behavior and explicit metadata reporting.
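
To make "per-dataset resolution with deterministic seed behavior" concrete, here is a minimal sketch of the idea using only the Python standard library. It is not dagzoo's implementation: the names resolve_family and sample_noise, the per-dataset seed argument, and the use of scale as the distribution's scale parameter (not its standard deviation) are all assumptions made for illustration.

```python
import math
import random

FAMILIES = ("gaussian", "laplace", "student_t")


def resolve_family(requested: str, weights: dict[str, float], dataset_seed: int) -> str:
    """Hypothetical helper: pick one concrete family for a dataset.

    A non-mixture request is returned unchanged. For "mixture", one family is
    drawn with the given weights from a generator seeded by dataset_seed, so
    the same seed always resolves to the same family.
    """
    if requested != "mixture":
        return requested
    rng = random.Random(dataset_seed)
    return rng.choices(FAMILIES, weights=[weights[f] for f in FAMILIES], k=1)[0]


def sample_noise(family: str, n: int, seed: int, scale: float = 1.0, df: float = 3.0) -> list[float]:
    """Hypothetical helper: draw n zero-centered noise values from one family."""
    rng = random.Random(seed)
    if family == "gaussian":
        return [rng.gauss(0.0, scale) for _ in range(n)]
    if family == "laplace":
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with location 0 and scale `scale`.
        rate = 1.0 / scale
        return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]
    if family == "student_t":
        if df <= 2:
            raise ValueError("df must be > 2 for finite variance")
        # t = Z / sqrt(V / df), with Z standard normal and V chi-square(df),
        # where chi-square(df) is Gamma(shape=df/2, scale=2).
        return [
            scale * rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
            for _ in range(n)
        ]
    raise ValueError(f"unknown family: {family!r}")


if __name__ == "__main__":
    weights = {"gaussian": 0.4, "laplace": 0.3, "student_t": 0.3}  # illustrative weights
    for dataset_seed in range(5):
        family = resolve_family("mixture", weights, dataset_seed)
        noise = sample_noise(family, n=3, seed=dataset_seed)
        print(dataset_seed, family, [round(x, 3) for x in noise])
```

Because both the family choice and the draws depend only on the seed, rerunning the loop reproduces the same families and values, which is the property that makes cross-family comparisons repeatable.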


When to use

Why it matters for your prior

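  • Your synthetic prior currently uses only Gaussian noise, which means every dataset’s residuals have light, symmetric tails. That is a coverage gap for the many real tasks with heavy-tailed noise. (Laplace and Student-t are also symmetric distributions, so these families address tail weight rather than skew.)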
  • You want to measure whether your model’s in-context learning is robust to the shape of noise (tail weight, kurtosis), not just noise magnitude.
  • You are investigating effective diversity and suspect Gaussian-only noise is a meta-feature coverage gap — coverage of weak meta-feature regimes has been shown to improve reliability.
  • You want to ablate the noise axis in isolation: hold mechanism complexity and graph structure constant, change only the noise family, and attribute downstream performance changes to noise distributional effects specifically.

Operational triggers

  • You want heavier-tail stochasticity than the Gaussian default.
  • You need deterministic comparisons across Gaussian/Laplace/Student-t regimes.
  • You want benchmark guardrails for runtime impact and metadata validity.

Supported families

  • gaussian: default Gaussian sampling. Light, symmetric tails — the most benign noise regime. P(|x| > 3σ) ≈ 0.3%.
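  • laplace: double-exponential noise. Sharper peak and heavier tails than Gaussian. At matched standard deviation, P(|x| > 3σ) = exp(-3√2) ≈ 1.4%, about 5x the Gaussian tail mass.
  • student_t: heavy-tailed Student-t. Requires df > 2 (finite variance). At df = 3, tail weight is substantial: P(|x| > 3σ) ≈ 1.4% at matched standard deviation (about 5x Gaussian), or ≈ 5.8% (about 20x Gaussian) if σ is read as the scale parameter instead. The gap over Gaussian widens further out in the tail, because Student-t tails decay polynomially rather than exponentially. At df = 10, behavior approaches Gaussian. Lower df → heavier tails → more extreme outliers.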
  • mixture: per-dataset weighted draw from the above three families. Each dataset in the corpus gets one family (resolved at generation time), so a corpus generated with mixture naturally spans all three noise regimes without separate generate runs.
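
The tail-mass figures for the three families can be checked with closed-form expressions. The sketch below is independent of dagzoo and uses only the standard library; the function names are illustrative. The answer depends on what σ means: the first three calls treat σ as the standard deviation of each distribution, and the last call treats it as the Student-t scale parameter.

```python
import math


def gaussian_tail(k: float) -> float:
    """P(|X| > k*sigma) for a zero-mean Gaussian with standard deviation sigma."""
    return math.erfc(k / math.sqrt(2.0))


def laplace_tail(k: float) -> float:
    """P(|X| > k*sigma) for a zero-mean Laplace with standard deviation sigma.

    A Laplace with scale b has variance 2*b**2, so b = sigma / sqrt(2)
    and P(|X| > x) = exp(-x / b).
    """
    return math.exp(-k * math.sqrt(2.0))


def student_t3_tail(k: float, unit_variance: bool = True) -> float:
    """P(|X| > k*sigma) for a Student-t with df = 3, via its closed-form CDF.

    A standard t with df = 3 has variance df / (df - 2) = 3. With
    unit_variance=True, sigma is the standard deviation; otherwise sigma
    is the scale parameter of the distribution.
    """
    t = k * math.sqrt(3.0) if unit_variance else k
    u = t / math.sqrt(3.0)
    cdf = 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi
    return 2.0 * (1.0 - cdf)


if __name__ == "__main__":
    g = gaussian_tail(3.0)
    print(f"gaussian            {g:.4%}")
    print(f"laplace             {laplace_tail(3.0):.4%}  ({laplace_tail(3.0) / g:.1f}x)")
    print(f"student_t df=3      {student_t3_tail(3.0):.4%}  ({student_t3_tail(3.0) / g:.1f}x)")
    scale_tail = student_t3_tail(3.0, unit_variance=False)
    print(f"student_t df=3 (sigma = scale) {scale_tail:.4%}  ({scale_tail / g:.1f}x)")
```

When comparing families at "the same scale", it is worth stating which convention applies, since the ratio to the Gaussian tail changes by a factor of about four between the two readings for df = 3.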

Preset workflows

Generate smoke datasets for each family:

dagzoo generate --config configs/preset_noise_gaussian_generate_smoke.yaml --num-datasets 25 --out data/run_noise_gaussian
dagzoo generate --config configs/preset_noise_laplace_generate_smoke.yaml --num-datasets 25 --out data/run_noise_laplace
dagzoo generate --config configs/preset_noise_student_t_generate_smoke.yaml --num-datasets 25 --out data/run_noise_student_t
dagzoo generate --config configs/preset_noise_mixture_generate_smoke.yaml --num-datasets 25 --out data/run_noise_mixture

Benchmark guardrail smoke run:

dagzoo benchmark \
  --config configs/preset_noise_benchmark_smoke.yaml \
  --preset custom \
  --suite smoke \
  --no-memory \
  --out-dir benchmarks/results/smoke_noise

What to inspect

  • In-process DatasetBundle.metadata["noise_distribution"]:
    • noise_distribution.family_requested
    • noise_distribution.family_sampled
    • noise_distribution.sampling_strategy
    • noise_distribution.base_scale
    • noise_distribution.student_t_df
    • noise_distribution.mixture_weights (when requested family is mixture)
  • Benchmark summary preset_results[*].scenarios.noise:
    • metadata coverage/validity
    • sampled-family counts
    • runtime delta vs gaussian-noise control

For output details, see output-format.md.