Geometric Adam: A Ray Tracing-Inspired Approach to Neural Network Optimization
Based on the paper: Geometric Adam: Ray Tracing-Inspired Adaptive Optimization by Jaepil Jeong, Cognica Inc.
Introduction
In the pursuit of stable neural network optimization, we often look to mathematics for inspiration. But what if the answer lies in physics? This research essay explores Geometric Adam, a novel optimizer that draws inspiration from ray tracing in computer graphics to address fundamental challenges in deep learning optimization.
The core insight is deceptively simple: just as light refracts when passing through media of different densities, perhaps our optimization algorithm should adapt its behavior based on the local geometry of the loss landscape. This physical analogy led to remarkable empirical results and exposed fundamental questions about how neural networks actually learn.
The Problem: When Standard Optimizers Break
Our journey began with a practical challenge. While training a 29-million parameter transformer on WikiText-2, both Adam and AdamW catastrophically diverged after just 6 epochs. This wasn't a simple hyperparameter tuning issue—the failure was systematic and reproducible across various configurations.
The stark numbers tell the story:
- Adam: Validation perplexity exploded from 417.38 to 786.0
- AdamW: Similar divergence from 404.83 to 423.9
- Training completion rate: Only 20% across multiple attempts
This motivated us to rethink optimization from first principles, leading to an unexpected source of inspiration: the physics of light propagation.
The Core Innovation: Geometric Adaptation
Theoretical Foundation
The key insight comes from treating gradient descent as light propagation through varying media. We introduce three fundamental concepts:
Definition 1 (Gradient Direction): For gradient $g_t$ at step $t$:
$$d_t = \frac{g_t}{|g_t| + \epsilon}$$
Definition 2 (Angular Change): The angular change between consecutive gradients:
$$\theta_t = \arccos(|d_t \cdot d_{t-1}|)$$
Definition 3 (Refraction Coefficient): Inspired by Snell's law:
$$r_t = \exp(-\lambda\theta_t)$$
where $\lambda > 0$ is the refraction sensitivity parameter.
The Geometric Adam Algorithm
The complete update rule becomes:
$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\ \kappa_t &= \gamma\kappa_{t-1} + (1-\gamma)\frac{\theta_t}{|\hat{m}_t| + \epsilon} \\ \hat{m}_t &= \frac{m_t}{(1-\beta_1^t)(1 + \kappa_t r_t)}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_t &= \theta_{t-1} - \alpha\, r_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned}$$
The algorithm combines Adam's momentum-based updates with geometric adaptation through the refraction coefficient $r_t$ and the curvature estimate $\kappa_t$. Note that $\theta$ in the final line denotes the model parameters (as in the convergence analysis below), while the $\theta_t$ inside the curvature update is the angular change from Definition 2.
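To make the update concrete, here is a minimal NumPy sketch of a single step. The paper's $\kappa_t$ and $\hat{m}_t$ reference each other, so this sketch breaks the cycle by reusing the previous step's $\|\hat{m}\|$; that choice, along with all function and variable names, is illustrative rather than the reference implementation's API:

```python
import numpy as np

def geometric_adam_step(params, grad, state, lr=1e-3, betas=(0.9, 0.999),
                        lam=0.1, gamma=0.95, eps=1e-8):
    """One Geometric Adam step on a flat parameter vector (illustrative sketch)."""
    beta1, beta2 = betas
    t = state["t"] = state.get("t", 0) + 1

    # Definition 1: normalized gradient direction
    d = grad / (np.linalg.norm(grad) + eps)

    # Definition 2: angular change between consecutive gradient directions
    d_prev = state.get("d_prev")
    theta = 0.0 if d_prev is None else float(
        np.arccos(np.clip(abs(np.dot(d, d_prev)), 0.0, 1.0)))
    state["d_prev"] = d

    # Definition 3: refraction coefficient -- large angular change => smaller step
    r = np.exp(-lam * theta)

    # Standard Adam first/second moment estimates
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2

    # Curvature proxy: EMA of angular change per unit momentum magnitude.
    # kappa_t and m_hat_t reference each other in the paper; here we break the
    # cycle by reusing the previous step's ||m_hat|| (an assumption).
    m_hat_norm_prev = state.get("m_hat_norm", np.linalg.norm(m))
    kappa = state["kappa"] = (gamma * state.get("kappa", 0.0)
                              + (1 - gamma) * theta / (m_hat_norm_prev + eps))

    # Bias-corrected moments; the momentum is additionally damped by (1 + kappa * r)
    m_hat = m / ((1 - beta1 ** t) * (1 + kappa * r))
    v_hat = v / (1 - beta2 ** t)
    state["m_hat_norm"] = np.linalg.norm(m_hat)

    # Parameter update, scaled by the refraction coefficient
    return params - lr * r * m_hat / (np.sqrt(v_hat) + eps)

# Toy usage on a quadratic bowl: the gradient of ||w||^2 is 2w
w, st = np.array([3.0, -2.0]), {}
for _ in range(200):
    w = geometric_adam_step(w, 2 * w, st, lr=0.05)
print(w)  # settles near the minimum at the origin
```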
Experimental Results: Beyond Expectations
The empirical results exceeded our theoretical predictions:
Primary Metrics (29M Model)
- Training Completion: 100% vs 20% for baselines
- Final Validation Perplexity: 115.6 (59% improvement over best baseline)
- Training Stability: 30 epochs of monotonic improvement
- Computational Overhead: 3.2× (addressable through hardware optimization)
The Angular Regime Discovery
Perhaps the most significant finding was the operating regime of our optimizer:
$$\text{Average Angular Change: } \bar{\theta} = 1.48 \text{ radians } (84.8°)$$
This places us firmly in what we term the "large-angle regime," far outside traditional optimization theory's assumptions of small perturbations.
Theoretical Analysis: When Theory Meets Reality
The Small-Angle Approximation Breakdown
Our initial theory assumed small angular changes, allowing the approximation:
$$\theta \approx \sqrt{2(1-\cos(\theta))}$$
However, at $\theta = 1.48$ rad, this approximation yields:
$$\theta_{\text{approx}} = \sqrt{2(1-\cos(1.48))} \approx 1.35 \text{ rad}$$
This creates a 9% direct error, which compounds to approximately 21% systematic underestimation of curvature due to the quadratic relationship $\kappa \propto \theta^2/\alpha$.
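The error figures can be checked directly with a short script (the quoted 21% compounded figure depends on the paper's exact curvature estimator; the simple quadratic compounding below lands somewhat lower, around 17%):

```python
import math

theta = 1.48                                          # average angular change (rad)
theta_approx = math.sqrt(2 * (1 - math.cos(theta)))   # small-angle (chord-length) value
print(round(theta_approx, 2))                         # 1.35
print(round(1 - theta_approx / theta, 3))             # 0.089 -> the ~9% direct error
print(round(1 - (theta_approx / theta) ** 2, 3))      # 0.17 -> compounded via kappa ~ theta^2
```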
Why Does It Still Work?
The answer lies in understanding optimization as relative change detection rather than absolute measurement. We prove:
Theorem (Robustness to Systematic Bias): Let $\tilde{\kappa}_t = c \cdot \kappa_t$ be a biased curvature estimate with constant multiplicative bias $c \in (0,1)$. The relative curvature signal:
$$\rho_t = \frac{\kappa_t - \kappa_{t-1}}{\kappa_{t-1} + \epsilon}$$
satisfies $\tilde{\rho}_t \approx \rho_t$, where $\tilde{\rho}_t$ is the same quantity computed from $\tilde{\kappa}_t$, provided $\epsilon$ is small relative to $c\,\kappa_{t-1}$.
This explains why our 21% systematic underestimation doesn't prevent effective optimization—the algorithm responds to relative changes in landscape geometry rather than requiring exact curvature values.
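A quick numerical check of the theorem: apply a constant multiplicative bias (here $c = 0.79$, i.e., a 21% underestimate, chosen to match the figure above) to a synthetic curvature trace and compare the relative signals:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-8
kappa = np.abs(np.cumsum(rng.normal(size=100))) + 0.1   # synthetic, strictly positive curvature trace
c = 0.79                                                # constant multiplicative bias (21% underestimate)

def relative_signal(k):
    """rho_t = (kappa_t - kappa_{t-1}) / (kappa_{t-1} + eps)"""
    return (k[1:] - k[:-1]) / (k[:-1] + eps)

# The bias cancels in the ratio: the difference is on the order of eps, not of c.
print(np.max(np.abs(relative_signal(c * kappa) - relative_signal(kappa))))
```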
Scale-Dependent Behavior: A Surprising Discovery
Our experiments across three model scales revealed fascinating scale-dependent properties:
| Model Size | GA Performance vs. Baseline | Stability Gain | Angular Behavior |
|---|---|---|---|
| 29M | 56% better | 24 epochs (5×) | θ̄ = 1.48 ± 0.31 |
| 10M | 15% worse | 37 epochs (3.3×) | θ̄ = 1.47 ± 0.29 |
| 2.5M | 43% worse | 86 epochs (6.1×) | θ̄ = 1.45 ± 0.28 |
The consistency of angular dynamics across scales (p > 0.05 for all pairwise comparisons) suggests that the large-angle regime is a fundamental characteristic of neural network optimization, not an artifact of model size.
The Perplexity Paradox
Our most counterintuitive finding emerged from cross-scale quality analysis. Consider these generated text samples:
10M Adam (PPL=108.95):
"the federal government officially decided to create a minor federal government in 1889-1991..."
2.5M Geometric Adam (PPL=147.77):
"the first leaders having been buried with the [unk]. he said he also possessed the eastern end of the annual sign..."
Despite 4× fewer parameters and 36% worse perplexity, the 2.5M Geometric Adam model produces more semantically coherent text. This challenges our fundamental assumptions about the relationship between perplexity metrics and generation quality.
Convergence Analysis
We establish convergence guarantees under appropriate conditions:
Theorem (Convergence in Strongly Convex Case): For $\mu$-strongly convex and $L$-smooth objectives, Geometric Adam with learning rate $\alpha \leq 1/L$ and refraction sensitivity $\lambda \in (0,2)$ achieves:
$$\mathbb{E}[L(\theta_t) - L(\theta^*)] \leq \rho^t[L(\theta_0) - L(\theta^*)]$$
where $\rho < 1$ depends on the geometric adaptation parameters.
The proof follows standard analysis with effective learning rate $\alpha r_{\min}$, where $r_{\min}$ is the minimum refraction coefficient.
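For intuition, here is the contraction step for plain refraction-scaled gradient descent, ignoring the momentum and second-moment terms that the full proof handles. By the descent lemma with step $\alpha r_t \le 1/L$ and the Polyak-Łojasiewicz inequality implied by $\mu$-strong convexity,

$$\begin{aligned} L(\theta_{t+1}) &\le L(\theta_t) - \tfrac{\alpha r_t}{2}\,\|\nabla L(\theta_t)\|^2 \le L(\theta_t) - \alpha r_t\,\mu\,\bigl[L(\theta_t) - L(\theta^*)\bigr], \\ L(\theta_{t+1}) - L(\theta^*) &\le \bigl(1 - \mu\,\alpha\,r_{\min}\bigr)\bigl[L(\theta_t) - L(\theta^*)\bigr], \end{aligned}$$

so one may take $\rho = 1 - \mu\alpha r_{\min} < 1$ whenever $r_{\min} > 0$.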
Hypotheses and Conjectures
Hypothesis 1: The Large-Angle Advantage
We conjecture that large angular changes in gradient directions are not a bug but a feature of successful optimization. Our hypothesis is that these dramatic directional shifts indicate the optimizer is actively exploring the loss landscape rather than getting trapped in local structures. This suggests that future optimizers should be designed to encourage rather than suppress large angular changes.
Mathematical Conjecture: For non-convex neural network loss landscapes, there exists a critical angle $\theta_c \approx \pi/2$ such that optimizers maintaining $\bar{\theta} > \theta_c$ achieve better global minima than those with $\bar{\theta} < \theta_c$.
Hypothesis 2: Scale-Dependent Phase Transition
The dramatic performance shift between model scales suggests a phase transition in optimization dynamics. We hypothesize that there exists a critical model size $N_c$ where the loss landscape topology fundamentally changes:
$$\text{Optimization Difficulty} \propto \begin{cases} N^{\alpha}, & N < N_c \text{ (smooth regime)} \\ e^{\beta N}, & N > N_c \text{ (chaotic regime)} \end{cases}$$
Our data suggests $N_c$ lies between 2.5M and 10M parameters for transformer architectures.
Hypothesis 3: The Perplexity-Quality Divergence
The perplexity paradox points to a deeper truth about neural network representations. We conjecture that geometric optimization methods create "robust representations" that prioritize semantic consistency over probabilistic accuracy. This leads to our Robust Representation Hypothesis:
Models trained with geometric adaptation develop internal representations that are more resistant to semantic drift, even at the cost of higher perplexity.
This could explain why our 2.5M model generates more coherent text despite worse metrics.
Hypothesis 4: Implicit Regularization Through Refraction
We suspect the refraction mechanism implements a form of implicit regularization that standard optimizers lack. Specifically, we conjecture that:
$$\text{Effective Loss} = L(\theta) + \lambda \int_{\text{trajectory}} |\nabla^2 L|_F , dt$$
where the integral captures the accumulated curvature along the optimization path. This would explain the superior generalization despite training for more epochs.
Hypothesis 5: Hardware-Accelerated Future
Modern GPUs contain specialized ray tracing cores designed for exactly the operations our algorithm requires. We hypothesize that hardware-optimized implementations could reduce our 3.2× overhead to near parity with standard Adam, making geometric methods practical for production use.
Speculation: Future AI accelerators might include dedicated "geometric optimization units" that compute angular changes and refraction coefficients in hardware.
Future Directions: Reflection Mechanisms
Building on our refraction framework, we propose extensions incorporating reflection inspired by computer graphics lighting models:
$$g_t^{\text{refl}} = g_t - 2(g_t \cdot n_t)\,n_t$$
where $n_t$ is the estimated surface normal and $g_t^{\text{refl}}$ is the reflected gradient (written this way rather than as $r_t$ to avoid clashing with the refraction coefficient). This could provide additional mechanisms for escaping saddle points and navigating complex loss landscapes.
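A sketch of the reflected direction in code, treating the surface-normal estimate $n_t$ as given (how to estimate it is left open by this proposal; the function name is illustrative):

```python
import numpy as np

def reflect_gradient(g, n, eps=1e-8):
    """Reflect gradient g about an estimated surface normal n.

    Implements the graphics reflection formula g - 2 (g . n) n;
    n is normalized first so the reflection preserves the gradient's length.
    """
    n = n / (np.linalg.norm(n) + eps)
    return g - 2.0 * np.dot(g, n) * n
```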
Conjecture: Recursive Reflection Depth
We conjecture that allowing recursive reflections up to depth $k$ could exponentially accelerate saddle point escape:
$$T_{\text{escape}} \sim O(\text{poly}(d)/k)$$
compared to $O(\text{poly}(d))$ for standard methods, where $d$ is the dimensionality.
Implications for the Field
1. Rethinking Optimization Theory
The success in the large-angle regime suggests we need new theoretical frameworks that account for dramatic gradient changes rather than assuming small perturbations.
2. Physics-Inspired Algorithm Design
Physical analogies can lead to robust algorithms even without perfect mathematical correspondence. The key is capturing the right qualitative behaviors.
3. Beyond Perplexity Metrics
Our results demonstrate that traditional metrics may not capture what makes models actually useful. This has profound implications for model evaluation and selection.
Additional Hypotheses and Theoretical Speculations
Hypothesis 6: The Universality of Geometric Principles
We speculate that geometric adaptation principles extend beyond neural networks. Similar mechanisms might benefit:
- Reinforcement learning (policy gradient oscillations)
- Evolutionary algorithms (fitness landscape navigation)
- Molecular dynamics simulations (energy surface exploration)
Bold Conjecture: Any optimization problem in high-dimensional spaces with rugged landscapes could benefit from geometric adaptation mechanisms.
Hypothesis 7: Biological Plausibility
The robustness to measurement errors mirrors biological systems. We hypothesize that biological neural networks might employ similar geometric adaptation:
$$\text{Synaptic Learning Rate} \propto \exp(-\lambda \cdot \text{Neural Activity Coherence})$$
This could explain why biological learning is remarkably stable despite noisy environments.
Hypothesis 8: The Optimization Spectrum
We propose that optimization algorithms exist on a spectrum:
Precision ←→ Robustness
- Newton's Method → Quasi-Newton → Adam → Geometric Adam → Stochastic Methods (precision decreases and robustness increases from left to right)
Our work suggests the field has over-indexed on precision at the expense of robustness.
Hypothesis 9: Emergent Behaviors at Scale
The scale-dependent results hint at emergent optimization phenomena. We conjecture that as models approach brain-scale (∼100B parameters), entirely new optimization dynamics might emerge, potentially requiring:
- Hierarchical geometric adaptation
- Multi-scale refraction coefficients
- Quantum-inspired superposition of optimization paths
Hypothesis 10: The Information-Theoretic View
We hypothesize that successful optimization maximizes mutual information between consecutive gradients while minimizing redundancy:
$$\text{Optimization Quality} \propto \frac{I(g_t; g_{t-1})}{\text{Computational Cost}}$$
Geometric Adam might implicitly optimize this ratio through its angular change mechanism.
Practical Considerations
When to Use Geometric Adam
- Large models where stability is paramount
- Tasks where training failures are costly
- Scenarios requiring long training schedules
- When final quality matters more than training speed
Implementation Notes
```python
# Basic usage: import GeometricAdam from the reference implementation (see link below)
optimizer = GeometricAdam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    refraction_sensitivity=0.1,  # lambda: how strongly angular changes shrink the step
    curvature_memory=0.95,       # gamma: EMA decay for the curvature estimate kappa
)
```
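Assuming GeometricAdam follows the standard torch.optim.Optimizer interface (check the repository for the exact API), it drops into an ordinary PyTorch training loop; model, dataloader, and loss_fn below are placeholders:

```python
# Hypothetical training loop; model, dataloader, and loss_fn are assumed to exist.
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```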
Full implementation available at: https://github.com/jaepil/geometric-adam
Conclusion
Geometric Adam demonstrates that successful optimization algorithms don't always require perfect theory. By drawing inspiration from ray tracing and embracing geometric adaptation, we achieved remarkable stability improvements while exposing fundamental questions about how neural networks learn.
The large-angle regime discovery challenges decades of optimization theory built on small-perturbation assumptions. Yet our algorithm thrives in this regime, suggesting that robust geometric principles may be more important than precise mathematical models.
Final Speculations: The Future of Optimization
We close with several provocative conjectures about where this research might lead:
The Multi-Physics Hypothesis: Future optimizers might combine multiple physical analogies:
- Refraction (Geometric Adam) for curvature adaptation
- Diffusion for exploration
- Quantum tunneling for escaping local minima
- Thermodynamics for temperature scheduling
The Consciousness Connection: The large-angle regime might relate to theories of consciousness that emphasize discontinuous state transitions. Could the "aha moments" in human learning correspond to large angular changes in some abstract gradient space?
The Simplicity Principle: Perhaps the most successful optimization algorithms of the future will be those that embrace simplicity and robustness over mathematical sophistication. The fact that a simple exponential refraction mechanism outperforms complex second-order methods suggests we may have been looking in the wrong direction.
As we continue to push the boundaries of model scale and complexity, approaches like Geometric Adam that naturally adapt to loss landscape geometry may become increasingly essential. Sometimes, the path forward requires looking at old problems through new lenses—even if those lenses come from computer graphics rather than optimization textbooks.
The journey from physical intuition to practical algorithm reveals a deeper truth: the most elegant solutions often emerge when we're willing to question fundamental assumptions and follow unexpected analogies to their logical conclusions.
Open Questions for the Community
- Is there a fundamental reason why θ ≈ π/2 appears optimal across scales?
- Can we develop a complete theory of large-angle optimization?
- What other physical phenomena might inspire better optimizers?
- Is the perplexity paradox pointing to a need for entirely new evaluation metrics?
- Could hardware-software co-design unlock orders of magnitude improvements?
The answers to these questions may reshape our understanding of optimization, learning, and perhaps intelligence itself.
This research was conducted using PyTorch and the WikiText-2 dataset on Apple M1 Max hardware. The full paper is available at OSF Preprints. Reproducible experiments and implementation details are available at the GitHub repository.