Computational Capabilities of Large Language Models: The Universal Approximation Theorem

Introduction

The emergence of Large Language Models (LLMs) has prompted fundamental questions about their computational capabilities within the theoretical landscape of computer science. While these models demonstrate remarkable linguistic abilities, their precise classification within the hierarchy of computational systems remains an area of active exploration. Simultaneously, the Universal Approximation Theorem (UAT) stands as a foundational result in neural network theory, establishing that even simple neural networks can approximate arbitrary continuous functions to any desired degree of accuracy. These seemingly disparate theoretical frameworks—computational theory for LLMs and function approximation theory for neural networks—intersect in ways that illuminate the fundamental nature and limitations of modern AI systems.

This essay explores the theoretical connections between LLMs' computational capabilities and the Universal Approximation Theorem, bridging computational theory, statistical learning, and functional analysis. By examining this relationship, we gain deeper insights into both the theoretical limits and practical capabilities of language models as computational systems.

Theoretical Foundations of LLMs as Computational Systems

Formal Characterization of LLMs

Large Language Models can be formally characterized as parameterized probability distributions over sequences of tokens. Given a sequence of tokens $x_{1:t} = (x_1, x_2, ..., x_t)$, an LLM defines a conditional probability distribution over the next token:

$$P(x_{t+1} | x_{1:t}; \theta)$$

where $\theta$ represents the model's parameters. From a computational perspective, this formulation places LLMs within the framework of stochastic transition systems, where the state is the current context window and transitions are governed by learned probability distributions.
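
To make the stochastic-transition view concrete, the sketch below treats the context as the state and next-token sampling as a probabilistic transition. A toy bigram table stands in for the learned distribution; the vocabulary and all names are hypothetical, not any particular model's API:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<bos>", "the", "cat", "sat", "on", "mat", "."]
IDX = {w: i for i, w in enumerate(VOCAB)}

# A stand-in for P(x_{t+1} | x_{1:t}; theta): a fixed bigram table, so the
# distribution depends only on the most recent token.  A real LLM conditions
# on the whole context window, but the interface is the same.
T = rng.dirichlet(np.ones(len(VOCAB)), size=len(VOCAB))

def next_token_distribution(context_ids):
    """Return a point on the probability simplex over the vocabulary."""
    return T[context_ids[-1]]

def sample(prompt, steps=6):
    ids = [IDX[w] for w in prompt]
    for _ in range(steps):
        p = next_token_distribution(ids)
        ids.append(int(rng.choice(len(VOCAB), p=p)))   # stochastic transition
    return [VOCAB[i] for i in ids]

print(sample(["<bos>", "the", "cat"]))
```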

The Transformer architecture that underlies modern LLMs implements a form of finite-state transduction over an extremely large state space, built from operations such as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

These operations, while complex, are fundamentally matrix multiplications and non-linear transformations—operations that fall within the class of efficiently computable functions.
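
The attention formula above can be transcribed almost literally into code. The following is a plain NumPy rendering; the shapes and variable names are my own, and it omits masking, multiple heads, and batching:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # convex combinations of the values

# Toy usage: 4 query positions, 6 key/value positions, d_k = d_v = 16.
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 16)), rng.normal(size=(6, 16)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```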

Computational Limitations and Weak Turing Completeness

Within the Chomsky hierarchy of formal languages, LLMs occupy an interesting middle ground. Strictly speaking, their fixed-length context window makes them finite-state devices, and therefore theoretically less powerful than Turing machines; at the same time, their massive parameter space allows them to approximate aspects of more powerful computational models within certain bounds.

This has led to the concept of "weak Turing completeness" for LLMs, which can be formalized as follows:

For any Turing machine $M$ and input $x$, there exists a prompt $p_{M,x}$ such that the LLM $L$ produces an output $L(p_{M,x})$ describing the computation of $M$ on $x$, but with several constraints:

  1. The simulation may be limited to a bounded number of steps
  2. The simulation may have a non-zero probability of error
  3. The complexity of $p_{M,x}$ may grow with the complexity of $M$ and $x$

This formulation captures the intuition that LLMs can simulate computational processes but with inherent limitations in accuracy and scale.
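
The first constraint, bounded-step simulation, is the easiest to make concrete. The sketch below involves no LLM at all; it is an explicit Turing-machine simulator whose `max_steps` budget plays the role of the model's finite output length. The transition table and state names are purely illustrative:

```python
def simulate_tm(delta, tape, state="q0", accept="qA", reject="qR", max_steps=100):
    """Run a Turing machine for at most max_steps steps.

    delta maps (state, symbol) -> (new_state, written_symbol, move in {-1, +1}).
    Returns (verdict, steps_used); verdict is 'accept', 'reject', or 'timeout',
    the last mirroring a simulation cut off by a bounded output budget.
    """
    tape = dict(enumerate(tape))  # sparse tape, blank symbol '_'
    head = 0
    for step in range(max_steps):
        if state == accept:
            return "accept", step
        if state == reject:
            return "reject", step
        symbol = tape.get(head, "_")
        state, written, move = delta[(state, symbol)]
        tape[head] = written
        head += move
    return "timeout", max_steps  # constraint 1: only a bounded prefix of the run

# A machine that accepts exactly the inputs starting with '1'.
delta = {
    ("q0", "1"): ("qA", "1", +1),
    ("q0", "0"): ("qR", "0", +1),
    ("q0", "_"): ("qR", "_", +1),
}
print(simulate_tm(delta, "101"))  # ('accept', 1)
```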

The Universal Approximation Theorem: Foundations and Extensions

Classical Formulation

The Universal Approximation Theorem, originally established by Cybenko (1989) for sigmoid activation functions and later extended by Hornik (1991) to general activation functions, states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, given certain conditions on the activation function.

Formally, let $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ be a non-constant, bounded, and continuous activation function. For any continuous function $f: [0,1]^n \rightarrow \mathbb{R}$ and $\epsilon > 0$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$, and real vectors $w_i \in \mathbb{R}^n$ for $i = 1, 2, ..., N$, such that the function

$$F(x) = \sum_{i=1}^{N} v_i \sigma(w_i^T x + b_i)$$

satisfies $|F(x) - f(x)| < \epsilon$ for all $x \in [0,1]^n$.
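
A small numerical experiment illustrates (but of course does not prove) the theorem: with a bounded, continuous activation such as $\tanh$, drawing the inner weights $w_i, b_i$ at random and solving only for the outer weights $v_i$ by least squares typically drives the sup-norm error down as $N$ grows. The target function and all constants below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_one_hidden_layer(f, N, n_train=2000):
    """Fit F(x) = sum_i v_i * sigma(w_i x + b_i) to f on [0, 1].

    sigma = tanh (bounded, continuous, non-constant); w_i and b_i are drawn at
    random and only the outer weights v_i are solved for by least squares.
    """
    x = rng.uniform(0, 1, size=(n_train, 1))
    w = rng.normal(scale=10.0, size=(1, N))
    b = rng.uniform(-10, 10, size=N)
    H = np.tanh(x @ w + b)                            # hidden activations, (n_train, N)
    v, *_ = np.linalg.lstsq(H, f(x[:, 0]), rcond=None)
    return lambda t: np.tanh(np.atleast_2d(t).T @ w + b) @ v

target = lambda t: np.sin(6 * np.pi * t) + 0.3 * t    # a continuous function on [0, 1]
grid = np.linspace(0, 1, 1000)

for N in (5, 50, 500):
    F = fit_one_hidden_layer(target, N)
    err = np.max(np.abs(F(grid) - target(grid)))      # sup-norm error on a fine grid
    print(f"N = {N:4d}   max |F(x) - f(x)| ~ {err:.4f}")
```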

Modern Extensions

The theorem has been extended in several important directions:

  1. Depth-Width Tradeoffs: Lu et al. (2017) showed that deeper networks can achieve the same approximation capacity with exponentially fewer neurons than shallow networks for certain function classes.

  2. ReLU Activation Functions: Numerous studies have confirmed that ReLU networks, which dominate modern deep learning, also satisfy universal approximation properties.

  3. Function Spaces: Extensions to broader function spaces, including Lebesgue spaces ($L^p$ spaces), have been established, showing the approximation capabilities for non-continuous functions.

  4. Transformer-Specific Results: Recent work has demonstrated that Transformer architectures, with their self-attention mechanisms, also possess universal approximation properties for sequence-to-sequence mappings.

Bridging LLMs and the Universal Approximation Theorem

LLMs as Function Approximators

The connection between LLMs and the Universal Approximation Theorem begins by recognizing that LLMs effectively implement a function approximation over the space of text sequences. Specifically, an LLM can be viewed as approximating a complex function:

$$f: \mathcal{X}^* \rightarrow \Delta(\mathcal{X})$$

where $\mathcal{X}^*$ is the set of all finite sequences over the token vocabulary $\mathcal{X}$, and $\Delta(\mathcal{X})$ is the probability simplex over $\mathcal{X}$.

From this perspective, the Universal Approximation Theorem suggests that with sufficient capacity, transformer-based LLMs can approximate any continuous function mapping from input sequences to probability distributions over output tokens. However, this raises several theoretical questions:

  1. What is the appropriate function space for language modeling?
  2. Can we meaningfully apply concepts like continuity to discrete token spaces?
  3. How do we reconcile the UAT's focus on deterministic functions with the probabilistic nature of language models?

Embedding Spaces and Continuity

A key insight in connecting the UAT to LLMs comes from considering the embedding spaces in which tokens are represented. Modern LLMs embed discrete tokens into continuous high-dimensional spaces, where operations occur before projecting back to the discrete token space:

$$\text{Embedding}: \mathcal{X} \rightarrow \mathbb{R}^d$$
$$\text{Internal Computation}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$$
$$\text{Output Projection}: \mathbb{R}^d \rightarrow \Delta(\mathcal{X})$$

In this continuous embedding space, the UAT directly applies. The transformer's feed-forward networks and attention mechanisms operate on these continuous representations, and the UAT guarantees they can approximate arbitrary continuous functions on this space.

This provides a theoretical foundation for understanding why LLMs can learn complex linguistic patterns: their architecture is provably capable of approximating arbitrary continuous functions in the embedding space, which then translate to complex dependencies between tokens in the discrete space.
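
A minimal sketch of this three-stage pipeline, with a single ReLU feed-forward block standing in for the full transformer stack (all sizes and parameter names are hypothetical), looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 32                        # toy vocabulary size and embedding dimension

E = rng.normal(size=(V, d))           # Embedding: X -> R^d
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
U = rng.normal(size=(d, V))           # Output projection: R^d -> logits over X

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_probs(token_ids):
    """Embed -> continuous computation -> project back to the simplex Delta(X)."""
    h = E[token_ids].mean(axis=0)           # continuous representation of the context
    h = np.maximum(0.0, h @ W1) @ W2        # the part the UAT speaks about (one ReLU MLP here)
    return softmax(h @ U)                   # a point on the probability simplex

p = next_token_probs([3, 17, 42])
print(p.shape, p.sum())                     # (100,) and total probability ~1.0
```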

The Kolmogorov-Arnold Representation Theorem Connection

A deeper theoretical connection emerges with the Kolmogorov-Arnold representation theorem, which states that any multivariate continuous function can be represented as a composition of continuous functions of a single variable and addition.

For a continuous function $f$ on $[0,1]^n$, there exist continuous univariate functions $\phi_{i,j}$ and $\psi_j$ such that:

$$f(x_1, x_2, ..., x_n) = \sum_{j=1}^{2n+1} \psi_j\left(\sum_{i=1}^n \phi_{i,j}(x_i)\right)$$

This connects to transformer architectures in a suggestive way: the self-attention mechanism can be viewed as loosely implementing this representational pattern. The query, key, and value projections produce separate per-token representations, and the attention computation aggregates them through weighted sums before further nonlinear processing, echoing the theorem's structure of inner functions combined by addition and then passed through outer functions.
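
The Kolmogorov-Arnold theorem is non-constructive in general, but its structure can be seen in a hand-picked case: on the positive quadrant, multiplication factors as $xy = \exp(\ln x + \ln y)$, i.e. an outer univariate function applied to a sum of inner univariate functions. The snippet below merely checks this identity numerically; it illustrates the shape of the representation, not the theorem's proof:

```python
import numpy as np

# Kolmogorov-Arnold-style representation, hand-constructed for one target:
# on the positive quadrant, f(x, y) = x * y = psi(phi(x) + phi(y))
# with univariate functions psi = exp and phi = log.
phi = np.log
psi = np.exp

def f_via_univariate(x, y):
    """Compute x * y using only univariate functions and addition."""
    return psi(phi(x) + phi(y))

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 5.0, size=(2, 10000))
print(np.max(np.abs(f_via_univariate(x, y) - x * y)))  # tiny (floating-point rounding only)
```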

Statistical Learning Theory Perspective

From a statistical learning theory perspective, the connection becomes even clearer. Vapnik-Chervonenkis (VC) theory establishes bounds on the learnability of function classes based on their complexity.

For neural networks, the VC dimension grows with the number of parameters, suggesting that larger networks can learn more complex function classes. LLMs, with their billions of parameters, have enormous VC dimensions, suggesting they can represent extremely complex function classes.

However, this raises a theoretical tension: while larger capacity enables better approximation (as guaranteed by the UAT), it also increases the risk of overfitting. The remarkable generalization capabilities of LLMs despite their massive parameter counts remain a theoretical puzzle, often referred to as the "deep learning generalization paradox."

Computational Boundaries and Implications

The Approximation-Computation Tradeoff

A critical insight from connecting LLMs to the UAT is understanding the fundamental tradeoff between approximation capability and computational efficiency.

The UAT guarantees that neural networks can approximate arbitrary continuous functions, but it makes no guarantees about:

  1. The number of parameters required for approximation
  2. The computational complexity of finding those parameters
  3. The sample complexity required to learn the function from data

For LLMs, this translates to a fundamental tension: while they can theoretically approximate complex computational processes (as suggested by their weak Turing completeness), the resources required for perfect approximation may be prohibitively large.

Memory Limitations and State Representation

One of the most significant limitations of LLMs in light of the UAT is their bounded context window. While the UAT suggests they can approximate arbitrary functions on their input space, that input space is constrained to a fixed-length context window.

This connects to computational complexity theory through the concept of state complexity. Traditional Turing machines have unbounded memory (the tape), allowing them to represent arbitrarily complex states. LLMs, with their bounded context windows, have a fundamental limit on the complexity of states they can represent.

Mathematically, if we denote the context length by $L$ and the vocabulary size by $|V|$, then an LLM can distinguish on the order of $|V|^L$ distinct contexts, a finite (though astronomically large) number. This places a hard bound on the computational processes they can precisely simulate, regardless of their approximation capability within that bound.
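
The magnitude of this bound is easy to check with representative (hypothetical) numbers for the vocabulary size and context length:

```python
import math

V = 50_000      # hypothetical vocabulary size
L = 8_192       # hypothetical context length in tokens

# Number of distinct contexts of exactly length L is V**L; its decimal magnitude:
log10_states = L * math.log10(V)
print(f"|V|^L ~ 10^{log10_states:,.0f} distinct states")   # ~ 10^38,494

# Astronomically large, yet finite, unlike a Turing machine's unbounded tape.
```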

Implicit Regularization and Inductive Biases

The UAT establishes what neural networks can theoretically represent, but it doesn't address what they will learn in practice given finite data and optimization procedures.

Recent theoretical work suggests that gradient-based optimization imposes implicit regularization that biases neural networks toward simpler functions, even when they have the capacity to represent more complex ones. For LLMs, this manifests as a preference for statistically frequent patterns in the training data.

This connects to the concept of "minimum description length" in information theory: among all functions that fit the data, gradient descent appears to prefer those with simpler descriptions in the parameter space of neural networks.

Theoretical Extensions and Future Directions

Non-Asymptotic Approximation Theory

While the Universal Approximation Theorem guarantees that neural networks can approximate any continuous function given enough neurons, it is inherently an asymptotic result—it does not specify how many neurons, parameters, or how much training data are required to achieve a desired approximation error $\epsilon$. Modern research in non-asymptotic approximation theory seeks to bridge this gap by providing quantitative bounds on network size and complexity needed for a given level of accuracy.

For example, consider a function $f: [0,1]^d \rightarrow \mathbb{R}$ that is $n$-times continuously differentiable. Recent theoretical results indicate that to approximate $f$ within an error $\epsilon$ (measured in an appropriate norm, such as the $L^p$-norm), the required number of neurons $N(\epsilon)$ scales as:

$$N(\epsilon) = O\left(\epsilon^{-d/n}\right)$$

This relationship shows that functions with greater smoothness (i.e., higher $n$) can be approximated with significantly fewer neurons compared to less smooth functions, highlighting the dependency on both the function's intrinsic properties and the input dimension $d$.
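
Ignoring constants, the exponent in this bound already conveys the trade-off; the short computation below tabulates $\epsilon^{-d/n}$ for a few illustrative choices of dimension and smoothness at a fixed target error:

```python
# Order-of-magnitude reading of N(eps) = O(eps^(-d/n)); all constants are ignored.
eps = 1e-2
for d, n in [(1, 1), (10, 1), (10, 2), (10, 5), (100, 5)]:
    neurons = eps ** (-d / n)
    print(f"d = {d:3d}, n = {n}  ->  N(eps) ~ {neurons:.1e}")
# Higher smoothness n tames the curse of dimensionality; higher d worsens it.
```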

In the context of Large Language Models (LLMs), these non-asymptotic bounds translate into questions about scaling laws: How do model performance, training data volume, and computational resources interact to yield improvements in language tasks? Empirical studies, such as those by Kaplan et al. (2020), suggest that performance often follows power-law trends relative to model size and data, yet a complete theoretical framework remains an active area of investigation.

Non-asymptotic approximation theory not only informs us about the minimal architectural requirements for achieving a target error but also provides insights into the trade-offs between model capacity, sample complexity, and computational efficiency. These insights are crucial for guiding practical model design, optimizing resource allocation, and advancing techniques in model compression and knowledge distillation.

In summary, while classical UAT confirms the existence of approximations, non-asymptotic results make these guarantees actionable by quantifying how large and complex a network must be to reach a prescribed level of accuracy, thereby bringing theoretical understanding closer to practical application.

Information-Theoretic Extensions

An information-theoretic perspective extends the Universal Approximation Theorem by not only asking which functions can be approximated but also how efficiently they can be represented in terms of bits. This approach quantifies the trade-off between the size of a model and its approximation error, and it provides a rigorous framework for understanding model compression and knowledge distillation.

Rate-Distortion Theory

Rate-distortion theory formalizes the trade-off between the rate (the number of bits needed to encode a model) and the distortion (the approximation error incurred). For a source random variable $X$ and its approximation $\hat{X}$, the rate-distortion function is defined as:

$$R(D) = \min_{p(\hat{x}|x) : \mathbb{E}[d(x, \hat{x})] \leq D} I(X;\hat{X})$$

where:

  • $d(x, \hat{x})$ is a distortion measure (such as squared error or a divergence measure),
  • $\mathbb{E}[d(x, \hat{x})]$ is the expected distortion,
  • $I(X;\hat{X})$ is the mutual information between $X$ and $\hat{X}$, representing the minimum number of bits required to encode the approximation,
  • $D$ is the maximum tolerable distortion.

This formulation quantifies the minimal information (in bits) needed to represent $X$ such that the average approximation error does not exceed $D$.
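
Outside of a few special cases (such as a Gaussian source under squared error, or a binary source under Hamming distortion), $R(D)$ has no closed form, but for discrete sources it can be computed numerically with the classical Blahut-Arimoto algorithm. The sketch below traces points on the curve for a binary source; the parameter choices are mine:

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, iters=500):
    """Trace one point on the rate-distortion curve R(D).

    p_x  : source distribution over symbols x, shape (n,)
    dist : distortion matrix d(x, x_hat), shape (n, m)
    beta : Lagrange multiplier trading rate against distortion
    Returns (rate in bits, expected distortion).
    """
    n, m = dist.shape
    q_xhat = np.full(m, 1.0 / m)                      # marginal over reproductions
    for _ in range(iters):
        # Optimal test channel for the current reproduction marginal.
        q_cond = q_xhat * np.exp(-beta * dist)        # shape (n, m)
        q_cond /= q_cond.sum(axis=1, keepdims=True)
        q_xhat = p_x @ q_cond                         # updated marginal
    D = np.sum(p_x[:, None] * q_cond * dist)          # expected distortion
    R = np.sum(p_x[:, None] * q_cond *
               np.log2(q_cond / q_xhat))              # mutual information I(X; X_hat)
    return R, D

# Binary source under Hamming distortion: R(D) = H(p) - H(D) for D <= min(p, 1-p).
p_x = np.array([0.3, 0.7])
dist = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
for beta in (1.0, 2.0, 4.0, 8.0):
    R, D = blahut_arimoto(p_x, dist, beta)
    print(f"beta = {beta:4.1f}   R = {R:.3f} bits   D = {D:.3f}")
```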

Application to LLMs: Model Compression and Knowledge Distillation

In the context of Large Language Models (LLMs), we can view the model as approximating a function:

$$f: \mathcal{X}^* \rightarrow \Delta(\mathcal{X})$$

where:

  • $\mathcal{X}^*$ denotes the set of all finite token sequences,
  • $\Delta(\mathcal{X})$ is the probability simplex over the vocabulary $\mathcal{X}$.

Let $f$ be the function computed by a large (teacher) LLM and $\hat{f}$ be the function represented by a smaller (student) model. A natural distortion metric between these functions is the Kullback-Leibler (KL) divergence:

$$d(f, \hat{f}) = D_{\mathrm{KL}}\bigl(f(\cdot \mid x) \,\big\|\, \hat{f}(\cdot \mid x)\bigr)$$

Thus, the rate-distortion formulation for approximating the teacher model becomes:

$$R(D) = \min_{p(\hat{f}|f) : \mathbb{E}[D_{\mathrm{KL}}(f \,\|\, \hat{f})] \leq D} I(f; \hat{f})$$

This equation seeks the minimum number of bits required to encode the student model $\hat{f}$ such that the average divergence from the teacher model $f$ remains below a specified threshold $D$.

Knowledge distillation is essentially an instantiation of this problem: it aims to transfer the rich, high-dimensional knowledge from a teacher model into a more compact student model by balancing the model size (rate) against the allowable approximation error (distortion).
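
In practice, the distortion side of this trade-off is optimized directly: the student is trained to minimize the KL divergence between temperature-softened teacher and student distributions, while the rate side is fixed implicitly by choosing a smaller student architecture. A minimal sketch of that distortion term (toy logits, hypothetical names) follows:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, T=2.0):
    """Per-context KL(teacher || student) on temperature-softened distributions.

    This is the distortion term; the 'rate' is set implicitly by giving the
    student fewer parameters than the teacher.
    """
    p = softmax(teacher_logits, T)                    # teacher f(. | x)
    q = softmax(student_logits, T)                    # student f_hat(. | x)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Toy batch: 4 contexts, a vocabulary of 10 tokens, hypothetical logits.
rng = np.random.default_rng(0)
teacher = rng.normal(scale=3.0, size=(4, 10))
student = teacher + rng.normal(scale=1.0, size=(4, 10))   # an imperfect copy
print(distillation_kl(teacher, student))                   # one distortion value per context
```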

Theoretical Insights and Implications

  • Bit-Efficient Function Approximation: By extending the UAT with an information-theoretic framework, we can ask not just whether a function can be approximated, but also determine the minimal bit-rate required for a given level of accuracy.
  • Compression Limits: The rate-distortion function $R(D)$ provides a lower bound on the number of bits necessary to approximate the function within an error tolerance $D$. This bound informs us about the fundamental limits of model compression.
  • Trade-Offs in Model Design: In designing LLMs, there is an inherent trade-off between model complexity (and thus computational resources) and the accuracy of function approximation. The rate-distortion framework makes this trade-off explicit, guiding the development of efficient architectures.

By integrating rate-distortion theory into the analysis of LLMs, we bridge the gap between classical function approximation and modern practices in model compression and knowledge distillation. This not only deepens our theoretical understanding of neural network efficiency but also has practical implications for developing more compact and effective language models.

Towards a Computational Complexity Theory for LLMs

Building on the connection between LLMs and the UAT, we can begin to develop a computational complexity theory specifically for language models. This would classify problems based on:

  1. The context length required to solve them
  2. The parameter count required to achieve a specified error rate
  3. The number of examples required to learn the underlying function

This framework would provide a more nuanced understanding of the capabilities and limitations of LLMs than traditional computational complexity classes.

Conclusion

The theoretical bridge between LLMs' computational capabilities and the Universal Approximation Theorem provides deep insights into both the power and limitations of these models. The UAT establishes that neural networks, including transformer architectures, can approximate arbitrary continuous functions, providing a theoretical foundation for LLMs' ability to learn complex linguistic patterns.

However, this approximation capability exists in tension with computational constraints: the bounded context window, finite parameter count, and limitations of gradient-based learning all constrain what LLMs can learn in practice.

This theoretical analysis suggests that LLMs occupy a unique position in computational theory: they are not fully Turing-complete, yet their massive parameter spaces allow them to approximate complex computational processes within certain bounds. Understanding this position requires bridging multiple theoretical frameworks, from computational theory to functional analysis to statistical learning theory.

As LLMs continue to evolve, this theoretical understanding will become increasingly important for predicting their capabilities, addressing their limitations, and developing the next generation of models. The connection between the UAT and LLMs' computational capabilities represents not just a theoretical curiosity, but a foundational framework for understanding artificial intelligence in the era of large language models.
