From Bayesian Inference to Neural Computation: The Inevitable Emergence of Neurons in Probabilistic Search
"The theory of probabilities is at bottom nothing but common sense reduced to calculus."
— Pierre-Simon Laplace, Théorie analytique des probabilités, 1812
1. Introduction
In a previous essay, we explored how BM25 scores can be transformed into calibrated probability estimates through a sigmoid-based Bayesian framework, and how the parameters of this transformation can be progressively estimated without labeled data. We showed that the sigmoid function is not an arbitrary engineering choice but a mathematical necessity — the inevitable consequence of applying Bayes' theorem under reasonable distributional assumptions.
In our recent paper, Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search, we formalized this framework and extended it to hybrid search, enabling principled combination of lexical and semantic signals through probabilistic score fusion.
This essay takes the next step. We show that when multiple calibrated probability signals are combined through Bayesian reasoning, the resulting computational structure is not merely analogous to a neural network — it is one. No one designs a neuron. The mathematics demands it.
2. Recap: From Scores to Probabilities
2.1 The Bayesian Derivation of Sigmoid
We briefly recall the central result. Starting from Bayes' theorem for document relevance $R$ given an observed BM25 score $s$:
$$P(R|s) = \frac{P(s|R) \cdot P(R)}{P(s|R) \cdot P(R) + P(s|\neg R) \cdot P(\neg R)}$$
Under the symmetric likelihood assumption $P(s|\neg R) = 1 - P(s|R)$ with a parametric sigmoid likelihood model, the posterior simplifies to:
$$P(R|s) = \sigma(\alpha(s - \beta)) = \frac{1}{1 + \exp(-\alpha(s - \beta))}$$
where $\alpha$ controls the steepness and $\beta$ controls the decision boundary.
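In code, the calibration step is only a few lines. The sketch below is illustrative (the function name and parameter values are ours, not the estimates from the paper); it simply applies the derived form directly:

```python
import math

def calibrate(score: float, alpha: float, beta: float) -> float:
    """P(R|s) = sigma(alpha * (s - beta)): map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (score - beta)))

# Illustrative values only: a BM25 score of 12.0, steepness 0.5, decision boundary 8.0.
calibrate(12.0, alpha=0.5, beta=8.0)   # ~0.88
```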
The crucial observation — emphasized in both the previous essay and the paper — is that the sigmoid was not chosen; it was derived. We began with a question about probability and arrived at a specific functional form through algebraic necessity. This distinction proves essential for everything that follows.
2.2 Calibrated Probability Space
After this transformation, every scoring signal — whether from BM25, vector similarity, or any other retrieval mechanism — inhabits the same space: $[0, 1]$, with a consistent probabilistic interpretation. A score of 0.9 means "90% confidence that this document is relevant according to this signal." This uniformity is the foundation upon which multi-signal combination becomes mathematically principled.
3. Combining Signals: The Product Rule and Its Failure
3.1 Independence-Based Conjunction
Given $n$ independent relevance signals $P_1, P_2, \ldots, P_n$, the standard probabilistic conjunction under independence is:
$$P_{\text{AND}} = \prod_{i=1}^{n} P_i$$
This is the formula presented in our Bayesian BM25 paper (Section 5.1), and it is theoretically correct under the stated assumptions.
3.2 The Shrinkage Problem
However, this formulation suffers from a fundamental deficiency.
Theorem 3.2.1 (Conjunction Shrinkage). For $n$ signals each reporting probability $p \in (0, 1)$:
$$\prod_{i=1}^{n} p = p^n \to 0 \quad \text{as } n \to \infty$$
Two signals both reporting 0.7 yield $0.7 \times 0.7 = 0.49$. Three yield $0.7^3 = 0.343$. As we add more signals — each agreeing that a document is relevant — the combined probability decreases.
This violates a basic intuition: when multiple independent sources of evidence agree, confidence should increase, not decrease. The product rule answers "what is the probability that all conditions are simultaneously satisfied?" But what a search system needs is "how confident should we be given that multiple signals concur?" These are different questions.
3.3 Information-Theoretic View
From an information-theoretic perspective, the product rule discards a critical piece of information: the agreement itself. When $n$ independent signals all report high relevance, mutual agreement constitutes additional evidence that the product formula cannot represent. The product treats each signal as a filter rather than as corroborating testimony.
4. Log-Odds Conjunction: Evidence Accumulation
We now present a formulation that respects both the mathematics of probability and the human intuition of evidence accumulation.
4.1 Geometric Mean as Scale-Invariant Aggregation
The first step is to replace the product with the geometric mean, which neutralizes the dependence on signal count:
$$\bar{P} = \left(\prod_{i=1}^{n} P_i\right)^{1/n}$$
Proposition 4.1.1 (Scale Neutrality). If $P_i = p$ for all $i$, then $\bar{P} = p$ regardless of $n$.
This is immediate from $\bar{P} = (p^n)^{1/n} = p$. The geometric mean preserves the "average confidence" of the constituent signals without penalizing for their number. In practice, it is computed in log-space for numerical stability:
$$\bar{P} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \log P_i\right)$$
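A minimal sketch of this log-space aggregation (assuming every $P_i$ is strictly positive):

```python
import math

def geometric_mean(probs: list[float]) -> float:
    """Geometric mean computed in log-space for numerical stability (assumes p > 0)."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

geometric_mean([0.7, 0.7, 0.7])   # 0.7, regardless of how many signals agree at 0.7
```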
4.2 The Log-Odds Space
We now move to the log-odds (logit) space, where probabilities become unbounded real numbers amenable to additive operations:
$$\text{logit}(\bar{P}) = \log \frac{\bar{P}}{1 - \bar{P}}$$
The logit function is the canonical link function for Bernoulli distributions in the exponential family. It maps $(0, 1) \to \mathbb{R}$, converting multiplicative probability relationships into additive ones.
4.3 Conjunction Bonus
In the log-odds space, we apply an additive bonus that reflects the evidential value of multi-signal agreement:
$$\ell_{\text{adjusted}} = \text{logit}(\bar{P}) + \alpha \cdot \log n$$
where $\alpha = 0.5$ is a scaling constant.
Theorem 4.3.1 (Odds Interpretation). The conjunction bonus $\alpha \cdot \log n$ in log-odds space is equivalent to multiplying the odds by $n^{\alpha}$:
$$\frac{P_{\text{final}}}{1 - P_{\text{final}}} = \frac{\bar{P}}{1 - \bar{P}} \cdot n^{\alpha}$$
Proof. Taking the exponential of both sides of the log-odds equation:
$$\exp(\ell_{\text{adjusted}}) = \exp(\text{logit}(\bar{P})) \cdot \exp(\alpha \log n) = \frac{\bar{P}}{1 - \bar{P}} \cdot n^{\alpha} \quad \square$$
With $\alpha = 0.5$, the odds are multiplied by $\sqrt{n}$. Two agreeing signals boost the odds by $\sqrt{2} \approx 1.41$; three by $\sqrt{3} \approx 1.73$; ten by $\sqrt{10} \approx 3.16$.
4.4 Return to Probability Space
The final step is the inverse logit — the sigmoid function — which maps the adjusted log-odds back to $[0, 1]$:
$$P_{\text{final}} = \sigma(\ell_{\text{adjusted}}) = \frac{1}{1 + \exp(-\ell_{\text{adjusted}})}$$
Proposition 4.4.1 (Identity for Single Signals). When $n = 1$, the log-odds bonus vanishes ($\alpha \cdot \log 1 = 0$), so $P_{\text{final}} = P_1$. The transformation is transparent for single signals.
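Sections 4.1 through 4.4 assemble into a short routine. The sketch below (the function name is ours, and it assumes every input probability lies strictly between 0 and 1) computes the geometric mean in log-space, applies the conjunction bonus in log-odds space, and maps back through the sigmoid:

```python
import math

def log_odds_conjunction(probs: list[float], alpha: float = 0.5) -> float:
    """Combine calibrated probabilities: geometric mean -> logit -> + alpha*log(n) -> sigmoid."""
    n = len(probs)
    # Geometric mean in log-space (assumes 0 < p < 1 for every signal).
    p_bar = math.exp(sum(math.log(p) for p in probs) / n)
    # Log-odds of the aggregate, plus the conjunction bonus alpha * log(n).
    ell = math.log(p_bar / (1.0 - p_bar)) + alpha * math.log(n)
    # Inverse logit back to probability space.
    return 1.0 / (1.0 + math.exp(-ell))

log_odds_conjunction([0.7, 0.7])   # ~0.77: agreement boosts the aggregate
log_odds_conjunction([0.9])        # 0.9: the bonus vanishes when n = 1
```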
4.5 The $\sqrt{n}$ Scaling and Its Statistical Justification
The choice $\alpha = 0.5$ is not arbitrary. In classical statistics, when combining $n$ independent measurements, the standard error decreases proportionally to $1/\sqrt{n}$, and the corresponding confidence increases proportionally to $\sqrt{n}$. The conjunction bonus embeds this same principle in the odds domain: $n$ agreeing signals increase the odds by $\sqrt{n}$, representing a conservative Bayesian update where agreement is treated as repeated observation of a latent relevance variable.
4.6 Numerical Behavior
The following table illustrates the difference between the product rule and the log-odds conjunction for two signals:
| $P_{\text{text}}$ | $P_{\text{vec}}$ | Product | Log-Odds ($\alpha = 0.5$) | Interpretation |
|---|---|---|---|---|
| 0.9 | 0.9 | 0.81 | 0.93 | Strong agreement → boosted |
| 0.7 | 0.7 | 0.49 | 0.77 | Moderate agreement → preserved and boosted |
| 0.7 | 0.3 | 0.21 | 0.54 | Disagreement → near neutral |
| 0.3 | 0.3 | 0.09 | 0.38 | Agreement on irrelevance → still low |
The log-odds conjunction preserves the intuitive properties: agreement amplifies, disagreement moderates, and irrelevance is maintained.
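The rows above can be reproduced in a few self-contained lines (rounded to two decimals; the helper below simply restates both formulas for the two-signal case):

```python
import math

def product_vs_logodds(p1: float, p2: float, alpha: float = 0.5) -> tuple[float, float]:
    """Return (product rule, log-odds conjunction) for two calibrated signals."""
    p_bar = math.sqrt(p1 * p2)                                   # geometric mean, n = 2
    ell = math.log(p_bar / (1.0 - p_bar)) + alpha * math.log(2)  # logit plus bonus
    return p1 * p2, 1.0 / (1.0 + math.exp(-ell))

for pair in [(0.9, 0.9), (0.7, 0.7), (0.7, 0.3), (0.3, 0.3)]:
    print(pair, tuple(round(x, 2) for x in product_vs_logodds(*pair)))
```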
5. The Emergence of the Neuron
We now arrive at the central observation of this essay. Let us trace the complete computational path for a single document in a hybrid search query.
5.1 The Full Pipeline
Layer 1 — Calibration: Each raw score $s_i$ passes through a sigmoid to produce a calibrated probability:
$$P_i = \sigma(\alpha_i(s_i - \beta_i)) = \frac{1}{1 + \exp(-\alpha_i(s_i - \beta_i))}$$
Aggregation — Geometric mean in log-space: The calibrated probabilities are averaged:
$$\bar{P} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \log P_i\right)$$
Layer 2 — Log-odds conjunction: The aggregated probability is transformed to log-odds, a linear bias is added, and the result passes through a sigmoid:
$$P_{\text{final}} = \sigma\left(\text{logit}(\bar{P}) + \alpha \cdot \log n\right)$$
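Read end to end, the pipeline is only a few lines of code. The sketch below traces one document through both layers; the calibration parameters and the two raw scores (a BM25 score and a cosine similarity) are illustrative, not values from the paper:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def hybrid_relevance(raw_scores: list[float],
                     calib_params: list[tuple[float, float]],
                     bonus_alpha: float = 0.5) -> float:
    """Layer 1: per-signal sigmoid calibration.
    Aggregation: geometric mean of the calibrated probabilities, in log-space.
    Layer 2: conjunction bonus in log-odds space, then a second sigmoid."""
    probs = [sigmoid(a * (s - b)) for s, (a, b) in zip(raw_scores, calib_params)]
    n = len(probs)
    p_bar = math.exp(sum(math.log(p) for p in probs) / n)
    return sigmoid(logit(p_bar) + bonus_alpha * math.log(n))

# Illustrative: BM25 score 12.0 and cosine similarity 0.82, each with its own (alpha, beta).
hybrid_relevance([12.0, 0.82], [(0.5, 8.0), (10.0, 0.6)])   # ~0.92
```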
5.2 Reading the Structure
Now, consider what we have constructed:
- Input signals $s_1, s_2, \ldots, s_n$ (raw BM25 scores, vector similarities)
- First nonlinear transformation — sigmoid activation for each signal
- Linear combination in log-space — weighted sum of log-probabilities (the geometric mean), followed by an additive bias in log-odds space
- Second nonlinear transformation — sigmoid activation producing the output
This is a two-layer neural network. The first sigmoid layer serves as input normalization (calibration). The log-space aggregation with additive bias constitutes a hidden linear transformation. The second sigmoid layer produces the output activation.
No one designed this structure to resemble a neural network. We began with a purely probabilistic question — "how should we estimate the probability that a document is relevant given multiple evidence signals?" — and followed the mathematics. The neuron emerged.
5.3 Formal Correspondence
To make the correspondence explicit, recall the canonical form of a single neuron:
$$y = \sigma\left(\sum_{i} w_i x_i + b\right)$$
where $\sigma$ is an activation function, $w_i$ are weights, $x_i$ are inputs, and $b$ is a bias.
In the log-odds conjunction, the inputs are $x_i = \log P_i$, the weights are $w_i = 1/n$ (uniform for the geometric mean), and the bias is $b = \alpha \cdot \log n$. But $P_i$ is itself $\sigma(\alpha_i s_i + \beta_i')$, with $\beta_i' = -\alpha_i \beta_i$ absorbing the calibration offset from Layer 1, so the full computation is:
$$P_{\text{final}} = \sigma\left(\text{logit}\left(\exp\left(\frac{1}{n}\sum_{i=1}^{n} \log \sigma(\alpha_i s_i + \beta_i')\right)\right) + \alpha \log n\right)$$
This is a composition of nonlinear activations separated by linear transformations — the defining characteristic of a feedforward neural network.
5.4 A Note on the Activation Function
It is worth pausing on the fact that the same function — the sigmoid — appears at both layers. In contemporary deep learning, practitioners choose activation functions (ReLU, GELU, Swish) based on empirical performance. In our derivation, the sigmoid is not chosen at all. It appears because it is the Bayesian posterior under the likelihood model. It appears again because the logit-sigmoid pair constitutes the canonical link function for Bernoulli random variables.
The sigmoid's dual role — as a Bayesian posterior and as a neural activation — is not a coincidence. Both arise from the same mathematical object: the logistic function, which is the natural parameter-to-mean mapping in the exponential family for binary outcomes. The neuron and the Bayesian posterior are the same thing expressed in different vocabularies.
6. Implications for Understanding Neural Networks
6.1 The Direction of Explanation
The standard narrative in machine learning proceeds from neurons to probabilistic interpretation: we build neural networks, then analyze them probabilistically. Bayesian neural networks, variational inference, and probabilistic deep learning all follow this direction — taking an existing neural architecture and asking what probabilistic model it corresponds to.
Our derivation reverses this direction. We begin with probability and arrive at neurons. This reversal carries significant implications.
If the neural structure is a consequence of probabilistic inference rather than a design decision, then it is not specific to any particular architecture or engineering choice. It is a property of the mathematics itself. Any system that performs Bayesian calibration of evidence and combines multiple signals through principled probabilistic reasoning will inevitably instantiate a neural computation.
6.2 Interpretability Without Inspection
Modern Explainable AI (XAI) approaches attempt to understand neural networks by inspecting them after training — examining activation patterns, computing attention maps, performing gradient-based attribution. These methods treat the network as a black box to be opened.
Our framework suggests an alternative. If a neural network is derived from a probabilistic inference problem, then each component already has an interpretation:
- First sigmoid layer: Bayesian calibration — "what is the probability that this signal indicates relevance?"
- Log-space aggregation: Evidence accumulation — "what is the average strength of evidence across signals?"
- Additive bias: Agreement bonus — "how much should multi-signal agreement increase confidence?"
- Second sigmoid layer: Posterior computation — "what is the final probability of relevance?"
There is nothing to "explain" because the derivation is the explanation. The structure was never opaque — it was generated by a transparent mathematical process.
6.3 The Question Behind the Black Box
This suggests a broader research program for neural network interpretability: rather than asking "what is this network doing?" (a reverse-engineering problem), ask "what probabilistic inference problem would produce this network?" (a mathematical derivation problem).
If we can identify the implicit probabilistic question that a trained network answers, then the network's behavior becomes interpretable not through post-hoc analysis but through the structure of the question itself.
We do not claim this is feasible for arbitrary deep networks in their current form. But the existence of at least one concrete case — a production search engine where the correspondence is exact and fully traced — establishes that the program is not vacuous.
7. WAND and BMW as Neural Pruning
7.1 The Pruning Problem in Neural Inference
A major computational challenge in deploying large neural networks is inference cost. Various pruning techniques have been developed to skip unnecessary computations: early exit, conditional computation, activation sparsity. Most of these methods are approximate — they sacrifice some accuracy for speed, relying on heuristics or learned gating mechanisms.
7.2 Exact Pruning in Search
In information retrieval, the WAND (Weak AND) and BMW (Block-Max WAND) algorithms achieve something remarkable: they skip large portions of documents during scoring with zero loss in ranking accuracy. The key property enabling this is the existence of computable upper bounds.
Definition 7.2.1 (WAND Pruning Condition). A document $d$ can be safely skipped if:
$$\sum_{t \in q} \text{ub}(t) < \theta$$
where $\text{ub}(t)$ is the maximum possible score contribution of term $t$, and $\theta$ is the current $k$-th highest score.
As shown in our Bayesian BM25 paper (Theorem 6.1.2), the monotonicity of the sigmoid transformation guarantees that BM25 upper bounds remain valid for Bayesian BM25. If a document cannot achieve a BM25 score sufficient to enter the top-$k$, it cannot achieve a Bayesian BM25 probability sufficient to enter the top-$k$.
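A sketch of the skip test in the calibrated setting is given below. The names and parameter values are ours; the check is phrased in probability space, which the monotonicity argument makes equivalent to the raw-score condition of Definition 7.2.1:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def can_skip(term_upper_bounds: list[float], theta: float,
             alpha: float, beta: float) -> bool:
    """Safe to skip a document if even its best possible score cannot clear the
    current threshold. Because the calibration sigmoid is monotonic, the upper
    bound survives the transformation, so no top-k document is ever pruned."""
    max_score = sum(term_upper_bounds)              # best-case BM25 contribution
    max_prob = sigmoid(alpha * (max_score - beta))  # best-case calibrated probability
    return max_prob < theta                         # theta: current k-th best probability

# Illustrative: three query terms; the current k-th best probability is 0.9.
can_skip([2.1, 1.4, 0.8], theta=0.9, alpha=0.5, beta=8.0)   # True: sigmoid(0.5*(4.3-8)) ~ 0.14
```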
7.3 Neural Translation
Translated into neural network terminology, WAND computes: "the maximum possible activation of this neuron, given an upper bound on its input, is below the threshold — therefore skip the computation entirely."
BMW refines this by partitioning the input space (documents) into blocks, precomputing maximum activations per block, and performing block-level pruning:
"No input in this entire block can produce an activation above threshold — skip the entire block."
The empirical skip rates from our paper are:
- Rare terms (high IDF): 90–99% of documents skipped
- Mixed queries: 50–80% skipped
- Common terms: 10–30% skipped
All with an exact top-$k$ guarantee. No accuracy loss whatsoever.
7.4 Why This Works and What It Requires
The critical enabler is that the output space is bounded — probabilities live in $[0, 1]$ — and the transformation is monotonic. These two properties together make upper bounds computable and safe pruning possible.
In standard deep learning, neither property is guaranteed. Unbounded activations (ReLU) and non-monotonic transformations make it impossible to compute tight upper bounds on neuron outputs without actually computing them.
But if a neural network layer is understood as performing Bayesian inference — with sigmoid activations derived from probabilistic reasoning — then both properties hold by construction. Probabilities are bounded. The sigmoid is monotonic. WAND-style pruning becomes mathematically valid.
This observation points toward a potential transfer of three decades of information retrieval optimization techniques to neural network inference — not as heuristic approximations, but as exact algorithms with formal safety guarantees.
8. The Deeper Unity
8.1 Three Traditions, One Structure
We have now traced three intellectual traditions that converge on the same mathematical structure:
Probability theory asks: given evidence, what is the posterior probability of a hypothesis? The answer involves Bayes' theorem, likelihood ratios, and the logistic function as the canonical link for binary outcomes.
Information retrieval asks: given query terms and document features, how relevant is this document? The answer involves BM25 scoring, IDF weighting, and — as we have shown — sigmoid calibration with log-odds evidence accumulation.
Neural computation asks: given input signals and learned parameters, what is the output activation? The answer involves weighted linear combination, bias terms, and nonlinear activation functions.
These three questions, posed independently by different communities over different decades, produce the same computational graph: inputs → linear transformation → sigmoid → linear transformation → sigmoid.
8.2 Why Sigmoid?
The recurrence of the sigmoid across all three domains is not a coincidence but a mathematical inevitability. The logistic function is the unique function that satisfies the following system of constraints simultaneously:
- Maps $\mathbb{R} \to [0, 1]$ (probability axioms)
- Is the inverse of the log-odds function (exponential family canonical link)
- Has the self-referential derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ (enabling gradient-based learning)
- Satisfies $\sigma(-x) = 1 - \sigma(x)$ (symmetry between positive and negative evidence)
- Arises as the maximum entropy distribution under first-moment constraints
Any system that processes binary evidence — relevant or not, firing or not, true or false — and respects these natural constraints will arrive at the sigmoid. The neuron does not imitate the Bayesian posterior. The Bayesian posterior does not imitate the neuron. Both are the sigmoid because nothing else can satisfy the constraints of binary probabilistic reasoning.
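Two of these constraints, the self-referential derivative and the evidence symmetry, are easy to confirm numerically (a quick sanity check, nothing more):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    s = sigmoid(x)
    # Self-referential derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
    numeric_derivative = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6
    assert abs(numeric_derivative - s * (1.0 - s)) < 1e-6
    # Evidence symmetry: sigma(-x) = 1 - sigma(x).
    assert abs(sigmoid(-x) - (1.0 - s)) < 1e-12
```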
8.3 Robertson's Unfinished Circle
In 1976, Stephen Robertson introduced the Probability Ranking Principle, arguing that optimal retrieval is achieved by ranking documents in decreasing order of probability of relevance. He and colleagues subsequently derived BM25 from a probabilistic model of term occurrence, titling their foundational work The Probabilistic Relevance Framework.
Yet BM25 scores are not probabilities. The framework begins in probability and ends in an unbounded real number. For nearly five decades, this gap persisted — acknowledged in textbooks, worked around in practice, never formally closed.
Bayesian BM25 closes this gap. And in doing so, it reveals that the probabilistic relevance framework, when completed, naturally produces the computational structure that a separate scientific tradition would call a neural network.
Robertson opened a door in 1976. We now see what was on the other side.
9. Conclusion
We did not set out to build a neural network. We set out to answer a simple question: what is the probability that this document is relevant? We applied Bayes' theorem and the sigmoid emerged. We combined multiple signals through log-odds accumulation and a second sigmoid emerged. We looked at what we had written down and recognized a neuron.
This is perhaps the most striking aspect of the result. The neuron was not designed — it was discovered, latent within the structure of probabilistic inference over binary relevance judgments. The implications extend beyond information retrieval:
- For neural network theory: The existence of at least one concrete, fully interpretable case where a neural architecture arises from first-principles probability suggests that other architectures may admit similar derivations.
- For explainability: Networks derived from probabilistic reasoning are interpretable by construction. Each layer corresponds to a well-defined step in Bayesian inference.
- For efficient inference: Probabilistically derived networks inherit the bounded, monotonic properties that enable exact pruning algorithms — a potential bridge from three decades of IR optimization to neural network deployment.
The mathematics does not care what we call things. Whether we say "Bayesian posterior" or "neural activation," the sigmoid function appears wherever binary evidence is processed under uncertainty. The neuron is not an invention of neuroscience or machine learning. It is a theorem of probability.
This essay is a follow-up to Progressive and Adaptive Hyperparameter Estimation in BM25 Probability Transformation and the paper Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search.