Progressive and Adaptive Hyperparameter Estimation in BM25 Probability Transformation: A Unified Approach

1. Introduction

The transformation of BM25 similarity scores into probability estimates represents a critical challenge in information retrieval systems. This process is essential for creating interpretable search results and enabling integration with probabilistic frameworks. While supervised learning approaches using query-document relevance pairs typically yield optimal results, practical implementations often face a scarcity of labeled data. This essay explores a dual approach to hyperparameter estimation that requires no explicit relevance judgments: progressive estimation during indexing, combined with adaptive adjustment driven by real-time query feedback. Particular attention is given to sigmoid-based Bayesian transformations.

2. BM25 Formula and Information-Theoretic Interpretation

2.1 The BM25 Scoring Function

The BM25 ranking function, rooted in the probabilistic retrieval framework, computes similarity scores between a query $Q$ and a document $D$ as follows:

$$\text{BM25}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$$

Where:

  • $q_i$ represents terms in query $Q$
  • $f(q_i, D)$ is the term frequency of $q_i$ in document $D$
  • $|D|$ is the length of document $D$ (number of words)
  • $\text{avgdl}$ is the average document length in the corpus
  • $k_1$ and $b$ are free parameters (typically $k_1 \in [1.2, 2.0]$ and $b \approx 0.75$)
  • $\text{IDF}(q_i)$ is the inverse document frequency of term $q_i$, calculated as:

$$\text{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

Where $N$ is the total number of documents and $n(q_i)$ is the number of documents containing term $q_i$.
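
To make the formula concrete, here is a minimal sketch of a BM25 scorer, assuming pre-tokenized term lists and precomputed corpus statistics (the function and argument names are illustrative, not from any particular library):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    doc_freqs maps each term to n(q_i), the number of documents
    containing it; num_docs is N; avgdl is the average document length.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        n_q = doc_freqs.get(term, 0)
        idf = math.log((num_docs - n_q + 0.5) / (n_q + 0.5))
        f = tf[term]  # term frequency f(q_i, D); 0 if absent
        denom = f + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```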

2.2 Information-Theoretic Interpretation

From an information theory perspective, BM25 can be understood through several fundamental concepts:

2.2.1 IDF as Information Content

The IDF component quantifies the information content or "surprise value" of a term. In information theory, the information content of an event with probability $p$ is defined as $-\log(p)$. Similarly, IDF approximates $-\log(p(q_i))$, where $p(q_i)$ is the probability of term $q_i$ occurring in a random document. Thus:

$$\text{IDF}(q_i) \approx -\log\left(\frac{n(q_i)}{N}\right)$$

Rare terms (low $p(q_i)$) carry high information content, while common terms carry less information—aligning with the intuition that distinctive terms should contribute more to relevance scoring.

2.2.2 BM25 as Divergence Measure

The BM25 function can be interpreted as measuring the divergence between the query term distribution and document term distribution. In information theory, the Kullback-Leibler (KL) divergence measures how one probability distribution differs from another:

$$D_{KL}(P||Q) = \sum_{i} P(i) \log\frac{P(i)}{Q(i)}$$

BM25's structure parallels this divergence measure, with term weighting informed by distributional properties of terms across the corpus.

2.2.3 Relationship to Language Models

BM25 can be related to language model approaches through the lens of cross-entropy. If we view the query as being generated by a document language model $\theta_D$, then scoring can be framed as:

$$\text{score}(Q,D) \propto -H(Q,\theta_D) = \sum_{i=1}^{n} P(q_i|Q) \log P(q_i|\theta_D)$$

The BM25 term weighting approximates this cross-entropy calculation while incorporating document length normalization and term saturation effects.

3. Sigmoid-Based Bayesian Framework for BM25 Score Transformation

3.1 Bayesian Interpretation of Probability Transformation

From a Bayesian perspective, we want to estimate $P(r|s)$ - the probability of a document being relevant given its BM25 score $s$. Using Bayes' theorem:

$$P(r|s) = \frac{P(s|r)P(r)}{P(s|r)P(r) + P(s|\neg r)P(\neg r)}$$

Where:

  • $P(r|s)$ is the posterior probability of relevance given the score
  • $P(s|r)$ is the likelihood of observing score $s$ for a relevant document
  • $P(s|\neg r)$ is the likelihood of observing score $s$ for a non-relevant document
  • $P(r)$ is the prior probability of relevance
  • $P(\neg r)$ is the prior probability of non-relevance $(1-P(r))$

3.2 Sigmoid Function and Its Bayesian Equivalence

The sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

We can show that under certain distributional assumptions about relevant and non-relevant documents, the posterior probability $P(r|s)$ takes the form of a sigmoid function:

$$P(r|s) = \frac{1}{1 + e^{-(as + b)}}$$

Where $a$ and $b$ are parameters that encode information about:

  • The distributions of scores for relevant and non-relevant documents
  • The prior probabilities of relevance and non-relevance

3.3 Distributional Assumptions and Parameter Derivation

Let's assume that BM25 scores follow normal distributions for both relevant and non-relevant documents:

$$P(s|r) \sim \mathcal{N}(\mu_r, \sigma_r^2)$$
$$P(s|\neg r) \sim \mathcal{N}(\mu_{\neg r}, \sigma_{\neg r}^2)$$

If we further assume equal variances $(\sigma_r^2 = \sigma_{\neg r}^2 = \sigma^2)$, the log-odds $\ln\frac{P(s|r)P(r)}{P(s|\neg r)P(\neg r)}$ is linear in $s$, and the posterior takes exactly the sigmoid form:

$$P(r|s) = \frac{1}{1 + e^{-(as + b)}}$$

Where:

$$a = \frac{\mu_r - \mu_{\neg r}}{\sigma^2}$$

$$b = \frac{\mu_{\neg r}^2 - \mu_r^2}{2\sigma^2} + \ln\frac{P(r)}{P(\neg r)}$$

With unequal variances, the log-odds acquires a quadratic term in $s$, so the linear sigmoid should then be regarded as an approximation rather than an exact consequence of the model.
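
As a quick check of this derivation, the following sketch computes $a$ and $b$ from assumed distribution parameters; the numeric values in the usage example are hypothetical:

```python
import math

def sigmoid_params_from_gaussians(mu_r, mu_nr, sigma, prior_r):
    """Sigmoid parameters for P(r|s) = 1 / (1 + exp(-(a*s + b)))
    under the equal-variance Gaussian assumption."""
    a = (mu_r - mu_nr) / sigma**2
    b = (mu_nr**2 - mu_r**2) / (2 * sigma**2) + math.log(prior_r / (1 - prior_r))
    return a, b

# Hypothetical values: relevant docs score ~10, non-relevant ~5, 1% prior.
a, b = sigmoid_params_from_gaussians(mu_r=10.0, mu_nr=5.0, sigma=2.0, prior_r=0.01)
p = 1.0 / (1.0 + math.exp(-(a * 9.0 + b)))  # posterior relevance for a score of 9.0
```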

3.4 Information-Theoretic Basis for Sigmoid Transformation

Converting BM25 scores to probabilities via sigmoid can be justified through the principle of maximum entropy. When transforming scores to probabilities, we seek the distribution that:

  1. Preserves the expected score (constraint)
  2. Maximizes entropy (uncertainty)

The solution to this constrained optimization is the exponential family of distributions, which includes the logistic function:

$$P(r|s) = \frac{1}{1 + e^{-(as + b)}}$$

This transformation maintains the relative ordering of documents while converting scores to a probabilistic framework consistent with information theory principles.

4. Estimating Parameters Without Labeled Data

4.1 Simplifying Assumptions for Parameter Estimation

When we lack a query dataset with relevance judgments, we can estimate the parameters using several practical assumptions:

  1. We assume a low prior probability of relevance: $P(r) = p$ (typically $p \approx 0.01$, meaning about 1% of documents are relevant to a typical query)

  2. We assume that the mean score for relevant documents is significantly higher than for non-relevant documents, with:

    • $\mu_{\neg r} \approx \mu_S$ (the mean of all scores, since most documents are non-relevant)
    • $\mu_r \approx \mu_S + 2\sigma_S$ (relevant documents have scores about 2 standard deviations above the mean)

  3. We assume equal variances: $\sigma_r^2 = \sigma_{\neg r}^2 = \sigma_S^2$

4.2 Resulting Parameter Estimates

Under these assumptions, our parameters become:

$$a \approx \frac{2\sigma_S}{\sigma_S^2} = \frac{2}{\sigma_S}$$

$$b \approx \frac{\mu_S^2 - (\mu_S + 2\sigma_S)^2}{2\sigma_S^2} + \ln\frac{p}{1-p} \approx -\frac{2\mu_S + 2\sigma_S}{\sigma_S} + \ln\frac{p}{1-p}$$

Simplifying:

$$a \approx \frac{2}{\sigma_S}$$

$$b \approx -\frac{2\mu_S}{\sigma_S} - 2 + \ln\frac{p}{1-p}$$

For typical values like $p = 0.01$, we have $\ln\frac{p}{1-p} \approx \ln(0.01/0.99) \approx -4.6$, so:

$$b \approx -\frac{2\mu_S}{\sigma_S} - 6.6$$
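
A small sketch of these label-free estimates, assuming $\mu_S$ and $\sigma_S$ have already been computed from sampled scores:

```python
import math

def heuristic_sigmoid_params(mu_s, sigma_s, prior_r=0.01):
    """Label-free estimates under the assumptions above: relevant documents
    are taken to score about two standard deviations above the corpus mean."""
    a = 2.0 / sigma_s
    b = -2.0 * mu_s / sigma_s - 2.0 + math.log(prior_r / (1 - prior_r))
    return a, b
```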

4.3 Distribution-Based Parameter Estimation

More generally, during the indexing phase, we can analyze the statistical distribution of BM25 scores across the corpus to derive initial parameter estimates. Given that no query dataset exists, we leverage the empirical distribution of potential BM25 scores.

Let $S = \{s_1, s_2, ..., s_n\}$ represent a sample of BM25 scores generated by applying random term combinations to the index. We can derive alternative initial parameter estimates as follows:

$$\hat{a} = \frac{\ln(9)}{\sigma_S}$$

$$\hat{b} = -\hat{a} \cdot \mu_S$$

This calibration ensures that a score at $\mu_S + \sigma_S$ maps to probability 0.9, while a score at $\mu_S - \sigma_S$ maps to 0.1 (note the natural logarithm: $\sigma(\ln 9) = 9/10$).
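
The alternative calibration is equally direct; the check at the end confirms the 0.9 anchor point (the corpus statistics used here are hypothetical):

```python
import math

def calibration_params(mu_s, sigma_s):
    """Calibrate so that mu_s + sigma_s maps to 0.9 and mu_s - sigma_s to 0.1."""
    a = math.log(9) / sigma_s   # natural log, so sigmoid(ln 9) = 0.9
    b = -a * mu_s
    return a, b

a, b = calibration_params(mu_s=12.0, sigma_s=3.0)
p = 1.0 / (1.0 + math.exp(-(a * 15.0 + b)))   # 15.0 = mu_s + sigma_s -> 0.9
```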

5. Progressive Parameter Estimation During Indexing

5.1 Iterative Refinement During Indexing

As indexing progresses, we can iteratively refine these parameter estimates using batch-based updating. For each batch of documents $B_t$ indexed at time $t$, we compute:

$$S_t = \{\text{scores from random term samples in } B_t\}$$

$$\mu_{S_t}, \sigma_{S_t} = \text{mean and std. dev. of } S_t$$

We then update our parameter estimates using exponential moving averages:

$$\hat{a}_t = (1-\alpha) \cdot \hat{a}_{t-1} + \alpha \cdot \frac{2}{\sigma_{S_t}}$$

$$\hat{b}_t = (1-\alpha) \cdot \hat{b}_{t-1} + \alpha \cdot (-\frac{2\mu_{S_t}}{\sigma_{S_t}} - 6.6)$$

Where $\alpha \in (0,1)$ is a smoothing parameter controlling adaptation rate. This approach allows parameters to evolve gradually as more documents are indexed, capturing the evolving characteristics of the corpus.
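
A sketch of one batch update, assuming `batch_scores` holds BM25 scores sampled from the newly indexed batch (the prior $p = 0.01$ is baked into the $-6.6$ constant, as in Section 4.2):

```python
import statistics

def ema_update(a_prev, b_prev, batch_scores, alpha=0.1):
    """One exponential-moving-average step over a batch of sampled scores."""
    mu = statistics.mean(batch_scores)
    sigma = statistics.stdev(batch_scores)
    a_new = (1 - alpha) * a_prev + alpha * (2.0 / sigma)
    b_new = (1 - alpha) * b_prev + alpha * (-2.0 * mu / sigma - 6.6)
    return a_new, b_new
```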

5.2 Density Estimation for Enhanced Calibration

For more sophisticated calibration, we can employ kernel density estimation (KDE) to model the underlying distribution of scores:

$$\hat{f}(s) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{s-s_i}{h}\right)$$

Where $K$ is a kernel function (e.g., Gaussian) and $h$ is the bandwidth parameter. By modeling the distribution directly, we can derive more precise transformations that respect the empirical characteristics of the score distribution.
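
A minimal KDE sketch using SciPy's Gaussian kernel estimator; the sampled scores below are synthetic stand-ins for scores drawn from the index:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scores = rng.gamma(shape=2.0, scale=3.0, size=5000)  # stand-in for sampled BM25 scores

kde = gaussian_kde(scores)                 # Gaussian kernel; bandwidth via Scott's rule
density = kde.evaluate([5.0, 10.0, 15.0])  # estimated density at three score values
```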

6. Adaptive Parameter Adjustment with Real Queries

6.1 Pseudo-Relevance Feedback Approach

Once the system is operational, real user queries provide valuable information for refining parameter estimates. Using a pseudo-relevance feedback approach, we assume that top-ranked results have higher probability of relevance than lower-ranked results.

For each query $q$ with results $\{d_1, d_2, ..., d_k\}$ and corresponding BM25 scores $\{s_1, s_2, ..., s_k\}$, we generate synthetic relevance labels:

$$\hat{r}_i = \exp\left(-\lambda \cdot \frac{\text{rank}(d_i)-1}{k-1}\right)$$

Where $\lambda > 0$ controls the decay rate of assumed relevance with rank position. This creates a continuous spectrum of pseudo-relevance judgments from high (top results) to low (bottom results).
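
Generating these labels is a one-liner; the sketch below assumes 0-indexed ranks and $k > 1$:

```python
import math

def pseudo_labels(k, lam=2.0):
    """Pseudo-relevance labels decaying from 1.0 (top rank) to exp(-lam)."""
    return [math.exp(-lam * i / (k - 1)) for i in range(k)]
```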

6.2 Online Logistic Regression for Sigmoid Parameters

With these pseudo-relevance labels, we can perform online logistic regression to update our sigmoid parameters incrementally:

$$\hat{a}_{t+1} = \hat{a}_t + \eta \cdot \sum_{i=1}^{k} (\hat{r}_i - \hat{P}(r|s_i)) \cdot s_i$$

$$\hat{b}_{t+1} = \hat{b}_t + \eta \cdot \sum_{i=1}^{k} (\hat{r}_i - \hat{P}(r|s_i))$$

Where $\eta$ is the learning rate and $\hat{P}(r|s_i) = \frac{1}{1 + e^{-(\hat{a}_t s_i + \hat{b}_t)}}$ is the current probability estimate. This stochastic gradient step (ascent on the pseudo-label log-likelihood, equivalently descent on the cross-entropy loss) allows the parameters to adapt to observed query-result patterns.
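
One online update, written against the pseudo-labels above (the learning rate value is illustrative):

```python
import math

def online_update(a, b, scores, labels, eta=0.01):
    """One gradient step on a single query's results."""
    grad_a = grad_b = 0.0
    for s, r in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # current estimate P(r|s)
        grad_a += (r - p) * s
        grad_b += (r - p)
    return a + eta * grad_a, b + eta * grad_b
```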

6.3 Bayesian Online Learning

A more sophisticated approach employs Bayesian online learning to maintain uncertainty estimates about the parameters. We model the parameters as random variables with prior distributions:

$$a \sim \mathcal{N}(\mu_a, \sigma_a^2)$$

$$b \sim \mathcal{N}(\mu_b, \sigma_b^2)$$

After observing query results, we update these distributions using Bayes' rule:

$$P(a,b|D_t) \propto P(D_t|a,b)P(a,b|D_{t-1})$$

Where $D_t$ represents the observed data up to time $t$. While exact Bayesian inference is intractable for logistic regression, approximations like Laplace approximation or variational inference can be employed.
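
A compact sketch of one Laplace-approximation update for the two-parameter model, assuming pseudo-labels as in Section 6.1 and Gaussian priors over $(a, b)$ (all numeric values in the usage lines are hypothetical):

```python
import numpy as np

def laplace_update(mean, cov, scores, labels, n_iter=10):
    """Update a Gaussian posterior over (a, b) from one query's results.

    mean: prior mean, shape (2,); cov: prior covariance, shape (2, 2).
    Finds the MAP by Newton iteration, then sets the posterior covariance
    to the inverse negative Hessian at the MAP (Laplace approximation).
    """
    theta = mean.copy()
    prec_prior = np.linalg.inv(cov)
    x = np.column_stack([scores, np.ones(len(scores))])  # design matrix [s, 1]
    y = np.asarray(labels)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-x @ theta))
        grad = x.T @ (y - p) - prec_prior @ (theta - mean)
        hess = -(x.T * (p * (1 - p))) @ x - prec_prior
        theta -= np.linalg.solve(hess, grad)     # Newton step toward the MAP
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    prec_post = prec_prior + (x.T * (p * (1 - p))) @ x
    return theta, np.linalg.inv(prec_post)

mean, cov = np.array([1.0, -1.0]), np.eye(2)     # hypothetical priors on (a, b)
mean, cov = laplace_update(mean, cov, np.array([8.0, 6.5, 5.0]), [1.0, 0.6, 0.14])
```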

7. Combined Approach: Dual Estimation Strategy

7.1 Integration Framework

The dual approach integrates progressive estimation during indexing with adaptive adjustment during query processing. The key insight is recognizing these as complementary processes operating at different timescales with different information sources.

We combine these approaches through a weighted interpolation:

$$\hat{a}_{\text{combined}} = w_{\text{index}} \cdot \hat{a}_{\text{index}} + w_{\text{query}} \cdot \hat{a}_{\text{query}}$$

$$\hat{b}_{\text{combined}} = w_{\text{index}} \cdot \hat{b}_{\text{index}} + w_{\text{query}} \cdot \hat{b}_{\text{query}}$$

Where the weights satisfy $w_{\text{index}} + w_{\text{query}} = 1$ and evolve over time, typically increasing $w_{\text{query}}$ as more real query data becomes available.

7.2 Adaptive Weight Adjustment

The relative contribution of indexing-based versus query-based estimates can itself be adaptively adjusted. We model this as a dynamic process:

$$w_{\text{query}}(t) = 1 - \exp(-\gamma \cdot N_{\text{queries}}(t))$$

Where $N_{\text{queries}}(t)$ is the number of observed queries at time $t$, and $\gamma > 0$ controls the rate at which query-based estimates gain prominence. This ensures a smooth transition from primarily index-based estimation to query-based estimation as user interaction data accumulates.
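
The schedule and the interpolation combine into a few lines; the $\gamma$ default below is a hypothetical tuning value:

```python
import math

def combine_estimates(a_idx, b_idx, a_qry, b_qry, n_queries, gamma=0.001):
    """Blend index-time and query-time estimates, shifting weight toward
    the query-based estimates as observed queries accumulate."""
    w_qry = 1.0 - math.exp(-gamma * n_queries)
    w_idx = 1.0 - w_qry
    return (w_idx * a_idx + w_qry * a_qry,
            w_idx * b_idx + w_qry * b_qry)
```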

7.3 Confidence-Weighted Combination

A more principled approach incorporates uncertainty estimates for each parameter source:

$$\hat{a}_{\text{combined}} = \frac{\hat{a}_{\text{index}}/\sigma^2_{a,\text{index}} + \hat{a}_{\text{query}}/\sigma^2_{a,\text{query}}}{1/\sigma^2_{a,\text{index}} + 1/\sigma^2_{a,\text{query}}}$$

$$\hat{b}_{\text{combined}} = \frac{\hat{b}_{\text{index}}/\sigma^2_{b,\text{index}} + \hat{b}_{\text{query}}/\sigma^2_{b,\text{query}}}{1/\sigma^2_{b,\text{index}} + 1/\sigma^2_{b,\text{query}}}$$

This represents an optimal combination under Gaussian uncertainty assumptions, weighting each estimate by its precision (inverse variance).
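
The precision-weighted combination applies the same formula to each parameter; a minimal sketch:

```python
def precision_weighted(x_idx, var_idx, x_qry, var_qry):
    """Inverse-variance (precision) weighting of two estimates of the
    same parameter; call once for a and once for b."""
    w_idx, w_qry = 1.0 / var_idx, 1.0 / var_qry
    return (w_idx * x_idx + w_qry * x_qry) / (w_idx + w_qry)
```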

8. Information-Theoretic Analysis of the Dual Approach

8.1 Minimizing Cross-Entropy Loss

From an information theory perspective, the goal of learning optimal parameters $(a,b)$ can be framed as minimizing the cross-entropy between the true relevance distribution $P^*(r|s)$ and our estimated distribution $P_{\theta}(r|s)$ where $\theta = (a,b)$:

$$H(P^*, P_{\theta}) = -\mathbb{E}_{s \sim D}[P^*(r|s)\log P_{\theta}(r|s) + (1-P^*(r|s))\log(1-P_{\theta}(r|s))]$$

Our dual estimation approach can be viewed as approximating this minimization without access to the true distribution $P^*(r|s)$, using instead:

  1. The empirical distribution of scores during indexing
  2. Pseudo-relevance feedback from real queries

8.2 Parameter Estimation as Mutual Information Maximization

The process of finding optimal parameters can also be interpreted as maximizing the mutual information between BM25 scores and relevance. Mutual information is defined as:

$$I(S;R) = H(R) - H(R|S)$$

Where $H(R)$ is the entropy of the relevance distribution and $H(R|S)$ is the conditional entropy of relevance given the score. By learning parameters that yield probabilities closely aligned with true relevance, we effectively minimize $H(R|S)$, thus maximizing $I(S;R)$.

8.3 Connection to Maximum Entropy Principle

The sigmoid transformation we employ satisfies the maximum entropy principle—it produces the distribution with maximum entropy subject to constraints on expected values. This aligns with information theory's preference for distributions that avoid unnecessary assumptions.

For our score-to-probability mapping, the sigmoid function can be derived as the maximum entropy distribution when we constrain only the expected value of the score for relevant documents. This provides another theoretical justification for our choice of transformation function.

9. Theoretical Analysis

9.1 Convergence Properties

Under certain regularity conditions, the dual estimation approach can be shown to converge to optimal parameter values. The rate of convergence depends on several factors:

  1. The true underlying relationship between BM25 scores and relevance
  2. The diversity and representativeness of indexed documents
  3. The diversity and representativeness of user queries
  4. The learning rates and adaptation parameters

For the query-based component specifically, if user queries exhibit sufficient exploration of the document space, stochastic gradient descent convergence theorems apply, yielding:

$$E[(\hat{a}_t - a^*)^2 + (\hat{b}_t - b^*)^2] \leq \frac{C}{\sqrt{t}}$$

Where $a^*$ and $b^*$ are the optimal parameter values, and $C$ is a constant depending on the learning rate and problem characteristics.

9.2 Regret Analysis

From a decision-theoretic perspective, we can analyze the regret of the dual estimation approach—the cumulative loss incurred by using estimated parameters rather than optimal ones:

$$R(T) = \sum_{t=1}^{T} [L(\hat{a}_t, \hat{b}_t) - L(a^*, b^*)]$$

Where $L(a,b)$ is a loss function measuring the quality of probability estimates under parameters $a$ and $b$. Under standard assumptions, the regret of the dual estimation approach can be bounded by $O(\sqrt{T})$, a favorable sublinear rate indicating diminishing average regret over time.

9.3 Bias-Variance Tradeoff

The dual estimation approach manages the bias-variance tradeoff effectively:

  1. The indexing-based component provides low-variance estimates but may be biased due to the absence of relevance judgments.
  2. The query-based component gradually reduces bias through direct observation but exhibits higher variance, especially with limited data.

The weighted combination mitigates these limitations, with the balance shifting toward the query-based component as confidence increases.

10. Practical Implementation Strategy

Combining the indexing-time and query-time approaches, we recommend the following steps (a consolidated code sketch follows the list):

  1. Initialize with conservative defaults: $a = 1.0$, $b = -1.0$

  2. During indexing: Progressively update parameter estimates based on score distributions
    $$a = \frac{2}{\sigma_S}$$
    $$b = -\frac{2\mu_S}{\sigma_S} - 6.6$$

  3. During query processing: Refine parameters using pseudo-relevance feedback
    $$P(r|s) = \frac{1}{1 + e^{-(as + b)}}$$

  4. Combine estimates with a weighted average that gradually shifts from indexing-based to query-based estimates:
    $$a_{\text{combined}} = w_{\text{index}} \cdot a_{\text{index}} + w_{\text{query}} \cdot a_{\text{query}}$$
    $$b_{\text{combined}} = w_{\text{index}} \cdot b_{\text{index}} + w_{\text{query}} \cdot b_{\text{query}}$$
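
A consolidated sketch tying the steps together, assuming the helper functions sketched in earlier sections (`ema_update`, `pseudo_labels`, `online_update`, `combine_estimates`); this is a skeleton of the strategy, not a complete system:

```python
import math

class DualCalibrator:
    """Dual estimation strategy: index-time and query-time calibration."""

    def __init__(self):
        self.a_idx, self.b_idx = 1.0, -1.0   # conservative defaults (step 1)
        self.a_qry, self.b_qry = 1.0, -1.0
        self.n_queries = 0

    def on_index_batch(self, batch_scores):
        # Step 2: progressive update from sampled scores of a new batch.
        self.a_idx, self.b_idx = ema_update(self.a_idx, self.b_idx, batch_scores)

    def on_query(self, result_scores):
        # Step 3: refine query-side estimates via pseudo-relevance feedback.
        labels = pseudo_labels(len(result_scores))
        self.a_qry, self.b_qry = online_update(
            self.a_qry, self.b_qry, result_scores, labels)
        self.n_queries += 1

    def probability(self, score):
        # Step 4: weighted combination, then sigmoid transformation.
        a, b = combine_estimates(self.a_idx, self.b_idx,
                                 self.a_qry, self.b_qry, self.n_queries)
        return 1.0 / (1.0 + math.exp(-(a * score + b)))
```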

11. Practical Considerations and Limitations

Several practical considerations arise in implementing the dual estimation approach:

  1. Computational Efficiency: The progressive estimation during indexing should be computationally lightweight to avoid impacting indexing performance.

  2. Cold Start Problem: Initially, when both indexing and query data are limited, parameter estimates may be unstable. The conservative defaults from Section 10 (e.g., $a=1.0$, $b=-1.0$) provide a reasonable starting point.

  3. Concept Drift: If the document corpus or query patterns evolve significantly over time, the system must adapt accordingly. This requires continuous monitoring and potentially time-decaying memory mechanisms.

  4. Evaluation Challenges: Without ground-truth relevance judgments, evaluating the quality of probability calibration becomes challenging. Indirect metrics like probability distribution uniformity or correlation with user engagement can serve as proxies.

12. Conclusion

This unified approach to hyperparameter estimation for BM25 probability transformation offers a theoretically sound and practically viable solution to the challenge of operating without explicit query-document relevance judgments. By combining progressive estimation during indexing with adaptive adjustment based on real queries, the system can effectively learn to transform raw similarity scores into meaningful probability estimates using the sigmoid function within a Bayesian framework.

From an information-theoretic perspective, this approach aligns with fundamental principles of entropy maximization and cross-entropy minimization. The BM25 formula itself has deep connections to information theory through its IDF component, which quantifies information content, and its overall structure, which approximates distributional divergence measures.

The sigmoid-based Bayesian transformation offers an elegant statistical foundation that connects the distributions of scores for relevant and non-relevant documents to the parameters of our transformation function. By leveraging both corpus statistics during indexing and implicit feedback from real queries, we create a robust system that continuously refines its probability calibration over time.

This approach exemplifies the philosophy of continuous learning systems that evolve through both data preparation (indexing) and user interaction (querying). It demonstrates how Bayesian and information-theoretic principles can guide the design of systems that operate under uncertainty and incrementally refine their behavior as more information becomes available.

Future research directions include extending this approach to more complex transformation models beyond the simple sigmoid function, incorporating explicit user feedback when available, and developing more sophisticated evaluation methodologies for probability calibration in the absence of ground truth.
