Addressing the Conjunction Fallacy in Probabilistic Information Retrieval: From Theory to Practice
1. Introduction
In our previous explorations of probabilistic frameworks for information retrieval, we examined how transformations like softmax and sigmoid convert raw similarity scores into probabilities, enabling principled fusion of heterogeneous retrieval systems. While these transformations provide elegant mathematical foundations for ranking, they introduce a critical challenge when handling conjunctive queries—the "conjunction fallacy." This phenomenon occurs when the naive multiplication of term probabilities causes conjunctive query results to converge toward zero as the number of terms increases, contradicting user expectations that documents matching more query terms should receive higher relevance scores.
Consider a document with three query terms, each independently assigned a probability of 0.8. Under the standard probabilistic approach, the joint probability would be $0.8 \times 0.8 \times 0.8 = 0.512$, and with five terms, this drops to $0.8^5 \approx 0.33$. As term count increases, this effect becomes severe—even with high individual term probabilities of $0.9$, a ten-term query yields a joint probability of approximately $0.35$. This creates a paradoxical situation where adding more matching terms to a query actually reduces the computed relevance probability, directly contradicting the Probability Ranking Principle's aim to accurately estimate $P(r=1|d,q)$.
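The decay described above is easy to reproduce. A minimal sketch in plain Python (the helper name `naive_joint` is illustrative, not from any library):

```python
# A naive conjunctive model multiplies independent term probabilities,
# so the joint score decays geometrically with query length.
def naive_joint(p, n):
    """Joint probability of n independent terms, each with probability p."""
    return p ** n

print(naive_joint(0.8, 3))   # ≈ 0.512
print(naive_joint(0.8, 5))   # ≈ 0.328
print(naive_joint(0.9, 10))  # ≈ 0.349
```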
This essay explores this conjunction fallacy in depth, proposing advanced mathematical frameworks and practical implementations that preserve the probabilistic foundation of information retrieval while addressing this critical limitation. We will examine how these solutions apply to both traditional BM25-based systems and hybrid search approaches that combine sparse and dense retrievers.
2. Theoretical Analysis of the Conjunction Problem
2.1 The Mathematical Origins of the Fallacy
The conjunction fallacy stems from the direct application of probability theory to query term independence. If we consider each query term $t_i$ as contributing a probability $P(t_i|d)$ of relevance, the traditional approach computes joint probability as:
$$P(t_1 \cap t_2 \cap ... \cap t_n | d) = \prod_{i=1}^{n} P(t_i|d)$$
This follows from the naive assumption that terms are conditionally independent given the document. However, this approach has two fundamental flaws:
- Diminishing probability with term count: As $n$ increases, the product $\prod_{i=1}^{n} P(t_i|d)$ monotonically decreases, even when all $P(t_i|d)$ are high.
- Violation of user expectations: Users intuitively expect documents matching more query terms to receive higher relevance scores, not lower ones.
2.2 Information-Theoretic Perspective
From an information theory standpoint, we can reframe this problem by considering the logarithm of probabilities, which corresponds to information content:
$$\log P(t_1 \cap t_2 \cap ... \cap t_n | d) = \sum_{i=1}^{n} \log P(t_i|d)$$
Since $\log P(t_i|d) < 0$ for all $P(t_i|d) < 1$, the sum becomes increasingly negative as terms are added. However, additional matching terms should actually provide evidence of relevance rather than penalize it.
2.3 Bayesian Perspective on Conjunction
A more nuanced Bayesian view treats query terms as evidence of document relevance. Let's denote document relevance as $r \in \{0,1\}$. The posterior probability of relevance given multiple query terms can be expressed using Bayes' theorem:
$$P(r=1|t_1,...,t_n,d) = \frac{P(t_1,...,t_n|r=1,d)P(r=1|d)}{P(t_1,...,t_n|d)}$$
Under the assumption of conditional independence of terms given relevance status, we can decompose this further. First, let's define two key components:
$$A = P(r=1|d)\prod_{i=1}^{n}P(t_i|r=1,d)$$
This represents the joint probability of document relevance and observing all query terms in a relevant document.
$$B = P(r=0|d)\prod_{i=1}^{n}P(t_i|r=0,d)$$
This represents the joint probability of document non-relevance and observing all query terms in a non-relevant document.
Using these components, the posterior probability simplifies to:
$$P(r=1|t_1,...,t_n,d) = \frac{A}{A + B}$$
This formulation reveals that the problem lies not with conjunction itself, but with our simplistic modeling of the relationship between terms and relevance. As $n$ increases, both $A$ and $B$ tend to decrease, since each is a product of probabilities. However, the rate of decrease depends on the relative magnitudes of $P(t_i|r=1,d)$ versus $P(t_i|r=0,d)$.
If a term is more likely to appear in relevant documents than non-relevant ones, i.e., $P(t_i|r=1,d) > P(t_i|r=0,d)$, then $A$ will decrease more slowly than $B$ as additional terms are added. This actually increases the posterior probability $\frac{A}{A+B}$ with each additional matching term, which aligns with user intuition.
The conjunction fallacy arises when we fail to account for these differential probabilities and instead naively multiply probabilities without considering the full Bayesian framework. This insight points us toward more sophisticated approaches to handling term conjunction in information retrieval systems.
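This differential effect is easy to verify numerically. The sketch below assumes, purely for illustration, a flat prior of $0.5$ and identical per-term likelihoods ($P(t_i|r{=}1,d)=0.8$, $P(t_i|r{=}0,d)=0.3$); the function name is hypothetical:

```python
from math import prod

def posterior_relevance(prior, p_rel, p_nonrel):
    """P(r=1 | t_1..t_n, d) = A / (A + B) under conditional independence."""
    a = prior * prod(p_rel)             # A: the relevance branch
    b = (1.0 - prior) * prod(p_nonrel)  # B: the non-relevance branch
    return a / (a + b)

# Each term is more likely in relevant documents (0.8 > 0.3), so the
# posterior rises toward 1 with every additional matching term.
for n in (1, 3, 5):
    print(n, posterior_relevance(0.5, [0.8] * n, [0.3] * n))
```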
3. Alternative Frameworks for Conjunction Handling
3.1 Geometric Mean Approach
One elegant solution to the conjunction fallacy is using the geometric mean of term probabilities instead of their product:
$$P_{geo}(t_1,...,t_n|d) = \left(\prod_{i=1}^{n} P(t_i|d)\right)^{1/n}$$
This formula effectively normalizes the joint probability by the number of terms, ensuring that adding more terms doesn't automatically decrease the score. The geometric mean preserves the relative ordering of documents that match the same number of terms while eliminating the penalty for higher term counts.
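The key property, that the score no longer shrinks with term count, can be checked directly (a minimal sketch using NumPy):

```python
import numpy as np

def geometric_mean(probs):
    """Length-normalized product: (p_1 * ... * p_n) ** (1/n)."""
    return float(np.exp(np.mean(np.log(probs))))

# Unlike the raw product (0.512 for three 0.8 terms), the geometric
# mean stays flat as terms are added:
print(geometric_mean([0.8] * 3))   # ≈ 0.8
print(geometric_mean([0.8] * 10))  # still ≈ 0.8
```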
3.2 Noisy-OR Model
The Noisy-OR model offers an alternative probabilistic framework that addresses conjunction challenges. It models relevance probability as:
$$P(r=1|t_1,...,t_n,d) = 1 - \prod_{i=1}^{n} (1 - P(r=1|t_i,d))$$
This approach:
- Treats each term as an independent "cause" of relevance
- Increases probability as more terms match (unlike naive multiplication)
- Converges to 1 as the number of high-probability terms increases
The Noisy-OR model is particularly effective when any single term can establish document relevance, aligning with disjunctive (OR) interpretations of queries.
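A compact sketch of the Noisy-OR combination (the function name is illustrative):

```python
def noisy_or(term_probs):
    """Relevance as 1 minus the probability that no term 'fires'."""
    p_none = 1.0
    for p in term_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# The probability grows toward 1 as matching terms accumulate:
print(noisy_or([0.5]))            # 0.5
print(noisy_or([0.5, 0.5]))       # 0.75
print(noisy_or([0.5, 0.5, 0.5]))  # 0.875
```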
3.3 Conjunction Bonus Approach
A more direct approach is to apply an explicit conjunction bonus proportional to the number of matching terms:
$$P_{bonus}(t_1,...,t_n|d) = P_{base}(t_1,...,t_n|d) \times (1 + \alpha \cdot f(n))$$
Where:
- $P_{base}$ is the base probability (e.g., geometric mean of term probabilities)
- $f(n)$ is a monotonically increasing function of the term count $n$
- $\alpha$ is a scaling parameter
A logarithmic function for $f(n)$ works well in practice:
$$f(n) = \log(n)$$
This provides a controlled bonus that increases with term count but avoids excessive scaling.
3.4 Log-Probability Space
Working in log-probability space offers computational advantages and conceptual clarity. We can define:
$$L(t_i|d) = \log P(t_i|d)$$
And then combine log-probabilities using modified operations that avoid the conjunction fallacy. For example, we might use the arithmetic mean of the log-probabilities:
$$L(t_1,...,t_n|d) = \frac{1}{n} \sum_{i=1}^{n} L(t_i|d)$$
This is equivalent to the geometric mean in probability space but computationally more stable.
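The equivalence is easy to confirm numerically (a small NumPy check with arbitrary example probabilities):

```python
import numpy as np

probs = np.array([0.9, 0.6, 0.75])

# Geometric mean computed directly in probability space...
geo = np.prod(probs) ** (1.0 / len(probs))

# ...and via the mean of log-probabilities, which avoids underflow
# when many small probabilities are multiplied.
log_geo = np.exp(np.mean(np.log(probs)))

print(np.isclose(geo, log_geo))  # True
```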
4. Implementation Strategies with Code Examples
4.1 Geometric Mean with Conjunction Bonus
This implementation combines the geometric mean with a logarithmic conjunction bonus:
```python
import numpy as np

def conjunction_score(term_probabilities, alpha=0.2):
    """
    Calculate a combined relevance score using the geometric mean
    and a logarithmic conjunction bonus.

    Args:
        term_probabilities: List of individual term probabilities
        alpha: Coefficient for the conjunction bonus (default: 0.2)

    Returns:
        Combined probability score
    """
    n = len(term_probabilities)
    if n == 0:
        return 0.0

    # Clamp away from zero so np.log never receives a zero probability
    probs = np.clip(term_probabilities, 1e-12, 1.0)

    # Calculate the geometric mean in log space for numerical stability
    geo_mean = np.exp(np.mean(np.log(probs)))

    # Apply the conjunction bonus (logarithmic scaling)
    conjunction_bonus = 1.0 + alpha * np.log(n)

    return geo_mean * conjunction_bonus
```
This approach effectively handles the conjunction fallacy while maintaining a probabilistic interpretation. The parameter `alpha` controls the strength of the conjunction bonus and can be tuned empirically.
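A quick worked example (the function is repeated here so the snippet is self-contained). Note one caveat: for long queries the bonus can push the value above 1, so strictly speaking the result is a ranking score rather than a calibrated probability:

```python
import numpy as np

def conjunction_score(term_probabilities, alpha=0.2):
    # Same formula as above: geometric mean times a log(n) bonus.
    n = len(term_probabilities)
    if n == 0:
        return 0.0
    geo_mean = np.exp(np.mean(np.log(term_probabilities)))
    return geo_mean * (1.0 + alpha * np.log(n))

# Three terms at 0.8: 0.8 * (1 + 0.2 * ln 3) ≈ 0.976
print(conjunction_score([0.8] * 3))
# Five terms at 0.8 now score higher, not lower: 0.8 * (1 + 0.2 * ln 5) ≈ 1.058
print(conjunction_score([0.8] * 5))
```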
4.2 Integration with BM25 Scoring
To integrate this approach with a BM25 scoring system that uses sigmoid transformation, we combine the individual term probabilities derived from BM25 scores:
```python
def sigmoid_transform(bm25_score, a=1.0, b=-1.0):
    """Transform a BM25 score into a probability using the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-(a * bm25_score + b)))

def bm25_conjunction_score(query_terms, document, index, a=1.0, b=-1.0, alpha=0.2):
    """Calculate the conjunction score from per-term BM25 scores."""
    term_probabilities = []
    for term in query_terms:
        bm25_score = compute_bm25_score(term, document, index)
        term_prob = sigmoid_transform(bm25_score, a, b)
        term_probabilities.append(term_prob)
    return conjunction_score(term_probabilities, alpha)
```
The parameters `a` and `b` control the sigmoid transformation, while `alpha` controls the conjunction bonus. These parameters can be learned from relevance judgments or set empirically.
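The roles of `a` and `b` are worth making concrete (the transform is repeated here so the snippet runs on its own; the example scores are arbitrary):

```python
import numpy as np

def sigmoid_transform(bm25_score, a=1.0, b=-1.0):
    # Same transform as above: b shifts the 0.5 crossover, a sets the slope.
    return 1.0 / (1.0 + np.exp(-(a * bm25_score + b)))

# With a=1, b=-1 the midpoint sits at BM25 score 1.0:
print(sigmoid_transform(1.0))         # 0.5
# A larger slope a saturates faster for the same score:
print(sigmoid_transform(3.0))         # ≈ 0.881
print(sigmoid_transform(3.0, a=2.0))  # ≈ 0.993
```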
5. Applications to Hybrid Search
5.1 Extending to Hybrid Retrieval Systems
The conjunction fallacy is particularly important in hybrid search systems that combine lexical and semantic retrievers. We can extend our probabilistic framework to handle both:
```python
def hybrid_conjunction_score(query, document, lexical_index, vector_index):
    # Get lexical (BM25) probabilities for each term
    lexical_probs = []
    for term in query.terms:
        bm25_score = lexical_index.get_score(term, document)
        term_prob = sigmoid_transform(bm25_score)
        lexical_probs.append(term_prob)

    # Get semantic similarity probability (calibrated separately)
    vector_score = vector_index.get_similarity(query.embedding, document.embedding)
    semantic_prob = sigmoid_transform(vector_score, a=2.0, b=-0.5)

    # Combine lexical term probabilities with conjunction handling
    lexical_combined = conjunction_score(lexical_probs, alpha=0.2)

    # Combine lexical and semantic scores with appropriate weights
    beta = 0.7  # Weight for the semantic score (0.3 for lexical)
    return beta * semantic_prob + (1 - beta) * lexical_combined
```
This approach addresses the conjunction fallacy in the lexical component while still leveraging semantic similarity.
5.2 Dynamic Weight Adjustment
We can further refine our hybrid approach by dynamically adjusting weights based on query characteristics:
```python
def adaptive_hybrid_score(query, document, lexical_index, vector_index):
    # Calculate the lexical conjunction score
    lexical_score = calculate_lexical_conjunction_score(query, document, lexical_index)

    # Calculate the semantic similarity score
    semantic_score = calculate_semantic_score(query, document, vector_index)

    # Dynamic weighting based on query characteristics
    if is_navigational_query(query):
        # Navigational queries favor exact matches
        beta = 0.3  # Lower weight for the semantic score
    elif has_rare_terms(query, lexical_index):
        # Queries with rare terms benefit from lexical matching
        beta = 0.5  # Balanced weights
    else:
        # General queries benefit more from semantic understanding
        beta = 0.7  # Higher weight for the semantic score

    return beta * semantic_score + (1 - beta) * lexical_score
```
This adaptive approach allows the system to emphasize lexical or semantic scores based on query properties, further improving performance.
6. Theoretical Analysis of the Enhanced Model
6.1 Information-Theoretic Guarantees
Our enhanced conjunction model can be analyzed from an information theory perspective. The geometric mean of probabilities corresponds to the arithmetic mean in log-probability space:
$$\log(P_{geo}) = \frac{1}{n}\sum_{i=1}^{n} \log(P(t_i|d))$$
This formulation distributes the "cost" of conjunction evenly across terms. The conjunction bonus adds a term that grows logarithmically with the number of terms:
$$\log(P_{bonus}) = \log(P_{geo}) + \log(1 + \alpha \cdot \log(n))$$
This logarithmic scaling ensures that the model remains well-behaved as the number of terms increases. The information gain from additional matching terms follows a diminishing returns pattern, aligning with human intuition.
6.2 Consistency with Probability Ranking Principle
Our enhanced model remains consistent with the Probability Ranking Principle (PRP), which states that documents should be ranked by their probability of relevance. The geometric mean and conjunction bonus preserve monotonicity in individual term probabilities while addressing the conjunction fallacy.
For any two documents $d_1$ and $d_2$ matching the same number of terms, the enhanced score is a monotone transform of the geometric mean of their term probabilities, so their relative order is preserved. The enhancement only affects comparisons between documents matching different numbers of terms, aligning the probabilistic ranking with user expectations.
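Both ordering properties can be verified with a few assertions (a sketch using the Section 4.1 formula, with arbitrary example probabilities):

```python
import numpy as np

def enhanced_score(probs, alpha=0.2):
    # Geometric mean plus log(n) conjunction bonus, as defined in Section 4.1.
    geo = np.exp(np.mean(np.log(probs)))
    return geo * (1.0 + alpha * np.log(len(probs)))

# Same term count: the order follows the term probabilities (PRP-consistent).
assert enhanced_score([0.9, 0.8, 0.7]) > enhanced_score([0.8, 0.7, 0.6])

# Different term counts: the document matching more terms is rewarded.
assert enhanced_score([0.8] * 5) > enhanced_score([0.8] * 3)
print("ordering checks pass")
```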
7. Conclusion and Future Directions
The conjunction fallacy represents a fundamental challenge in probabilistic information retrieval, particularly as systems move toward hybrid architectures combining multiple evidence sources. Our exploration has shown that this challenge can be effectively addressed through principled mathematical approaches that preserve the probabilistic foundation while aligning with user expectations.
The geometric mean approach with conjunction bonus offers a simple yet effective solution that:
- Prevents probabilities from converging to zero as term count increases
- Rewards documents matching more query terms
- Preserves the relative ordering of similarly matching documents
- Integrates seamlessly with both traditional and neural retrieval systems
Future research directions include:
- Learned conjunction models: Training conjunction models from relevance judgments rather than using fixed formulas
- Term-specific weighting: Incorporating term importance in the conjunction calculation
- Cross-encoder reranking: Using transformer-based rerankers to better model term dependencies
- User modeling: Adapting conjunction handling based on user behavior and preferences
By addressing the conjunction fallacy, we move closer to the theoretical ideal of the Probability Ranking Principle while improving practical search performance. This brings together the mathematical elegance of probabilistic models with the pragmatic demands of real-world information retrieval systems.