The problem of assessing search-result quality is tied to a fundamental observation paradox: we only see clicks on positions that users actually viewed, and the probability of viewing falls off steeply with rank. The classical work of Joachims et al. on position bias and of Richardson et al. on the examination hypothesis laid the foundation for understanding that a click does not equal relevance. In product analytics this means the analyst must separate true user preference from interface artifacts, especially when a ranking change hits the entire user base at once.
During a global search-engine update, observed metrics (CTR, browsing depth, conversion) shift under two confounders at once: the change in document order and the change in each document's viewing probability. With no way to split users into control and test groups, classic A/B testing is impossible, while seasonal fluctuations generate temporal trends that correlate with the release moment. The analyst's task is to isolate the pure ranking effect from this noise with limited data.
The optimal approach combines quasi-experimental methods with bias corrections. At the first stage, Difference-in-Differences with a synthetic control is applied: a weighted combination of historical periods or product segments is constructed to minimize the pre-treatment error in predicting the metric. To correct for position bias, Inverse Propensity Weighting (IPW) is used, with propensity scores estimated either as position-examination probabilities from past randomized logs or via the Expectation-Maximization algorithm under the assumptions of an examination-based click model (e.g., the cascade model). For non-linear effects, Causal Forests are additionally applied, which model effect heterogeneity across product categories and user segments.
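As a minimal sketch of the IPW step, the following contrasts a naive CTR with a propensity-weighted (self-normalized) estimate. All numbers are purely illustrative; in practice the per-position examination probabilities would come from historical randomized logs or a fitted click model.

```python
import numpy as np

# Hypothetical logged impressions: position shown and whether it was clicked.
positions = np.array([1, 1, 2, 2, 3, 3, 4, 4])
clicks    = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Examination propensities per position (illustrative values; assume they
# were estimated from past randomized logs).
exam_prob = {1: 0.95, 2: 0.60, 3: 0.35, 4: 0.20}
w = np.array([1.0 / exam_prob[p] for p in positions])

# Naive CTR vs. a self-normalized IPW estimate: each click is up-weighted
# by the inverse probability that its position was examined at all.
naive_ctr = clicks.mean()
ipw_ctr = (w * clicks).sum() / w.sum()
```

Because clicks here are concentrated at top positions that are almost always examined, the IPW estimate comes out below the naive CTR: part of the raw click mass is an exposure artifact, not relevance.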
In an electronics marketplace, the search team replaced BM25 with a neural BERT-based ranker optimized for margin. Two weeks after the release, GMV per search session had grown by 18%, but browsing depth had dropped by 25%. The business could not tell whether the growth came from the algorithm or from a sale that started at the same time as the release, and also worried about degraded user experience on the long tail of queries.
The first option considered was a simple before/after comparison of metrics using a t-test. Its advantages were speed and no need for complex infrastructure. The downsides were equally clear: it cannot separate the seasonal effect of the sale from the algorithm's effect, it ignores position bias (the new algorithm might rank more expensive products higher simply because they generate more revenue, not because they are more relevant), and it fails to account for the overall inflation of demand during promotions.
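To make the flaw concrete, here is a sketch of the naive test on synthetic daily data: the t-test duly reports a highly significant shift, but nothing in it attributes that shift to the ranker rather than the concurrent sale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative daily GMV-per-session, 14 days before vs. 14 days after the
# release (synthetic numbers; the +18% lift has an unknown cause).
before = rng.normal(loc=100, scale=5, size=14)
after  = rng.normal(loc=118, scale=5, size=14)

# Welch's t-test on the two windows.
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
# A tiny p-value only says the mean moved; it cannot separate the
# algorithm's effect from seasonality or promotion-driven demand.
```

This is exactly why a significant p-value here is necessary but nowhere near sufficient for a causal claim.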
The second option was an Interrupted Time Series (ITS) analysis with seasonal decomposition via Prophet or SARIMA. This would capture trends and seasonality by constructing a counterfactual forecast of the metrics without the release. The pros included statistical rigor and the ability to model autocorrelations. The cons lay in sensitivity to the choice of break point (problematic if the rollout was gradual), the difficulty of interpreting the coefficients for the business, and the assumption of linear trends, which is often violated in e-commerce during mass promotional campaigns.
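The ITS idea can be sketched without Prophet or SARIMA: fit a trend-plus-seasonality model on the pre-period only, extrapolate it as the "no release" counterfactual, and read the effect off the post-period gap. The example below uses plain OLS with a linear trend and day-of-week dummies on synthetic data with an injected lift of 10.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(56)                      # 8 weeks; release on day 28
dow = days % 7
# Synthetic metric: linear trend + weekly seasonality + noise.
y = (100 + 0.2 * days + 3 * np.sin(2 * np.pi * dow / 7)
     + rng.normal(0, 1, 56))
y[28:] += 10                              # injected post-release lift

pre = days < 28
# Design matrix: intercept, linear trend, 6 day-of-week dummies.
X = np.column_stack([np.ones(56), days]
                    + [(dow == d).astype(float) for d in range(1, 7)])
beta, *_ = np.linalg.lstsq(X[pre], y[pre], rcond=None)

counterfactual = X @ beta                 # extrapolated "no release" path
effect = (y - counterfactual)[~pre].mean()  # recovers roughly +10
```

Prophet or SARIMA would replace the OLS line with a richer forecast (changepoints, autocorrelation), but the counterfactual-gap logic is the same.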
The third option was to develop a Synthetic Control Method at the category level: creating a weighted basket from untouched queries or categories where the algorithm had not changed (e.g., due to technical limitations in certain locales) as a control group for comparison. The advantages included visual clarity and intuitiveness for stakeholders, as well as less sensitivity to assumptions about the distribution of errors. Disadvantages included the need to identify suitable control units with similar dynamics (which is difficult during a global release) and the risk of overfitting when selecting weights.
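The weight-fitting step of the synthetic control can be sketched as follows, on synthetic data. As a simplification of the fully constrained program, this uses non-negative least squares followed by a sum-to-one normalization; dedicated packages solve the constrained problem jointly.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
T_pre, n_controls = 30, 5
# Pre-treatment metric paths of untouched categories (synthetic numbers),
# and a treated category that is, by construction, a mix of three of them.
controls = rng.normal(100, 5, size=(T_pre, n_controls))
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = controls @ true_w + rng.normal(0, 0.5, T_pre)

# Non-negative least squares, then normalize so the weights sum to 1.
w, _ = nnls(controls, treated)
w = w / w.sum()

# The fitted basket tracks the treated unit in the pre-period; applying
# the same weights to post-treatment control paths gives the counterfactual.
synthetic = controls @ w
```

The overfitting risk mentioned above shows up here as too many donors relative to pre-period length; regularization or donor-pool pruning is the usual remedy.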
Ultimately, a hybrid methodology was selected: Diff-in-Diff with a synthetic control at the product-category level, combined with an IPW adjustment for position exposure. This made it possible to separate the ranking effect from seasonal spikes and to correct for the distortion caused by more expensive products being shown at the top more often. The choice was driven by the need to account for both the temporal structure of the data and structural biases in exposure.
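Once the synthetic control basket and the IPW-adjusted metrics are in hand, the final Diff-in-Diff arithmetic is simple. A sketch with illustrative numbers mirroring the case:

```python
# Mean GMV-per-session (illustrative): "treated" = categories on the new
# ranker, "control" = the synthetic basket; "pre"/"post" the release date.
means = {("treated", "pre"): 100.0, ("treated", "post"): 118.0,
         ("control", "pre"):  99.0, ("control", "post"): 103.0}

# Difference-in-Differences: subtract the control group's change (seasonality,
# sale-driven demand common to both groups) from the treated group's change.
did = ((means[("treated", "post")] - means[("treated", "pre")])
       - (means[("control", "post")] - means[("control", "pre")]))
# 18 points of raw lift minus 4 points of common shock -> 14 points
# attributable to the ranker, under the parallel-trends assumption.
```

The validity of the subtraction rests entirely on the parallel-trends assumption discussed in the Q&A below.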
The analysis established that 14 percentage points of the 18% GMV lift were explained by the algorithm, with the remaining 4 points due to seasonality. It also found that for head queries (the top 20% by frequency) conversion grew by 22%, while for tail queries it fell by 15%, partially offset by a higher average order value. This led to a hybrid scheme: the neural ranker for popular queries and the classic one for rare queries, which balanced the metrics.
How do you correctly account for position bias in the absence of a randomized experiment?
Without dedicated randomized displays, propensities can be estimated with the Expectation-Maximization algorithm under the assumption click = examination × relevance. Candidates often suggest simply adding position as a regression feature, but this ignores the non-linear interaction between position and relevance. The correct approach is to use click models (the Cascade Model, DCM — the Dependent Click Model, or DBN — the Dynamic Bayesian Network model) to estimate the examination probability, and then weight observations inversely proportionally to it (IPW). Without this, the estimate of the ranking effect will be biased toward documents shown at the top.
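A sketch of EM under the simplest such model, the Position-Based Model (click = examination × relevance), on simulated logs. Note that examination and relevance parameters are identified only up to a common scale factor, so their product is the quantity to trust.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground truth for the Position-Based Model (illustrative values).
true_gamma = np.array([0.9, 0.5, 0.2])   # examination prob per position
true_alpha = np.array([0.8, 0.3])        # relevance per (query, doc) pair

# Simulate impressions: each doc shown at a random position.
n = 20000
doc = rng.integers(0, 2, n)
pos = rng.integers(0, 3, n)
click = (rng.random(n) < true_gamma[pos] * true_alpha[doc]).astype(float)

gamma = np.full(3, 0.5)
alpha = np.full(2, 0.5)
for _ in range(50):
    p_click = gamma[pos] * alpha[doc]
    # E-step: posterior that the position was examined / the doc was
    # relevant, given that no click was observed (a click implies both).
    exam_given_no = gamma[pos] * (1 - alpha[doc]) / (1 - p_click)
    rel_given_no = alpha[doc] * (1 - gamma[pos]) / (1 - p_click)
    e_exam = click + (1 - click) * exam_given_no
    e_rel = click + (1 - click) * rel_given_no
    # M-step: average the posteriors per position / per document.
    gamma = np.array([e_exam[pos == p].mean() for p in range(3)])
    alpha = np.array([e_rel[doc == d].mean() for d in range(2)])
```

The fitted `gamma` then supplies the examination propensities for the IPW weighting described above; the cascade and DBN models differ only in how examination depends on the clicks above.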
Why does simply comparing clicks before and after the algorithm change yield a biased estimate even after accounting for seasonality?
Beyond position bias, there are exploration-vs-exploitation effects and user learning. The new algorithm may explore less, serving more predictable results, which lowers engagement in the short term. Users may also adapt to the new result structure and change their scrolling behavior, which violates the stationarity assumptions of time-series analysis. Candidates often overlook the need to check the parallel-trends assumption of Diff-in-Diff on pre-period data and the importance of aggregation lags (you cannot compare day to day because of day-of-week effects; at least weekly aggregation is required).
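A minimal sketch of the parallel-trends check on weekly pre-period aggregates (synthetic data): regress the metric on time, group, and their interaction; an interaction coefficient near zero is consistent with parallel pre-trends.

```python
import numpy as np

rng = np.random.default_rng(4)
weeks = np.arange(8)   # weekly aggregation of the pre-period
# Illustrative pre-release paths for treated and control segments sharing
# a common slope (1.0 per week) but different levels.
treated = 100 + 1.0 * weeks + rng.normal(0, 0.5, 8)
control = 90 + 1.0 * weeks + rng.normal(0, 0.5, 8)

t = np.concatenate([weeks, weeks]).astype(float)
g = np.concatenate([np.ones(8), np.zeros(8)])   # 1 = treated segment
y = np.concatenate([treated, control])

# OLS with a group x time interaction; the interaction coefficient is the
# difference in pre-period slopes between the groups.
X = np.column_stack([np.ones(16), t, g, g * t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
interaction = beta[3]   # should be statistically indistinguishable from 0
```

In practice one would also report a confidence interval for `interaction` rather than eyeballing the point estimate.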
How can the effect of improved query-item matching be distinguished from the effect of a changed assortment composition in the top search results?
This distinction is critical for understanding the long-term impact on LTV. If the new algorithm simply shifts results toward more expensive products (an assortment shift) rather than better understanding user intent (a relevance improvement), the conversion gain may be short-lived, driven by a novelty effect. To separate the two, use Causal Forests or meta-learners (S-Learner, T-Learner) with fixed product effects, comparing the same item at different positions before and after the change. If the effect appears only through a shift in the composition of the top (e.g., the disappearance of budget options), it calls for a different product response than if CTR improved at fixed positions for the same item.
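Setting the full meta-learner machinery aside, the matching-vs-composition question can be illustrated with a simple Oaxaca-style decomposition of the aggregate CTR change into a fixed-composition (matching) term and a composition-shift term. All numbers are illustrative; this is a didactic sketch, not a substitute for Causal Forests.

```python
import numpy as np

# Illustrative per-item CTR at a FIXED position (e.g. rank 1), before and
# after the ranker change, plus each item's share of that position's exposure.
ctr_pre   = np.array([0.10, 0.12, 0.08])   # budget phone, flagship, headphones
ctr_post  = np.array([0.10, 0.12, 0.08])   # unchanged at fixed position
share_pre  = np.array([0.5, 0.3, 0.2])
share_post = np.array([0.1, 0.8, 0.1])     # exposure shifted to the flagship

# Decompose the observed aggregate CTR change:
#   matching term    — CTR change holding the item mix fixed (relevance),
#   composition term — change driven purely by the new item mix.
matching_effect = (share_pre * (ctr_post - ctr_pre)).sum()
composition_effect = ((share_post - share_pre) * ctr_post).sum()
total_change = (share_post * ctr_post).sum() - (share_pre * ctr_pre).sum()
# Here the matching term is zero: the entire lift is an assortment shift,
# exactly the case that warrants a different product response.
```

In a real analysis the per-item, per-position CTRs would themselves need the position-bias correction from above before being decomposed.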