Historically, e-commerce has evolved from isolated product cards into complex decision-support tools. In the 2010s, comparison features emerged as a response to growing assortments and users' cognitive overload. However, classic correlational metrics linking comparison usage to larger order values have consistently faced an endogeneity problem: the feature is used by already motivated buyers with high purchase intent.
The measurement problem is threefold: self-selection on engagement (selection bias), staggered rollout across categories that breaks temporal alignment (staggered adoption), and network effects within a category, where comparison redirects demand from one SKU to another. Without controlling for these factors, an analyst obtains a biased estimate that overstates the effect for active users and ignores spillovers onto those who never touch the feature.
A rigorous solution combines Instrumental Variables (IV) with Difference-in-Differences (DiD). Quasi-random visibility of the comparison button serves as the instrument, obtained, for example, from an A/B test on the placement of the UI element or from exogenous factors such as screen resolution that influence visibility. This isolates variation that is independent of user intent. To control for temporal trends, staggered DiD is applied, comparing categories where the feature has already launched with those not yet affected, adjusting for cohort fixed effects. The key metric becomes the Local Average Treatment Effect (LATE): the effect for compliers, users who opened the comparison only because the button was visible, which yields a conservative but causally clean estimate.
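The IV logic can be sketched on synthetic data (all numbers here, including the seeded effect size and the strength of confounding, are illustrative assumptions, not the marketplace's figures): unobserved purchase intent inflates the naive user-vs-user comparison, while the Wald ratio built from a randomized visibility flag recovers the seeded effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unobserved purchase intent confounds both feature use and spend
intent = rng.normal(0, 1, n)
# Instrument: randomized visibility of the comparison button (0/1)
z = rng.integers(0, 2, n)
# Feature use is driven by both the instrument and intent
d = ((0.8 * z + intent + rng.normal(0, 1, n)) > 0.5).astype(float)
# Order value: the true incremental effect of the feature is +5 (assumed)
y = 100 + 5 * d + 20 * intent + rng.normal(0, 10, n)

# Naive difference of means: biased upward by intent
naive = y[d == 1].mean() - y[d == 0].mean()
# Wald/IV estimate of LATE: reduced-form effect over first-stage effect
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"naive diff: {naive:.1f}, IV (Wald/LATE): {wald:.1f}")
```

The naive estimate absorbs the full intent premium, while the Wald ratio uses only the variation induced by the random visibility flag, which is the same contrast the 2SLS regression in Option 4 exploits.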
Context: A large electronics marketplace launched a “Feature Comparison” function for smartphones and laptops. A month later, analytics showed that users who opened the comparison had an average order value (AOV) 40% higher, but they simultaneously viewed four times more pages before purchasing.
Option 1: Direct group comparison (t-test). The analyst simply compares average metrics of users with a flag “used comparison” against “did not use” in SQL. Pros: requires a single query, results in minutes. Cons: completely ignores self-selection; high engagement precedes the use of the feature rather than following from it; the estimate is biased upwards.
Option 2: Before/After time analysis. Comparison of metrics across the entire platform before and after the launch of the feature. Pros: simplicity of interpretation, general trends are visible. Cons: seasonality (the launch coincided with a new iPhone announcement), marketing campaigns, and overall business growth mask the true effect; it is impossible to separate the feature's influence from external shocks.
Option 3: Regression Discontinuity (RD). Using a threshold rule: the comparison button only appears after viewing three products in the same category. Pros: the sharp cutoff creates a quasi-experimental variation around the threshold. Cons: users manipulate behavior by opening empty tabs to reach the threshold; the fuzziness of the boundary violates RD assumptions.
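If the threshold rule were clean, the RD estimate would be the jump at the cutoff between local linear fits on each side. A minimal sketch on synthetic data (the browsing-depth slope and the +4 jump are assumed for illustration; in reality the manipulation and fuzziness described above would bias this):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
views = rng.integers(0, 8, n)              # products viewed in the category
button = (views >= 3).astype(float)        # the sharp threshold rule from the text
# AOV rises with browsing depth (+3 per view) plus a true +4 jump from the button
aov = 100 + 4 * button + 3 * views + rng.normal(0, 10, n)

# Local linear fit on each side of the cutoff, both evaluated at the boundary,
# so the smooth browsing-depth trend is netted out of the estimated jump
lo = views < 3
b_lo = np.polyval(np.polyfit(views[lo], aov[lo], 1), 2.5)
b_hi = np.polyval(np.polyfit(views[~lo], aov[~lo], 1), 2.5)
jump = b_hi - b_lo
print(f"estimated discontinuity: {jump:.1f}")
```

Note that a simple above-vs-below comparison of raw means would conflate the trend in `views` with the jump; the side-specific fits are what make the design quasi-experimental.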
Option 4: Instrumental Variables with UI test. An independent A/B test is conducted on the visibility of the button (brightness, size) that does not change functionality but influences the likelihood of a click. This test serves as an instrument for Two-Stage Least Squares (2SLS) regression. Pros: randomization ensures instrument exogeneity; the effect is measured precisely for those “forced” to compare due to button visibility. Cons: requires a large sample size for the strength of the instrument (first-stage F-statistic > 10); complexity of interpreting LATE for the business.
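The instrument-strength check mentioned above can be sketched as follows. With a single binary instrument, the first-stage F-statistic equals the squared t-statistic of the instrument in the regression of treatment on instrument (the click-rate lift of 20 percentage points is an illustrative assumption):

```python
import numpy as np

def first_stage_f(d, z):
    """First-stage F for a single binary instrument: F = t^2 of z
    in the regression d ~ z (difference-in-proportions form)."""
    n1, n0 = (z == 1).sum(), (z == 0).sum()
    slope = d[z == 1].mean() - d[z == 0].mean()
    # Pooled residual variance of the two-group regression
    resid_var = (d[z == 1].var(ddof=1) * (n1 - 1)
                 + d[z == 0].var(ddof=1) * (n0 - 1)) / (n1 + n0 - 2)
    se = np.sqrt(resid_var * (1 / n1 + 1 / n0))
    return (slope / se) ** 2

rng = np.random.default_rng(1)
n = 50_000
z = rng.integers(0, 2, n)
# Assumed: brighter button lifts the click-through rate from 35% to 55%
d = (rng.random(n) < 0.35 + 0.2 * z).astype(float)
f_stat = first_stage_f(d, z)
print(f"first-stage F: {f_stat:.0f}")  # compare against the rule-of-thumb 10
```

A weak visibility nudge (a lift of a fraction of a percentage point) would push F below 10, making the 2SLS estimate unreliable, which is exactly the sample-size caveat listed in the cons.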
Chosen solution and justification: a combination of Option 4 (primary) and Option 2 (robustness check). IV estimation provides a causal effect for marginal users, while DiD confirms the absence of global biases across categories. This approach allows for separating the effect of the feature from the users' inherent activity.
Final result: The true incremental effect on AOV was +8% (instead of the observed +40%), and the decision-making time did not change statistically significantly. The feature was retained, but the recommendation algorithm was adjusted not to show the comparison button to users with low historical engagement, where the effect is close to zero, reducing server load without loss of revenue.
How to correctly handle correlation of errors within sessions when analyzing the choice among multiple alternatives?
When a user compares products, their decisions on each SKU are correlated within a single session, violating the independence of observations (i.i.d.) assumption. Standard errors of estimates will be underestimated, leading to false positive conclusions about the significance of the effect. To correct this, it is necessary to use clustered standard errors at the user or session level, or to apply hierarchical linear modeling (HLM). This is particularly critical when working with panel data, where one user generates multiple comparisons, and ignoring clustering can inflate the t-statistic by 2-3 times.
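A minimal numpy sketch of cluster-robust (CR1 / Liang-Zeger) standard errors, assuming a synthetic dataset with a shared within-session shock; the cluster counts and variances are illustrative:

```python
import numpy as np

def ols_clustered_se(X, y, clusters):
    """OLS point estimates with classical and CR1 cluster-robust SEs.
    X must include the intercept column."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((k, k))
    for g in np.unique(clusters):            # sum of per-cluster score outer products
        s = X[clusters == g].T @ resid[clusters == g]
        meat += np.outer(s, s)
    n_g = np.unique(clusters).size
    corr = n_g / (n_g - 1) * (n - 1) / (n - k)   # CR1 finite-sample correction
    se_cluster = np.sqrt(np.diag(corr * XtX_inv @ meat @ XtX_inv))
    se_classic = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (n - k))
    return beta, se_classic, se_cluster

rng = np.random.default_rng(2)
n_sessions, per = 500, 20                    # 500 sessions, 20 decisions each
clusters = np.repeat(np.arange(n_sessions), per)
x = rng.normal(0, 1, n_sessions * per)
shock = np.repeat(rng.normal(0, 1, n_sessions), per)  # shared within-session component
y = 1.0 + 0.5 * x + shock + rng.normal(0, 1, n_sessions * per)

X = np.column_stack([np.ones(n_sessions * per), x])
beta, se_c, se_cl = ols_clustered_se(X, y, clusters)
print(f"intercept SE: classic {se_c[0]:.4f} vs clustered {se_cl[0]:.4f}")
```

On this data the clustered intercept SE is several times the classical one, which is the inflation of the t-statistic the paragraph warns about; in production one would typically reach for `statsmodels` with `cov_type="cluster"` rather than hand-rolling the sandwich.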
How to measure negative spillover effects on products that were not included in the comparison set?
The comparison function may cannibalize sales of products that were not added to the comparison list but are close substitutes. Analysts often look only at the SKU level within the basket, overlooking the overall category equilibrium. To assess such effects, analyze aggregated metrics at the category level (category-level DiD) and control for inventory levels. If comparisons redirect demand to specific models and cause shortages, the observed sales growth of competitors within the comparison set may be an artifact of stock-outs rather than user preference.
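The category-level check reduces, in the simplest case, to a canonical 2x2 difference-in-differences on aggregated sales. A sketch on synthetic data (the +5 common trend and +2 treatment effect are assumed for illustration):

```python
import numpy as np

def did_2x2(sales, treated, post):
    """Canonical 2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    t, p = np.asarray(treated, bool), np.asarray(post, bool)
    return ((sales[t & p].mean() - sales[t & ~p].mean())
            - (sales[~t & p].mean() - sales[~t & ~p].mean()))

rng = np.random.default_rng(3)
n = 4_000
treated = rng.integers(0, 2, n).astype(bool)   # category where the feature launched
post = rng.integers(0, 2, n).astype(bool)      # observation after the launch date
# Common growth of +5 hits everyone post-launch; the true category effect is +2
sales = 100 + 5 * post + 2 * (treated & post) + rng.normal(0, 3, n)

effect = did_2x2(sales, treated, post)
print(f"DiD effect: {effect:.2f}")
```

The common +5 trend, which a naive before/after comparison would attribute to the feature, is differenced out, leaving only the category-level treatment effect.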
How to separate the effect of feature implementation from the effects of user learning (learning-by-doing) and novelty (novelty effect)?
Users discovering the new feature simultaneously accumulate experience using the platform, which separately influences conversion. Beginning analysts often interpret metric growth among early adopters as the pure effect of the product. To separate these effects, it is necessary to include user tenure fixed effects or limit the sample to users with the same number of historical sessions. Alternatively, cohort analysis can be used, comparing new users, for whom the feature is available from day one, with cohorts “before the launch,” adjusting for calendar time, thereby isolating the influence of experience from the influence of the comparison tool.
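The tenure-fixed-effects adjustment can be sketched with a within transformation: demean outcome and exposure inside each tenure bucket, then regress the residuals. All coefficients in the simulation are illustrative assumptions:

```python
import numpy as np

def demean_by(x, groups):
    """One-way fixed effects via the within transformation: subtract group means."""
    out = x.astype(float).copy()
    for g in np.unique(groups):
        m = groups == g
        out[m] -= out[m].mean()
    return out

rng = np.random.default_rng(4)
n = 20_000
tenure = rng.integers(0, 10, n)                 # historical sessions, 0..9
# Assumed: tenure drives both feature adoption and conversion (learning-by-doing)
feature = (rng.random(n) < 0.2 + 0.05 * tenure).astype(float)
conv = 0.1 * feature + 0.03 * tenure + rng.normal(0, 0.5, n)

# Naive slope pools the learning effect into the feature estimate
naive = np.polyfit(feature, conv, 1)[0]
# Within-tenure slope compares only users with the same experience level
f_w, c_w = demean_by(feature, tenure), demean_by(conv, tenure)
within = (f_w @ c_w) / (f_w @ f_w)
print(f"naive: {naive:.3f}, tenure-FE: {within:.3f}")
```

The naive slope absorbs part of the learning-by-doing gradient, while the within estimator recovers the seeded feature effect, mirroring the sample-restriction strategy described above.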