The historical context of the problem lies in the evolution of user content in e-commerce. In the early days of digital commerce, professionally written descriptions dominated; with the rise of Web 2.0, the shift to user-generated content (UGC) increased trust but produced information overload. A modern user faces dozens of reviews per product, which raises cognitive load and lengthens decision-making. Large Language Models (LLMs) have made it possible to automate summarization, but replacing the consumer's authentic voice with a machine interpretation introduces uncertainty into the causal link between the displayed information and user behavior.
The problem is complicated by three factors that make a classic A/B test infeasible. First, the phased rollout by category creates staggered adoption: control groups gradually become treatment groups, undermining the stability of comparisons. Second, the quality of AI summarization is endogenous: categories with a high volume of reviews receive accurate badges while low-volume categories receive distorted ones, so quality correlates with product popularity, a hidden confounder. Third, there is a risk of a deception effect: if a user discovers a mismatch between a badge and the actual product, trust in the platform declines, hurting long-term retention, which can only be measured through cohort analysis.
A thorough solution requires combining quasi-experimental methods. The main tool is staggered Difference-in-Differences (DiD) with category and time fixed effects, which captures the effect under gradual rollout. To account for the endogeneity of generation quality, a Causal Forest models how the impact varies with the volume of input data. It is critical to run placebo tests on untouched categories to validate the parallel-trends assumption, and to use survival analysis to track the dynamics of returns over time, separating short-term conversion effects from long-term trust effects.
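The staggered DiD with two-way fixed effects can be sketched on simulated data. Everything below (effect size, adoption schedule, noise levels) is invented for illustration; a real analysis would run on the platform's panel of category-by-month conversion rates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cat, n_t = 20, 12                             # categories x months
adopt = rng.integers(4, 10, size=n_cat)         # staggered adoption month per category
adopt[:5] = 99                                  # five never-treated categories as clean controls
tau = 0.02                                      # true (simulated) effect of AI badges

cat_fe = rng.normal(0.10, 0.03, n_cat)          # category baselines (conversion levels)
time_fe = rng.normal(0.00, 0.01, n_t)           # common time shocks (seasonality)

rows = []
for c in range(n_cat):
    for t in range(n_t):
        d = 1.0 if t >= adopt[c] else 0.0       # badge is live in category c at month t
        y = cat_fe[c] + time_fe[t] + tau * d + rng.normal(0, 0.005)
        rows.append((c, t, d, y))
rows = np.array(rows)

# TWFE regression: y ~ category dummies + month dummies + treatment indicator
C = np.eye(n_cat)[rows[:, 0].astype(int)]       # category fixed effects
T = np.eye(n_t)[rows[:, 1].astype(int)][:, 1:]  # month fixed effects (one dropped)
X = np.hstack([C, T, rows[:, 2:3]])
beta, *_ = np.linalg.lstsq(X, rows[:, 3], rcond=None)
tau_hat = beta[-1]                              # estimated badge effect, ~0.02 here
```

Note that under staggered adoption with heterogeneous dynamic effects, plain TWFE can be biased; robust estimators in the Callaway and Sant'Anna style address this, which is worth flagging before trusting a single pooled coefficient.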
The marketplace “DomashniyUyut”, specializing in furniture and decor, faced a critical decline in engagement on product pages: 68% of users never reached the text-review section, missing important information about assembly quality and materials. The product team proposed replacing lengthy comments with visual AI badges that summarize the key points. Stakeholders, however, feared hidden degradation of trust metrics and a rise in returns driven by model “hallucinations.” Analysts were tasked with measuring the net causal effect of the rollout in the absence of a feasible classic split test on users.
The first option was a classic A/B test randomized at the user level by hashing user_id. Its advantages were strict causal identification and straightforward statistical analysis via a standard t-test or bootstrap. Its drawbacks proved critical for the product: users actively shared product screenshots on social media, contaminating the groups, and showing the same product differently to different users broke UX consistency and introduced cognitive dissonance.
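Deterministic user-level bucketing via a salted hash is typically a few lines. This is a minimal sketch; the salt name and bucket count are illustrative assumptions:

```python
import hashlib

def ab_bucket(user_id: str, salt: str = "ai_badges_v1", n_buckets: int = 2) -> int:
    """Deterministically map a user to an experiment bucket via a salted hash."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets          # stable across sessions and devices

# Roughly even split over many users; a per-experiment salt isolates assignments
assignments = [ab_bucket(f"user_{i}") for i in range(10_000)]
treated_share = sum(assignments) / len(assignments)
```

Changing the salt reshuffles everyone, which is why each experiment gets its own salt rather than hashing raw user_id.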
The second option was based on the Synthetic Control Method: for each category receiving AI badges, a weighted synthetic control would be assembled from untouched categories with similar historical conversion trends and seasonality. Its key advantages were that users perceive nothing unusual and no traffic split is needed, preserving the integrity of the user experience. Its substantial drawbacks were the impossibility of building a credible control for unique categories such as “smart refrigerators” that have no direct analogs, and the risk of bias from global shocks hitting all categories simultaneously.
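Fitting synthetic-control weights is a least-squares problem over the simplex (non-negative weights summing to one). A minimal sketch with a simple Frank-Wolfe loop on fabricated donor paths, assuming the treated category's pre-period path is close to a mix of donors:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pre = 24                                       # pre-treatment months
donors = rng.normal(0.10, 0.02, (n_pre, 5))      # conversion paths of 5 untouched categories
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])     # treated path is (almost) a donor mix
treated = donors @ true_w + rng.normal(0, 0.002, n_pre)

# Minimize ||treated - donors @ w||^2 over the simplex (w >= 0, sum w = 1)
# via Frank-Wolfe: step toward the vertex with the smallest gradient component.
w = np.full(5, 0.2)
for k in range(2000):
    grad = donors.T @ (donors @ w - treated)
    j = np.argmin(grad)
    gamma = 2 / (k + 2)
    w = (1 - gamma) * w
    w[j] += gamma

synthetic = donors @ w                           # synthetic control path
pre_rmse = np.sqrt(np.mean((treated - synthetic) ** 2))
```

The full method also matches on covariates and reads the effect off the post-period gap between the treated and synthetic paths; a poor pre-period fit (high pre_rmse) is exactly the “no credible control” failure mode described above.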
The optimal solution combined staggered Difference-in-Differences with Two-Way Fixed Effects (TWFE) and a Causal Forest for analyzing effect heterogeneity by input-data volume. This approach used the natural order of the phased rollout (mass-market electronics first, then furniture) as a source of exogenous variation while controlling for category and time fixed effects. The decisive factor was the ability to model differing impacts for high-traffic categories with accurate summaries versus niche categories prone to LLM “hallucinations,” which provided a strategic advantage for scaling decisions.
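A full Causal Forest (e.g., via the econml library) is beyond a short sketch, but a transparent stand-in, OLS with a treatment-by-log-volume interaction, illustrates the same idea of an effect that varies with review volume. All numbers are simulated; the forest's advantage is discovering such moderators without specifying them in advance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
reviews = rng.integers(1, 200, n)                # review volume per product
d = rng.integers(0, 2, n).astype(float)          # badge shown (randomized for simplicity)
tau = 0.03 * (np.log(reviews) - np.log(10))      # simulated effect: negative below ~10 reviews
y = 0.10 + tau * d + rng.normal(0, 0.05, n)      # conversion outcome

# OLS with a treatment x log-volume interaction as a simple CATE model
x = np.log(reviews)
X = np.column_stack([np.ones(n), x, d, d * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def effect_at(volume):
    """Estimated treatment effect for a product with the given review volume."""
    return beta[2] + beta[3] * np.log(volume)
```

The sign flip of `effect_at` around a volume threshold is the heterogeneity pattern that drove the decision to disable badges in low-data segments.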
The final implementation revealed pronounced heterogeneity: in categories with more than 50 reviews, conversion rose by 12% thanks to reduced cognitive load, and returns fell by 3% thanks to accurate transmission of key characteristics. Conversely, in niche categories with fewer than 10 reviews, return rates increased by 8% due to discrepancies between generated badges and actual product quality, leading to a decision to disable AI summaries entirely for segments with insufficient data volume. As a result, the platform saw a neutral effect on overall GMV but significantly improved user-experience quality and reduced the operational costs of processing returns in high-traffic categories.
Endogeneity of Generation Quality as a Confounder
Candidates often treat the badge rollout as a binary treatment, overlooking that the effectiveness of LLM summarization is a continuous function of input-review volume rather than a constant. In reality, categories with high conversion attract more reviews in the first place, creating reverse causality: popularity → data volume → AI quality → observed conversion growth, which is then mistakenly attributed solely to the visual badges. A correct approach uses instrumental variables, such as product age as an instrument for review volume, or a Regression Discontinuity design at the review-count threshold, to isolate the pure effect of generation quality from the effect of category popularity.
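A sharp RDD at a review-count threshold reduces to two local linear fits around the cutoff. The threshold, bandwidth, and jump size below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
reviews = rng.integers(1, 101, n)                # running variable: review count
cutoff, jump = 50, 0.02                          # hypothetical threshold and true discontinuity
y = 0.08 + 0.0004 * reviews + jump * (reviews >= cutoff) + rng.normal(0, 0.03, n)

h = 20                                           # bandwidth around the cutoff

def boundary_fit(mask):
    """Local linear fit; the intercept predicts the outcome at the cutoff."""
    x = reviews[mask] - cutoff
    X = np.column_stack([np.ones(x.size), x])
    b, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return b[0]

left = (reviews >= cutoff - h) & (reviews < cutoff)
right = (reviews >= cutoff) & (reviews < cutoff + h)
jump_hat = boundary_fit(right) - boundary_fit(left)   # estimated discontinuity
```

In practice the bandwidth would be chosen by a data-driven rule and products just below the cutoff checked for manipulation (e.g., sellers soliciting reviews to cross the threshold).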
Cross-category Spillovers and Attention Substitution
Candidates rarely consider that users compare products across categories within a single session, creating cross-category spillovers. If attractive AI badges appear in “Smartphones” while “Cases” keeps traditional text blocks, the resulting information asymmetry shifts demand toward the test category not because UX improved, but because attention was substituted. A proper assessment must include cross-category effects in the model via spatial econometrics, or analyze changes in the category's share of wallet within the user's overall orders rather than within-category conversion alone.
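Share of wallet is straightforward to compute from order lines; a minimal sketch on hypothetical data:

```python
from collections import defaultdict

# Hypothetical order lines: (user_id, category, order value)
orders = [
    ("u1", "Smartphones", 500.0), ("u1", "Cases", 30.0),
    ("u2", "Smartphones", 700.0), ("u2", "Cases", 10.0), ("u2", "Decor", 90.0),
]

def share_of_wallet(order_lines, category):
    """Per-user share of spend going to the category, averaged across users."""
    total, in_cat = defaultdict(float), defaultdict(float)
    for user, cat, value in order_lines:
        total[user] += value
        if cat == category:
            in_cat[user] += value
    shares = [in_cat[u] / total[u] for u in total]
    return sum(shares) / len(shares)
```

Comparing this metric before and after the rollout captures demand shifted between categories, which within-category conversion misses entirely.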
Dynamic Exposure Effect and Learning Curve
Junior analysts fixate on static effects in short observation windows, missing that the perception of AI content changes as users accumulate experience. Early users perceive badges as objective aggregates, but after the first return of a product with a deceptive badge, AI skepticism forms, and the positive effect fades or even turns negative. Identifying this pattern requires an event study with lags and leads, plus segmentation by the user's “age” relative to first contact with AI content, which allows constructing a learning curve and forecasting the long-term sustainability of the effect.
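An event study reduces to a regression on event-time dummies relative to adoption, with k = -1 conventionally omitted as the baseline. The dynamic effect below is simulated to fade over time, mimicking the learning-curve pattern described above:

```python
import numpy as np

rng = np.random.default_rng(4)
n_cat, n_t = 30, 16
adopt = rng.integers(5, 11, n_cat)               # adoption month per category

def true_eff(k):
    """Simulated dynamic effect: +2pp on impact, fading 0.5pp per month."""
    return max(0.0, 0.02 - 0.005 * k) if k >= 0 else 0.0

rows = []
for c in range(n_cat):
    for t in range(n_t):
        k = t - adopt[c]                         # event time relative to adoption
        rows.append((c, t, k, 0.10 + true_eff(k) + 0.003 * rng.standard_normal()))
rows = np.array(rows)

# Event-study regression: dummies for k in [-3, 5], with k = -1 as the baseline
ks = [k for k in range(-3, 6) if k != -1]
K = np.column_stack([(rows[:, 2] == k).astype(float) for k in ks])
C = np.eye(n_cat)[rows[:, 0].astype(int)]        # category fixed effects
T = np.eye(n_t)[rows[:, 1].astype(int)][:, 1:]   # month fixed effects
X = np.hstack([C, T, K])
beta, *_ = np.linalg.lstsq(X, rows[:, 3], rcond=None)
eff = dict(zip(ks, beta[-len(ks):]))             # estimated effect at each event time
```

Near-zero pre-period coefficients (k < 0) support parallel trends, while the declining post-period profile is the learning curve that forecasts whether the effect survives long term.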