Product Analytics (IT): Product Analyst

What method should be used to estimate the causal effect of completely replacing the static homepage of an e-commerce platform with an ML-personalized recommendation feed, given that the technical architecture rules out A/B testing because of CDN-level caching, and that user behavior is subject to a novelty effect and differs between cold audiences with no history and active users with a rich interaction history?


Answer to the question

Content personalization has become an integral part of modern e-commerce platforms since the mid-2010s, when Amazon and Netflix demonstrated the economic viability of investments in recommendation systems. Classic approaches to assessing effectiveness typically involve conducting controlled experiments; however, real-world infrastructure often presents technical limitations that render standard A/B testing impossible without degrading performance.

The analyst's task is to isolate the true effect of implementing an ML recommendation system on key product metrics in the absence of a control group. Three confounding factors need to be considered: the lag in training the model for cold users (the cold start problem), the short-term spike in activity caused by the interface change (the novelty effect), and systematic differences between cohorts of new and returning users, which create selection bias.

The optimal approach combines the Difference-in-Differences (DiD) method and the Synthetic Control Method. The control group is formed by a cohort of new users who registered after the change, adjusted for differences in baseline characteristics through propensity scoring. To account for cold start, the analysis is stratified by user tenure with separate modeling of the algorithm's learning curve. The novelty effect is isolated through the analysis of metric dynamics in the first 14 days post-release, followed by a comparison with a stationary period. Additionally, the triple difference approach is employed, utilizing geographical regions with varying adoption rates as a natural experiment.
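The core DiD logic can be sketched in a few lines. The data below is simulated and every number is hypothetical, chosen only to show that the difference of before/after changes between cohorts recovers a planted effect:

```python
# Minimal DiD sketch on simulated data (all numbers are hypothetical).
# Estimate = (treated post - treated pre) - (control post - control pre).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # 1 = cohort exposed to the new feed
    "post": rng.integers(0, 2, n),     # 1 = observation after the release
})
# Simulated conversion with a true interaction (causal) effect of +5 pp
p = 0.10 + 0.02 * df["treated"] + 0.03 * df["post"] + 0.05 * df["treated"] * df["post"]
df["converted"] = rng.binomial(1, p)

cell = df.groupby(["treated", "post"])["converted"].mean()
did = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])
print(round(did, 3))  # close to the planted 0.05 on a sample this size
```

In practice the same estimate is usually obtained from a regression with a treated-by-post interaction and cluster-robust standard errors, which also makes it easy to add covariates.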

Real-life situation

At a large fashion marketplace, the plan was to replace the static homepage, which featured a manually curated selection of trends, with a dynamic feed generated by an ML model based on collaborative filtering. The technical team reported that, because of the Edge Cache settings in Cloudflare, it was impossible to split traffic at the user level without significant performance degradation and violations of the response-time SLA. The release had to ship simultaneously for all users during the peak season (November), which further complicated the assessment: Black Friday and the pre-holiday rush distorted historical behavior patterns.

The first approach involved a simple before/after analysis adjusted for seasonality using indices from previous years. This method was operationally simple and did not require complex data infrastructure, but it critically depended on the assumption of unchanged baseline trends between periods. In a growing e-commerce market, this led to overestimating the effect by 40-60% because of macroeconomic factors and demand inflation.
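A seasonal-index adjustment of a naive before/after comparison looks like this; all conversion numbers are invented, purely to illustrate why the naive uplift overstates the effect:

```python
# Before/after comparison adjusted by last year's seasonal index
# (hypothetical conversion rates). The naive uplift conflates the release
# effect with seasonal growth; dividing by last year's November/October
# ratio strips the seasonal component.
conv_oct_this, conv_nov_this = 0.050, 0.066   # around the release
conv_oct_last, conv_nov_last = 0.048, 0.0576  # same months a year earlier

naive_uplift = conv_nov_this / conv_oct_this - 1          # +32%
seasonal_index = conv_nov_last / conv_oct_last            # 1.20
adjusted_uplift = conv_nov_this / (conv_oct_this * seasonal_index) - 1  # +10%
print(round(naive_uplift, 2), round(adjusted_uplift, 2))
```

Even the adjusted figure still assumes last year's seasonality carries over unchanged, which is exactly the assumption that fails in a growing market.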

The second option included building a synthetic control based on the behavior of mobile app users, where personalization had been introduced earlier and functioned stably. This method allowed for consideration of the specificity of product metrics and seasonal fluctuations through a weighted combination of historical data. However, it required a strong assumption of parallel trends between web and mobile, which was not met due to demographic differences between audiences and varying user scenarios (the web was used for deep searches, while the app was for quick purchases).
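A minimal synthetic-control sketch on simulated donor series (all numbers hypothetical): pre-period weights are fit with non-negative least squares, a simplification of the classical method, which additionally constrains the weights to sum to one:

```python
# Synthetic control sketch: weight donor series (e.g. mobile segments) to
# match the web series in the pre-period, then compare post-period actuals
# against the synthetic counterfactual. Simulated data, hypothetical numbers.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
T_pre, T_post, n_donors = 30, 14, 5
donors = rng.normal(100, 5, size=(T_pre + T_post, n_donors))
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
web = donors @ true_w + rng.normal(0, 0.5, T_pre + T_post)
web[T_pre:] += 8.0   # planted post-release lift

# Non-negative least squares on the pre-period only (no peeking at post).
# Classical SCM also constrains sum(w) == 1; nnls is a simplification.
w, _ = nnls(donors[:T_pre], web[:T_pre])
synthetic = donors @ w
effect = (web[T_pre:] - synthetic[T_pre:]).mean()
print(round(effect, 1))  # close to the planted +8
```

The quality of the pre-period fit is the diagnostic here: if the synthetic series cannot track the web series before the release, the post-period gap is not interpretable as an effect.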

The third approach suggested a quasi-experimental difference-in-differences (DiD) model comparing metric dynamics between users with rich histories and newcomers going through cold start. This method isolated the effect of the recommendation system from the effect of model training, using the interaction between time and user type as the source of variation. A key limitation was the assumption that no systematic shocks affected the two groups differently at the same time, which required a thorough check of parallel trends in the pre-intervention period.

A hybrid approach was chosen, combining DiD with post-stratification by cohort and an adjustment for the algorithm's learning curve. This solution controlled both for individual heterogeneity between user segments and for market-level time trends. A key factor was the ability to exploit natural variation in adaptation speed: experienced users immediately received relevant recommendations, while newcomers needed 5-7 sessions to accumulate signal, creating a "natural control" for estimating the pure effect of the system without distortion from the novelty effect.
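Post-stratification by tenure can be illustrated with a toy table (all conversion rates hypothetical): reporting the effect per cohort keeps the cold-start dip from being averaged into the steady-state lift:

```python
# Cohort stratification sketch: estimate the effect separately by user
# tenure so the cold-start dip and the mature lift stay visible.
# All conversion rates below are hypothetical.
import pandas as pd

data = pd.DataFrame({
    "tenure_days": ["0-7", "8-14", "15-21", "22+"],
    "conv_pre":  [0.040, 0.050, 0.055, 0.060],   # before the release
    "conv_post": [0.038, 0.049, 0.058, 0.065],   # after the release
})
data["effect_pp"] = (data["conv_post"] - data["conv_pre"]) * 100
print(data[["tenure_days", "effect_pp"]])
# youngest cohort: negative (cold start); 22+ days: positive (stable lift)
```

A single pooled number over this table would depend mostly on the traffic mix across tenure buckets, not on the system itself.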

The analysis revealed that the true effect of personalization was +8.3% in purchase conversion and +12% in average order value, but only from the 21st day after a user's first visit. In the first two weeks there was a paradoxical 3% decline in conversion among new users due to the model's cold start, offset by a surge in activity among regular customers (+15%). Without accounting for the temporal structure of the data, the business could have mistakenly rolled back the change before metrics stabilized, potentially losing a projected 240 million rubles in annual revenue.

What candidates often overlook

How to properly account for the model training period in the absence of a clear separation between training and test samples in production?

Candidates often ignore the fact that ML models in production are in a state of continuous online learning, with model parameters updated on streaming data in real time. The correct approach is to model the learning curve by treating recommendation quality (NDCG, MAP) as an intermediate mediating variable. One builds a two-step model: first estimate the effect of time on recommendation quality, then the effect of quality on business metrics, using instrumental variables to resolve endogeneity. Without this, the analyst may confuse the effect of algorithm improvement with the effect of user-data accumulation, leading to incorrect conclusions about the optimal evaluation horizon.
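A toy two-stage sketch on simulated data (all coefficients hypothetical): days since release serves as the instrument for recommendation quality, and an unobserved engagement confounder biases the naive regression upward:

```python
# Two-stage (2SLS) sketch on simulated data, hypothetical coefficients.
# Naive OLS of conversion on observed quality is biased because unobserved
# engagement drives both; instrumenting quality with time removes the bias.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
days = rng.integers(1, 60, n).astype(float)      # days since release
u = rng.normal(0, 0.05, n)                       # unobserved engagement
quality = 0.4 + 0.005 * days + u + rng.normal(0, 0.05, n)  # NDCG proxy
conversion = 0.02 + 0.2 * quality + 0.5 * u + rng.normal(0, 0.01, n)

def ols_slope(x, y):
    """Slope from a simple OLS fit with intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols_slope(quality, conversion)[1]        # biased upward (~0.30)
stage1 = ols_slope(days, quality)                # stage 1: quality ~ days
quality_hat = stage1[0] + stage1[1] * days       # fitted (exogenous) quality
iv = ols_slope(quality_hat, conversion)[1]       # 2SLS, recovers ~0.20
print(round(naive, 2), round(iv, 2))
```

The sketch omits the corrected standard errors a real 2SLS estimator would report; it only demonstrates the bias direction and the point-estimate fix.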

Why is it critically important to check the parallel trends assumption in quasi-experiments with personalization not only before but also after the intervention?

Standard practice for parallel trends assumption testing in DiD is limited to the pre-intervention period; however, in personalization systems, there is a risk of trend divergence after implementation due to varying demand elasticity among segments. For example, high-value users may accelerate their purchases under the influence of personalization, while churned users may continue to decline in activity linearly. Candidates should utilize the event study method with dynamic effects (dynamic DiD) to visualize trend deviations in the post-period and apply corrections for heterogeneous treatment effects through models with fixed user and time effects.
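An event-study check can be sketched as follows (simulated weekly data, hypothetical numbers): gaps at negative event times should hover near zero if trends were parallel, while post-release gaps trace the dynamic effect:

```python
# Event-study sketch: treated-minus-control conversion gap per week
# relative to the release, normalized at t = -1. Simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
periods = np.arange(-6, 7)          # weeks relative to release
rows = []
for t in periods:
    # planted effect: zero before release, ramping to +3 pp over 3 weeks
    lift = 0.0 if t < 0 else 0.03 * min(t + 1, 3) / 3
    rows.append({
        "t": t,
        "treated_conv": 0.10 + lift + rng.normal(0, 0.002),
        "control_conv": 0.10 + rng.normal(0, 0.002),
    })
df = pd.DataFrame(rows)
df["gap"] = df["treated_conv"] - df["control_conv"]
df["gap_rel"] = df["gap"] - df.loc[df["t"] == -1, "gap"].iloc[0]
print(df[["t", "gap_rel"]].round(3))
# leads (t < 0) hover near zero; lags ramp toward roughly +0.03
```

Plotting `gap_rel` against `t` with confidence intervals is the usual diagnostic: non-zero leads flag a parallel-trends violation before any effect is claimed.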

How to avoid Simpson's paradox when aggregating results across segments with different baseline conversion rates and varying sensitivity to personalization?

A common mistake is computing a weighted average effect across the entire audience without accounting for compositional shifts in the traffic structure. If personalization is rolled out during a period when the share of new users is growing (low baseline conversion, high relative lift from recommendations), the aggregated effect may appear negative even when the effect is positive in every segment. The fix is stratification followed by standardized averaging (standardized mean treatment effect), or doubly robust estimation, which combines a propensity-score model with an outcome model and remains consistent as long as at least one of the two models is correctly specified.
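A toy illustration of the paradox and the stratified fix (all counts hypothetical):

```python
# Simpson's paradox sketch: treatment wins in EVERY segment, yet the pooled
# comparison flips sign because treatment traffic skews toward the
# low-conversion "new" segment. All counts are hypothetical.
import pandas as pd

seg = pd.DataFrame({
    "segment": ["new", "new", "returning", "returning"],
    "group":   ["control", "treatment", "control", "treatment"],
    "users":   [1000, 8000, 9000, 2000],
    "conversions": [20, 200, 900, 220],
})
seg["cr"] = seg["conversions"] / seg["users"]
# per-segment: new 2.0% -> 2.5%, returning 10.0% -> 11.0% (both improve)

pooled = seg.groupby("group")[["users", "conversions"]].sum()
pooled_cr = pooled["conversions"] / pooled["users"]
pooled_diff = pooled_cr["treatment"] - pooled_cr["control"]  # negative!

# Fix: standardize segment effects to a common mix (total users per segment)
mix = seg.groupby("segment")["users"].sum()
cr = seg.pivot(index="segment", columns="group", values="cr")
standardized = ((cr["treatment"] - cr["control"]) * mix / mix.sum()).sum()
print(round(pooled_diff, 3), round(standardized, 4))  # sign flip resolved
```

The standardized estimate weights each segment's effect by a fixed population mix instead of by where the treatment traffic happened to land, which is exactly what the pooled average fails to do.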