Historical context. In classic online retail, out-of-stock (OOS) items in orders were traditionally resolved either by canceling the item or through a manual call from a manager, both of which sharply reduced conversion and customer satisfaction. With the development of ML recommendation systems, it became possible to offer real-time substitutions based on semantic similarity, price parity, and substitution history. However, naively comparing orders with substitutions against those without yields a biased estimate: the very existence of a substitution is correlated with the initial product shortage, and users who opt in to substitutions differ systematically from those who reject them.
Problem statement. The key difficulty lies in the endogeneity of self-selection: loyal users are more likely to allow substitutions, while random shortages affect the sample unevenly across categories (perishables vs electronics). Additionally, the implementation occurs at the warehouse level, which excludes classic A/B testing at the user level due to contamination through shared inventory. It is necessary to isolate the pure effect of ML substitution quality from the baseline negativity of product unavailability and account for heterogeneity across categories.
Detailed solution. The optimal approach combines Difference-in-Differences (DiD) at the warehouse level with a Causal Forest for estimating effect heterogeneity. For warehouses with ML substitutions (treatment), a control group is built with the Synthetic Control Method from warehouses without substitutions that have a similar demand structure and seasonality. Within treatment warehouses, Propensity Score Matching pairs users who accept substitutions with those who reject them, based on historical characteristics (order frequency, average check, category preferences). The effect is then estimated as a Conditional Average Treatment Effect (CATE), broken down by substitutability tier (high/medium/low), which separates the technological effect from selection bias.
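The core warehouse-level contrast behind this design can be illustrated with a minimal sketch. All rates and the function name below are hypothetical illustrations, not real data or a library API:

```python
# Minimal sketch of the 2x2 warehouse-level DiD contrast, assuming we
# already have pre/post mean order-completion rates for the treatment
# warehouses and their synthetic control. Numbers are invented.

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """(change in treatment group) minus (change in control group)."""
    return (treat_post - treat_pre) - (control_post - control_pre)

effect = did_estimate(
    treat_pre=0.78, treat_post=0.85,      # warehouses with ML substitutions
    control_pre=0.77, control_post=0.79,  # synthetic-control warehouses
)
print(round(effect, 3))  # -> 0.05, i.e. +5 p.p. attributable to the rollout
```

In practice the same contrast is run as a regression (outcome ~ treated * post) with warehouse and period fixed effects, so that standard errors can be clustered at the warehouse level.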
The company “ProductPlus” implemented a smart substitution system for out-of-stock items in online orders. The problem was that 15% of orders contained out-of-stock items, leading to user attrition. Analysts needed to measure whether ML substitutions indeed reduce the negative effect of shortages or merely mask procurement issues.
First option — a classic A/B test on users, split into "substitution enabled" and "substitution disabled" groups. Pros: straightforward interpretation and direct comparability of conversion metrics. Cons: impractical in reality, because one warehouse serves both groups and an out-of-stock item cannot be "returned" to the shelf for the control group, which creates logistical collapse and contamination.
Second option — a "before and after" comparison in the same warehouses without a control group. Pros: simplicity of calculation and no need to synchronize with other warehouses. Cons: demand seasonality and changes in the product mix distort the result, making it impossible to separate the effect of the feature from overall growth of the base.
Third option — a quasi-experimental Difference-in-Differences design using urban micro-warehouses as the units of assignment, where treatment warehouses received the ML model while control ones remained on manual approval. Pros: differences out common trends and seasonality, enabling statistically defensible conclusions. Cons: requires the strong parallel-trends assumption and a sufficient number of homogeneous warehouses to build a synthetic control.
The chosen solution: the team selected the third option with the additional application of Causal Forest to segment users by their likelihood of accepting substitutions. This allowed isolating the effect for “conservatives” and “early adopters” separately, adjusting for prior order history through Propensity Score Matching.
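The matching step can be sketched as 1-nearest-neighbor matching on a precomputed propensity score. Scores and outcomes below are invented for illustration; in production the score would come from a model fitted on order history:

```python
# Hedged sketch: 1-NN propensity-score matching on the treated
# (substitution accepters), with matching done with replacement.
# Each tuple is (propensity_score, retention_outcome); data is made up.

def att_via_psm(treated, control):
    """Match each treated unit to its nearest control by score and
    average the outcome differences (ATT estimate)."""
    diffs = []
    for score_t, y_t in treated:
        _, y_c = min(control, key=lambda c: abs(c[0] - score_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

treated = [(0.81, 1), (0.65, 1), (0.42, 0)]  # accepted substitutions
control = [(0.80, 1), (0.60, 0), (0.40, 0)]  # rejected substitutions
print(att_via_psm(treated, control))
```

A production version would add a caliper (a maximum allowed score distance) and check covariate balance after matching before trusting the estimate.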
Final result: it was determined that ML substitutions increased retention by 12% only for categories with high substitutability (dairy, groceries), but decreased satisfaction by 8% for niche products (craft beer, organic items), where substitutions were perceived as intrusive. The company limited substitutions to categories with high correlation of preferences, resulting in a rise in NPS by 0.4 points and a 23% reduction in operational costs on manual restocking.
How to distinguish the effect of the substitution technology itself from the effect of the specific ML model quality and avoid survivorship bias?
Answer. Candidates often confuse the technological effect (the ability to substitute per se) with the quality effect (the accuracy of the selection). To separate them, construct a dose-response function in which the "dose" is the model's predicted relevance of the substitution (e.g., a confidence score validated against NDCG@1). A Fuzzy Regression Discontinuity design around the model's auto-acceptance threshold (for example, substitutions with confidence just above 0.8 versus just below it) isolates the pure quality effect from the effect of the feature's mere existence. It is also essential to account for survivorship bias: users who received poor substitutions in their first order may disable the feature permanently, skewing the sample toward successful cases. To correct for this, a Heckman selection model is applied, jointly estimating the selection equation (the probability of remaining in the sample after the first experience) and the outcome equation (satisfaction).
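The fuzzy-RD logic reduces to a local Wald ratio: the jump in the outcome at the cutoff divided by the jump in take-up. A toy sketch with invented observations near a hypothetical 0.8 confidence cutoff:

```python
# Hedged sketch of a fuzzy-RD (local Wald) estimate at the confidence
# cutoff. Each tuple is (accepted: 0/1, satisfied: 0/1) for orders whose
# substitution confidence fell just below / just above the cutoff.
# Observations are illustrative only.

def fuzzy_rd(below, above):
    """Local Wald ratio: outcome jump at the cutoff divided by the
    jump in the share of accepted substitutions."""
    def means(rows):
        n = len(rows)
        return sum(a for a, _ in rows) / n, sum(y for _, y in rows) / n
    acc_b, sat_b = means(below)
    acc_a, sat_a = means(above)
    return (sat_a - sat_b) / (acc_a - acc_b)

below = [(0, 0), (1, 1), (0, 0), (0, 1)]  # confidence just under 0.8
above = [(1, 1), (1, 1), (0, 0), (1, 1)]  # confidence just over 0.8
print(fuzzy_rd(below, above))  # -> 0.5
```

A real implementation would restrict to a narrow bandwidth around the cutoff and use local linear regression on both sides rather than raw means.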
How to account for cross-contamination between categories when an unsuccessful substitution in one category affects the perception of the entire order and leads to the cancellation of other items?
Answer. The standard approach assesses the effect of a category in isolation, ignoring negative spillover on the cart. To account for inter-category effects, it is necessary to model the order as a system of interdependent items, using Graph Causal Models or Structural Equation Modeling (SEM). Specifically: a graph of category dependencies is constructed (for instance, substituting yogurt influences the perception of muesli), and the effect is evaluated through Total Treatment Effect with control of neighboring item covariates. Alternatively, Mediation Analysis is employed, where the mediator is the “disappointment flag” (removal of other items from the cart after a substitution is shown). This enables the decomposition of the total effect into direct (within the category) and indirect (through changes in the cart) effects, avoiding an overestimation of the benefit from substitutions.
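The linear (Baron–Kenny style) version of this decomposition can be sketched in a few lines. Here T is substitution exposure, M the "disappointment flag" mediator, and Y the number of items kept; all data is invented for illustration:

```python
# Hedged sketch of linear mediation: the total effect of T on Y splits
# exactly into a direct part (coef on T in Y ~ T + M) and an indirect
# part (a*b, flowing through the mediator M). Toy data only.

def mediation_decomposition(T, M, Y):
    n = len(T)
    center = lambda xs: [x - sum(xs) / n for x in xs]
    t, m, y = center(T), center(M), center(Y)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    Stt, Stm, Smm = dot(t, t), dot(t, m), dot(m, m)
    Sty, Smy = dot(t, y), dot(m, y)
    a = Stm / Stt                            # slope of M ~ T
    total = Sty / Stt                        # slope of Y ~ T
    det = Stt * Smm - Stm * Stm
    direct = (Smm * Sty - Stm * Smy) / det   # coef on T in Y ~ T + M
    b = (Stt * Smy - Stm * Sty) / det        # coef on M in Y ~ T + M
    return total, direct, a * b              # total == direct + indirect

T = [0, 0, 1, 1, 0, 1]   # substitution shown in the order
M = [0, 0, 1, 0, 1, 1]   # "disappointment flag" (other items removed)
Y = [5, 4, 2, 4, 3, 1]   # items kept in the cart
total, direct, indirect = mediation_decomposition(T, M, Y)
print(round(total, 3), round(direct, 3), round(indirect, 3))
```

The exact additivity total = direct + indirect holds only in the linear case; with interactions or nonlinear outcomes, counterfactual-based natural direct/indirect effects are needed instead.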
How to correctly interpret the results if the ML model demonstrates dynamic learning (learning effects), and the quality of substitutions improves over time, creating a temporal trend in the treatment group?
Answer. Novice analysts tend to ignore the non-stationarity of the effect, assuming a constant ATE across the entire observation horizon. Under dynamic model learning, the effect "today" systematically differs from the effect "a month ago", which violates the "no hidden versions of treatment" component of SUTVA: units treated at different times effectively receive different treatments. The solution is to apply Time-Varying Coefficient Models or Bayesian Structural Time Series (BSTS), modeling the effect trend as a latent variable. Within DiD, include the interaction of time and treatment (an event-study design) and test the parallel-trends hypothesis for each temporal slice. If the effect is growing, it is important to distinguish the model's learning curve (algorithm improvement) from user adaptation (users getting used to the feature), using separate user cohorts and model-version cohorts for the decomposition.
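An event-study design boils down to one DiD contrast per period against the pre-rollout baseline gap, instead of a single pooled ATE. A minimal sketch with hypothetical per-period group means:

```python
# Hedged sketch of an event-study check for a time-varying effect.
# Per-period outcome means for both warehouse groups are invented.

def event_study(treat_means, control_means, pre_periods):
    """DiD contrast of each period against the baseline treatment-control
    gap from the pre-rollout periods. Pre-period estimates should hover
    near zero (parallel-trends check); post-period ones trace the
    effect's path over time."""
    base_gap = (sum(treat_means[:pre_periods]) / pre_periods
                - sum(control_means[:pre_periods]) / pre_periods)
    return [t - c - base_gap for t, c in zip(treat_means, control_means)]

treat   = [0.70, 0.71, 0.76, 0.79, 0.82]  # ML rollout after period 2
control = [0.68, 0.69, 0.70, 0.71, 0.71]
print([round(e, 3) for e in event_study(treat, control, pre_periods=2)])
```

Nonzero pre-period estimates would flag a parallel-trends violation before any of the post-period effects are interpreted; a growing post-period path is the signature of the learning effect discussed above.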