Product Analytics (IT): Product Analyst

What method should be used to assess the causal effect of implementing the 'Saved Shopping Lists' feature with ML restocking recommendations on the frequency of repeat orders and average check, given that the creation of lists self-selects highly engaged planning users and the effect depends on the seasonality of product categories and limited shelf life of products?


Answer to the question

Historical Context

The shift in e-commerce from impulse buying to planned consumption began with the launch of Amazon Subscribe & Save in 2008, when retailers realized that retention through reducing cognitive load on repeat purchases was more effective than aggressive discounting. By 2015, smart lists with machine-learning restocking forecasts had appeared, analyzing the intervals between purchases of staples such as milk or diapers. Early effectiveness assessments, however, ran into a fundamental problem: users who create lists already exhibit higher planning discipline and loyalty, which makes a direct comparison with a 'cold' audience causally invalid.

Problem Statement

The key difficulty is the endogeneity of self-selection: creating a list is not a random intervention but the result of a user's conscious intention to optimize their spending. This produces selection bias: 'treatment' (having a list) correlates with unobserved characteristics (organization, family size, regularity of consumption). Temporal dynamics add a second layer: the effect of lists for perishable products (weekly restocking) differs from that for seasonal goods (New Year decorations), and ML recommendations may cannibalize spontaneous additions to the cart, distorting the overall revenue analysis.

Detailed Solution

The optimal approach is a combination of Difference-in-Differences (DiD) with Propensity Score Matching (PSM) and Fixed Effects to control for seasonality. In the first stage, we use Causal Forest to assess the heterogeneity of the effect across product categories, identifying segments where lists genuinely increase frequency rather than merely recording existing behavior. To isolate the causal link, we apply Regression Discontinuity Design (RDD) at the threshold of previous orders, where the 'Saved Lists' feature becomes available (e.g., after the third order), creating quasi-experimental conditions for local randomization. Alternatively, with gradual implementation across regions, we use the Synthetic Control Method, constructing a weighted combination of control regions that mimic the dynamics of the test region before implementation. To account for cannibalization, we analyze not only the metrics of list users but also the Diversion Ratio — the share of orders flowing from spontaneous sessions into planned purchases via lists.
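The synthetic-control step above can be sketched in a few lines. The simulation below is purely illustrative: the weekly orders-per-user series, the two donor regions, and the true +0.5 lift are all hypothetical, and the donor weight is found by a simple grid search minimizing pre-launch error rather than the full constrained optimization used in practice.

```python
# Minimal synthetic-control sketch on simulated data. All series values
# and the +0.5 post-launch lift are hypothetical.
import random

random.seed(7)

T_PRE, T_POST = 12, 6  # weeks before / after the feature launch

# Weekly orders-per-user for the treated region and two donor regions.
# By construction the treated region tracks a 0.6/0.4 donor blend.
donor_a = [3.0 + 0.05 * t + random.gauss(0, 0.05) for t in range(T_PRE + T_POST)]
donor_b = [4.0 + 0.02 * t + random.gauss(0, 0.05) for t in range(T_PRE + T_POST)]
treated = [0.6 * a + 0.4 * b + random.gauss(0, 0.05)
           for a, b in zip(donor_a, donor_b)]
for t in range(T_PRE, T_PRE + T_POST):
    treated[t] += 0.5  # true post-launch effect

# Grid-search the donor weight w (synthetic = w*A + (1-w)*B) that best
# reproduces the treated region's PRE-launch trajectory.
def pre_mse(w):
    return sum((treated[t] - (w * donor_a[t] + (1 - w) * donor_b[t])) ** 2
               for t in range(T_PRE)) / T_PRE

best_w = min((i / 100 for i in range(101)), key=pre_mse)

# Effect estimate: mean post-launch gap between treated and synthetic.
synthetic = [best_w * donor_a[t] + (1 - best_w) * donor_b[t]
             for t in range(T_PRE, T_PRE + T_POST)]
effect = sum(y - s for y, s in zip(treated[T_PRE:], synthetic)) / T_POST
print(f"donor weight: {best_w:.2f}, estimated lift: {effect:.2f}")
```

The same logic scales to many donor regions, where the weights are fit jointly under non-negativity and sum-to-one constraints.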

Real-Life Situation

Context: The hypermarket 'AlwaysFood' launched the 'Smart Refrigerator' feature — automatic restocking lists based on AI analysis of purchase history and expiration dates. The goal was to increase order frequency by 20% by reducing friction in repeat purchases of household goods and food products.

Solution Option 1: Direct Comparison of Users with and without Lists (Before-After)

The analytics team proposed comparing the average check and order frequency of 10,000 users who created lists in the first week against a control group of random users without lists. The pros of this approach are maximum implementation simplicity and fast results. The con is severe selection bias: list creators turned out to be families with children ordering weekly, while the control group included casual visitors with one-off orders. The observed 35% uplift was an artifact of self-selection, not an effect of the feature.
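This selection artifact is easy to reproduce. In the simulation below (all parameters are invented for illustration), list adoption is made more likely for high-activity users and the true causal lift is fixed at +0.6 orders; a matched difference-in-differences estimate is then contrasted with the naive with-vs-without comparison.

```python
# Hypothetical simulation of self-selection: engaged users opt into
# lists more often, so the naive comparison inherits their higher baseline.
import math
import random

random.seed(42)

TRUE_LIFT = 0.6  # assumed causal effect of the feature, orders/month

users = []
for _ in range(2000):
    prior = random.gauss(5, 2)                   # baseline activity
    p_list = 1 / (1 + math.exp(-(prior - 6)))    # self-selection into lists
    has_list = random.random() < p_list
    pre = prior + random.gauss(0, 0.5)
    # Engaged users trend upward regardless (0.1 * prior drift).
    post = pre + 0.1 * prior + (TRUE_LIFT if has_list else 0.0) \
           + random.gauss(0, 0.5)
    users.append((prior, has_list, pre, post))

listed = [u for u in users if u[1]]
others = [u for u in users if not u[1]]

# Naive comparison of post-period means (what Option 1 did).
naive = (sum(u[3] for u in listed) / len(listed)
         - sum(u[3] for u in others) / len(others))

# Match each list user to the nearest non-list user on baseline activity
# (a one-covariate stand-in for a propensity score), then take DiD.
did_terms = []
for prior_t, _, pre_t, post_t in listed:
    _, _, pre_c, post_c = min(others, key=lambda c: abs(c[0] - prior_t))
    did_terms.append((post_t - pre_t) - (post_c - pre_c))
did = sum(did_terms) / len(did_terms)

print(f"naive gap: {naive:.2f}")   # inflated well above the true lift
print(f"PSM-DiD:   {did:.2f}")     # close to the assumed +0.6
```

The naive gap absorbs the entire baseline difference between planners and casual visitors, which is exactly the mechanism behind the spurious 35%.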

Solution Option 2: Forced A/B Testing with Button Visibility

The product team suggested showing 50% of users a bright green 'Create List' button while the other 50% saw it in gray, hidden in the menu, creating a difference in feature penetration. Pro: the ability to estimate the pure effect of feature availability. Cons: ethical and UX risks, since hiding a useful feature from loyal users degraded their experience, and low conversion to list creation (2% in the hidden arm vs 15% in the visible arm) left the test statistically underpowered and unable to capture long-term habit formation.
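The power problem can be quantified: with 2% vs 15% uptake, only the 13-point gap in adoption carries the treatment effect, so the detectable intent-to-treat effect shrinks and the required sample balloons. The sketch below uses the standard two-sample normal approximation; the per-adopter lift of 0.6 orders and outcome standard deviation of 3 are hypothetical numbers.

```python
# Sample-size sketch for the diluted (intent-to-treat) effect in Option 2.
# The 0.6-order lift and SD of 3.0 are hypothetical inputs.
import math
from statistics import NormalDist

ALPHA, POWER = 0.05, 0.80
SIGMA = 3.0          # assumed SD of monthly orders
PER_USER_LIFT = 0.6  # assumed effect on users who actually create a list

z = NormalDist().inv_cdf
z_total = z(1 - ALPHA / 2) + z(POWER)

def n_per_arm(delta):
    """Two-sample normal-approximation sample size for a mean difference."""
    return math.ceil(2 * (z_total * SIGMA / delta) ** 2)

# Full compliance: everyone in the test arm experiences the lift.
n_full = n_per_arm(PER_USER_LIFT)

# Button-visibility test: only the uptake gap (15% - 2%) carries the lift.
itt_lift = PER_USER_LIFT * (0.15 - 0.02)
n_itt = n_per_arm(itt_lift)

print(f"n/arm, full compliance:  {n_full}")
print(f"n/arm, 2% vs 15% uptake: {n_itt} ({n_itt / n_full:.0f}x more)")
```

Because the required n scales with the inverse square of the effect, diluting a 0.6-order lift by a factor of 0.13 multiplies the sample requirement by roughly (1/0.13)^2, i.e. about 60x.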

Solution Option 3: Regression Discontinuity Design at Activity Threshold (Selected Solution)

Analysts chose regression discontinuity, using a threshold of 3 orders in 60 days: users reaching the threshold automatically received access to the 'Smart Refrigerator' with ML recommendations, while users with 2 orders did not. This created quasi-experimental conditions for local randomization near the threshold. Pros: minimal self-selection bias in a narrow band around the cutoff (users with 2 and 3 orders are statistically indistinguishable on observed characteristics). Cons: results generalize only to 'borderline' users rather than the entire base, and the continuity of the covariate distribution around the threshold must be verified.
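The estimation step of this design can be sketched on simulated data: fit a local linear regression on each side of the cutoff within a bandwidth, and read the treatment effect as the gap between the two fits at the threshold. The running-variable scale, bandwidth, and true +0.4 jump below are hypothetical.

```python
# Sharp RDD sketch: local linear fits on either side of the cutoff.
# Trend, noise level, and the +0.4 jump are hypothetical.
import random

random.seed(1)

CUTOFF, BANDWIDTH = 3.0, 1.0
TRUE_JUMP = 0.4  # assumed causal effect at the eligibility threshold

# Simulated users: running variable = recent order activity; users at or
# above the cutoff get the feature.
data = []
for _ in range(4000):
    x = random.uniform(0, 6)
    eligible = x >= CUTOFF
    y = 1.0 + 0.3 * x + (TRUE_JUMP if eligible else 0.0) + random.gauss(0, 0.3)
    data.append((x, y))

def local_fit(points):
    """OLS line y = a + b*x on the given points; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return my - b * mx, b

left = [(x, y) for x, y in data if CUTOFF - BANDWIDTH <= x < CUTOFF]
right = [(x, y) for x, y in data if CUTOFF <= x <= CUTOFF + BANDWIDTH]

# Predicted outcomes at the cutoff from each side; their gap is the
# local treatment effect, since self-selection vanishes near the threshold.
a_l, b_l = local_fit(left)
a_r, b_r = local_fit(right)
jump = (a_r + b_r * CUTOFF) - (a_l + b_l * CUTOFF)
print(f"estimated jump at cutoff: {jump:.2f}")
```

In production the running variable is a discrete order count, so the analysis should also check sensitivity to the bandwidth and verify that users cannot manipulate their position around the threshold.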

Final Result: The analysis showed a true increase in order frequency of 12% (instead of the apparent 35%), with the 8% average-check increase appearing only in the 'Household Chemicals and Paper Products' category. For perishable products the effect was statistically insignificant, since shelf life physically caps restocking frequency. It also emerged that 30% of the revenue growth was cannibalization of spontaneous purchases that migrated into planned ones. Based on these data, the company adjusted its ML model, excluding impulse categories (sweets, chips) from recommendations, which preserved overall revenue growth while increasing user satisfaction, since the 'Smart Refrigerator' stopped nudging unhealthy habits.

What Candidates Often Miss

Why can't we just compare metrics of users with lists and without them through a standard t-test or linear regression?

The answer lies in the fundamental problem of endogeneity and self-selection bias. Users who take the time to create structured lists differ systematically from casual visitors on unobserved characteristics: more planned consumption, larger households, more predictable schedules. OLS regression, even with demographic controls, cannot capture 'planning culture' as a latent variable. The result is an overestimate of the feature's effect: the high metrics are explained not by the lists themselves but by the users' initially high engagement. An unbiased estimate requires instrumental variables (IV) or quasi-experimental designs such as RDD or difference-in-differences with matching (PSM-DiD), which isolate variation independent of individual preferences.
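The omitted-variable mechanism can be shown directly. In the simulation below (all parameters hypothetical), a latent 'planning' trait drives both list adoption and order frequency; the naive OLS coefficient absorbs the confounder, while partialling it out via the Frisch-Waugh-Lovell step recovers the true effect. The adjustment is feasible only because the simulation observes the trait; in production it is latent, which is the whole point.

```python
# Omitted-variable bias demo; the latent trait and all coefficients
# are hypothetical.
import random

random.seed(3)

TRUE_EFFECT = 0.5   # assumed causal effect of having a list
CONFOUND = 1.0      # strength of the latent 'planning culture' trait

# The latent trait drives BOTH list adoption and order frequency.
rows = []
for _ in range(5000):
    planning = random.gauss(0, 1)                 # unobservable in practice
    has_list = 1.0 if random.gauss(planning, 1) > 0 else 0.0
    orders = TRUE_EFFECT * has_list + CONFOUND * planning + random.gauss(0, 1)
    rows.append((planning, has_list, orders))

def slope(xs, ys):
    """Bivariate OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

planning_v = [r[0] for r in rows]
treat_v = [r[1] for r in rows]
orders_v = [r[2] for r in rows]

# Naive OLS of orders on has_list alone: inflated by the confounder.
naive_beta = slope(treat_v, orders_v)

# Frisch-Waugh-Lovell: residualize has_list on the latent trait, then
# regress orders on the residual; equals the multiple-OLS coefficient.
b_tp = slope(planning_v, treat_v)
m_t, m_p = sum(treat_v) / len(rows), sum(planning_v) / len(rows)
resid = [t - (m_t + b_tp * (p - m_p)) for t, p in zip(treat_v, planning_v)]
adj_beta = slope(resid, orders_v)

print(f"naive OLS effect:    {naive_beta:.2f}")  # overstates the lift
print(f"confounder-adjusted: {adj_beta:.2f}")    # near the true 0.5
```

Since no set of demographic controls proxies the latent trait, the practical fix is a design (IV, RDD, DiD) that breaks the correlation rather than more covariates.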

How can the effect of 'planner'-type users be separated from the true effect of the lists feature when analyzing the intensive and extensive margins of impact?

It is necessary to separate the intensive margin (increased frequency among those who already planned their purchases) from the extensive margin (drawing impulsive buyers into planning). Causal Forest or other heterogeneous-treatment-effects analysis serves this, estimating the effect by subgroup. A practical tactic is a regression with dummy variables for the number of lists created: if the feature works, metrics should jump significantly in the transition from 0 to 1 list (extensive margin) but show no significant change from 5 to 6 lists (intensive margin, where self-selection dominates). It is also worth analyzing time-to-event (time until the next order) with a Cox Proportional Hazards model, controlling for baseline churn risk, which separates 'natural' regularity from 'artificial' system prompting.
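The 0-to-1 vs 5-to-6 comparison can be sketched with simple group means per list count. In this toy simulation (all numbers hypothetical), the feature's effect fires once, on having at least one list, while the propensity to create many lists merely proxies the user's own engagement.

```python
# Toy extensive-vs-intensive margin demo; all parameters hypothetical.
import random

random.seed(11)

EXTENSIVE_LIFT = 1.0  # assumed feature effect of having >= 1 list

# Engagement drives both list count and baseline orders, but the FEATURE
# effect operates only on the 0 -> 1 transition.
by_count = {k: [] for k in range(7)}
for _ in range(20000):
    engagement = random.gauss(0, 1)
    n_lists = min(6, max(0, round(2 + engagement + random.gauss(0, 0.7))))
    orders = 4 + 0.3 * engagement \
             + (EXTENSIVE_LIFT if n_lists >= 1 else 0.0) \
             + random.gauss(0, 0.5)
    by_count[n_lists].append(orders)

means = {k: sum(v) / len(v) for k, v in by_count.items() if v}
jump_0_to_1 = means[1] - means[0]
jump_5_to_6 = means[6] - means[5]

print("mean orders by list count: "
      + ", ".join(f"{k}:{m:.1f}" for k, m in sorted(means.items())))
print(f"0 -> 1 lists: +{jump_0_to_1:.2f} (feature effect + selection)")
print(f"5 -> 6 lists: +{jump_5_to_6:.2f} (selection only)")
```

The residual 5-to-6 gap that remains is pure selection, which is why the diagnostic compares the two jumps rather than testing whether the second one is exactly zero.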

How to correctly account for cannibalization between planned purchases through lists and spontaneous add-to-cart, when lists may simply divert revenue from one channel to another without increasing overall GMV?

Candidates often ignore the need to analyze the diversion ratio and basket composition. One should build a triple-difference model (DiD with an additional difference dimension), comparing changes in basket structure for list users before and after launch relative to a control group. It is crucial to track the within-basket share of categories traditionally bought spontaneously (sweets, snacks). If the share of impulse categories falls among list users but holds steady or rises among controls, that signals cannibalization. For quantitative assessment, the Almost Ideal Demand System (AIDS) or the Rotterdam Model can estimate substitution elasticities between purchase channels. Without this analysis, the company may invest in the lists feature and obtain zero incremental effect at the business level, despite rising metrics within the 'list' user segment.
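A back-of-envelope version of the cannibalization check, with entirely invented revenue numbers: compute the diversion ratio as the control-adjusted drop in spontaneous revenue per unit of list-driven gain, then discount the apparent lift accordingly.

```python
# Hypothetical monthly GMV per user (all numbers invented for illustration).
# 'planned' = purchases via lists, 'spont' = spontaneous add-to-cart.
before = {"treated": {"planned": 0, "spont": 100},
          "control": {"planned": 0, "spont": 100}}
after = {"treated": {"planned": 45, "spont": 78},
         "control": {"planned": 0, "spont": 98}}

def delta(group, channel):
    return after[group][channel] - before[group][channel]

# List-driven gain, net of any planned-channel drift in control.
list_gain = delta("treated", "planned") - delta("control", "planned")

# Spontaneous revenue lost by treated users beyond the market-wide drift
# observed in control (the extra difference of the triple-diff logic).
spont_drop = delta("control", "spont") - delta("treated", "spont")

diversion_ratio = spont_drop / list_gain

# Incremental GMV = apparent list gain minus cannibalized revenue.
incremental = list_gain - spont_drop
print(f"diversion ratio: {diversion_ratio:.2f}")
print(f"apparent lift: +{list_gain}, incremental lift: +{incremental}")
```

Here nearly half of the apparent list revenue is diverted rather than incremental, which is exactly the situation where segment-level metrics rise while business-level GMV barely moves.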