Product Analytics (IT) · Product Analyst

How would you assess the causal effect of launching a cashback accumulation program on repeat purchases and average order value, given that activation requires explicit user consent (opt-in), which creates self-selection bias correlated with financial activity, and that accumulated bonuses are redeemed with variable time delays, distorting the short-term validity of metrics?


Answer to the question

Historically, the evaluation of loyalty programs relied on a naive comparison of average order values between participants and non-participants, which overestimates the effect due to selection bias. Modern product analytics requires isolating the true causal effect when users self-select into the program based on unobserved characteristics (e.g., planned purchase volume). The key challenge is separating the program's effect from pre-existing differences between groups and correctly handling the time lags between bonus accumulation and redemption.

To address this, a combination of Propensity Score Matching (PSM) and Difference-in-Differences (DiD) with an extended specification of time effects is needed. In the first stage, a model of the probability of joining the program is estimated on pre-launch covariates (purchase history, demographics, engagement). Participants are then paired with non-participants via nearest-neighbor matching, or the sample is reweighted with inverse-propensity weights (IPW), to balance the distribution of observed characteristics. In the second stage, DiD with user and time fixed effects is applied, with periods grouped into buckets relative to the cashback activation moment (an event study design). This tracks the dynamics of the effect, accounting for the fact that some users redeem bonuses after a week while others do so after a month. To control for cannibalization (intertemporal shifting of purchases), lags of the dependent variable are included, and cohorts with different observation horizons are analyzed through survival analysis.
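A minimal sketch of the first stage on synthetic data. The covariates, the logistic specification, and the 1:1 matching scheme are illustrative assumptions, not a production model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
# Synthetic pre-launch covariates: purchase frequency, avg order value, tenure
X = rng.normal(size=(n, 3))
# Self-selection: more active users opt in more often
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.4 * X[:, 1])))
treated = rng.random(n) < p_true

# Stage 1: propensity model fitted on pre-period covariates only
ps_model = LogisticRegression().fit(X, treated)
ps = ps_model.predict_proba(X)[:, 1]

# Nearest-neighbor matching on the propensity score (1:1, with replacement)
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
control_match = np.where(~treated)[0][idx.ravel()]

# Alternative: inverse-propensity weights (IPW) for the whole sample
w = np.where(treated, 1 / ps, 1 / (1 - ps))
```

In practice the propensity model should be checked for covariate balance after matching (e.g., standardized mean differences), and extreme weights trimmed before using IPW.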

Real-life situation

We launched a 5% cashback accumulation program in an electronics marketplace where users had to activate the option in their profiles. After a month, metrics showed a 40% increase in purchase frequency among participants, but the business doubted the causality, as it was assumed that initially loyal users enrolled in the program. The problem was complicated by the fact that bonuses could only be spent 14 days after being credited, creating an artificial spike in activity in the third week.

The first option considered was a classic A/B test with forced randomization of access to cashback. Pros: clean evaluation of the causal effect. Cons: legal restrictions (financial programs cannot be enforced without consent) and behavior distortion (users who learned of the cashback unavailability left for competitors). This option was rejected due to ethical and business risks.

The second option was a simple comparison of participants vs. non-participants via a t-test with a sample-size correction. Pros: speed of implementation and simplicity of reporting. Cons: severe selection bias and ignored endogeneity; the analysis showed that participants had a 2.3 times higher baseline purchase frequency before activation, invalidating the comparison.

The third option was Regression Discontinuity Design (RDD) based on the threshold of the first purchase amount that automatically entitled users to cashback. Pros: local randomness around the threshold provides an unbiased estimate for marginal users. Cons: the estimate is valid only for a narrow group at the threshold (local average treatment effect), and not for the entire audience; moreover, in our case, there was no hard threshold—the program was available to everyone immediately after opt-in.

The chosen solution combined Propensity Score Matching, to construct a synthetic control group, with cohort-based Difference-in-Differences that accounts for the time lags. We matched participants with non-participants on 15 variables (RFM segments, seasonality, device) and then applied DiD with user and week fixed effects. To handle the 14-day delay, we built an event study with bins relative to the activation moment, which let us separate true growth from shifted purchases. The result: the net incremental effect was +12% in purchase frequency and +8% in average order value after accounting for cannibalization, versus +40% in the raw data. The program was deemed successful, but with significantly more modest ROI expectations.
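The two-way fixed-effects DiD step can be sketched on a simulated panel. The panel size, launch week, and the +0.5 true effect are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_users, n_weeks = 200, 8
df = pd.DataFrame(
    [(u, t) for u in range(n_users) for t in range(n_weeks)],
    columns=["user", "week"],
)
opted_in = df["user"] < 100          # first half of users join the program
post = df["week"] >= 4               # program launches at week 4
df["D"] = (opted_in & post).astype(int)
# Outcome: user baseline + weekly trend + a true +0.5 treatment effect
df["y"] = (
    0.01 * df["user"] + 0.1 * df["week"] + 0.5 * df["D"]
    + rng.normal(scale=0.3, size=len(df))
)

# DiD with user and week fixed effects; the coefficient on D is the
# incremental effect net of level differences and common time shocks
fit = smf.ols("y ~ D + C(user) + C(week)", data=df).fit()
effect = fit.params["D"]  # should recover roughly +0.5
```

The user fixed effects absorb the pre-existing level differences between opt-in and non-opt-in users, which is exactly the bias that sank the naive participant-vs-non-participant comparison.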

What candidates often overlook

How to correctly distinguish the program's effect from temporal purchase shifts (intertemporal substitution) when there are lags between the accumulation and redemption of bonuses?

The answer requires an understanding of dynamic treatment effects. It is necessary to model not only the average effect but also its dynamics through an event study specification: Y_it = α_i + γ_t + Σ_k β_k · D_i,t-k + ε_it, where the D_i,t-k are dummy variables indexed relative to the activation moment. If the coefficients β_k before activation do not differ significantly from zero (the parallel-trends test), while after activation they show a spike followed by a drop below the baseline level, this indicates cannibalization (borrowed demand). To assess the pure LTV effect, integrate the effect over time and compare it with the counterfactual via the Synthetic Control Method, built on donor units with similar pre-treatment trajectories.
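This event-study specification can be sketched on synthetic panel data. The activation week, the bin range, and the spike-then-dip effect profile below are assumptions chosen to mimic cannibalization; k = -1 is the conventional reference period:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_users, n_weeks = 300, 10
df = pd.DataFrame(
    [(u, t) for u in range(n_users) for t in range(n_weeks)],
    columns=["user", "week"],
)
# Half the users activate cashback at week 5; event time k = week - 5
activated = df["user"] < 150
df["k"] = np.where(activated, df["week"] - 5, np.nan)
# True dynamics: spike at k=0..1, then demand borrowed from later weeks
effect = {0: 0.8, 1: 0.6, 2: -0.2, 3: -0.1, 4: -0.1}
df["y"] = (
    0.05 * df["week"] + df["k"].map(effect).fillna(0)
    + rng.normal(scale=0.2, size=len(df))
)

# One dummy per event-time bin, with k = -1 omitted as the reference
for k in range(-5, 5):
    if k == -1:
        continue
    name = f"D_m{-k}" if k < 0 else f"D_p{k}"
    df[name] = (df["k"] == k).astype(int)

dummies = " + ".join(c for c in df.columns if c.startswith("D_"))
fit = smf.ols(f"y ~ {dummies} + C(user) + C(week)", data=df).fit()
# Pre-period coefficients (D_m*) near zero support parallel trends;
# positive D_p0/D_p1 followed by negative bins signal cannibalization
```

Integrating the estimated β_k over the post-period bins gives the net effect after the borrowed demand is paid back.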

Why can a standard A/B test with individual randomization violate the SUTVA assumption in cashback systems?

SUTVA (Stable Unit Treatment Value Assumption) is violated when one user's bonuses influence the behavior of others through networks (e.g., family accounts or corporate purchases). If a husband activates cashback and makes a purchase for the family while the wife stops her separate purchases, individual randomization will yield a biased estimate. It is necessary to implement Cluster Randomization at the household level or use diffusion analysis methods (Spillover Effects), such as Two-Stage Least Squares (2SLS) with instrumental variables (e.g., threshold values for activation varying between clusters).
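Cluster randomization at the household level can be implemented with deterministic hashing, so that every member of a household lands in the same arm. The household ID format and salt below are hypothetical:

```python
import hashlib

def household_assignment(household_id: str, salt: str = "cashback_v1") -> str:
    """Deterministic 50/50 assignment at the household level: all members
    of one household share an arm, protecting SUTVA from within-household
    spillovers (shared bonuses, substituted purchases)."""
    digest = hashlib.sha256(f"{salt}:{household_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every lookup for the same household returns the same arm
assert household_assignment("hh_42") == household_assignment("hh_42")
```

Changing the salt re-randomizes the whole population, which is useful for running independent cluster experiments on the same household universe.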

How to account for the heterogeneity of effect over the user lifetime (customer lifetime stage) given seasonality?

Candidates often overlook that the cashback effect differs for new users (primary motivation effect) and mature users (retention effect). It is necessary to apply Triple Difference (DDD): the program effect = (Y_post - Y_pre) for treatment - (Y_post - Y_pre) for control, differentiated by tenure segments (new/mature). Seasonality is controlled through fixed effects of the month interacting with the segment. Alternatively, you can use Heterogeneous Treatment Effects via Causal Forests or Meta-learners (S-learner, T-learner), which allows identifying segments with positive CATE (Conditional Average Treatment Effect) and optimizing program targeting on them, avoiding costs on users with zero or negative effects.
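A T-learner sketch of heterogeneous effects by tenure, on synthetic data where the true effect fades with customer age. The data-generating process, the single-covariate design, and the tenure cutoffs are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 4000
tenure = rng.uniform(0, 24, n)            # months since first purchase
T = rng.integers(0, 2, n)                 # treatment, randomized for simplicity
# True effect: strong for new users, fading to zero by one year of tenure
tau = np.maximum(0.0, 1.0 - tenure / 12)
y = 0.05 * tenure + tau * T + rng.normal(scale=0.2, size=n)

X = tenure.reshape(-1, 1)
# T-learner: separate outcome models for treated and control units,
# CATE estimated as the difference of their predictions
m1 = GradientBoostingRegressor().fit(X[T == 1], y[T == 1])
m0 = GradientBoostingRegressor().fit(X[T == 0], y[T == 0])
cate = m1.predict(X) - m0.predict(X)

# New users (tenure < 6 months) should show a larger estimated effect
new_effect = cate[tenure < 6].mean()
mature_effect = cate[tenure > 18].mean()
```

With observational (non-randomized) opt-in, the same structure applies, but each outcome model must condition on the confounders, or the propensity machinery from the main answer must be layered on top (e.g., an X-learner).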