Historical Context
Early retention strategies sent mass discount emails to every user with declining activity. This wasted marketing budget on users who would have stayed anyway and conditioned users to expect discounts. With the advent of Uplift Modeling and Propensity Score methods in the 2010s, companies began to target only users with a high likelihood of churn. However, this created a fundamental evaluation problem: the treated group is self-selected by the model, violating the randomization assumption needed for causal inference.
Problem Statement
The key challenge is establishing a valid counterfactual for users flagged by the churn prediction model as high-risk. These users differ systematically from the general population: lower engagement, recent negative experiences, or specific behavioral patterns. Simply comparing their retention with that of low-risk users, or with their own pre-intervention history, conflates the treatment effect with these inherent differences. Moreover, withholding retention offers from the users at maximum churn risk (a control group) creates unacceptable business risk and revenue loss, making a classic A/B test politically infeasible.
Detailed Solution
Apply Regression Discontinuity Design (RDD) around the risk-score threshold (e.g., 0.7) that triggers the intervention. Users just above and just below the threshold are statistically similar except for treatment assignment, yielding a local average treatment effect (LATE) for marginal users. To generalize to the entire high-risk population, combine RDD with Inverse Probability Weighting (IPW) using propensity scores estimated on pre-intervention data. For users far above the threshold, use Doubly Robust Estimation or Causal Forests to model heterogeneous effects. To handle training-data contamination from previous campaigns, implement a "shadow mode" in which the model generates predictions without triggering actions for a small holdout (5-10%), providing an instrument for Two-Stage Least Squares (2SLS) analysis. Finally, account for communication-channel saturation by using Difference-in-Differences (DiD) to compare temporal trends between risk segments.
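The core RDD step can be sketched on simulated data. Everything below (the `rdd_late` helper, the uniform score distribution, the assumed +0.10 treatment effect at the cutoff) is a hypothetical illustration, not production code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: risk scores, with treatment triggered at the 0.7 threshold
n = 200_000
score = rng.uniform(0, 1, n)
treated = (score >= 0.7).astype(float)
# Retention falls with risk; assume the offer adds +0.10 at the cutoff
retained = rng.binomial(1, 0.8 - 0.5 * score + 0.10 * treated)

def rdd_late(score, outcome, cutoff=0.7, bandwidth=0.05):
    """Local linear RDD: fit intercept and slope on each side of the
    cutoff within the bandwidth; the LATE is the intercept jump."""
    m = np.abs(score - cutoff) <= bandwidth
    x = score[m] - cutoff
    d = (x >= 0).astype(float)
    # y = a + b*x + tau*d + c*x*d; tau is the discontinuity at x = 0
    X = np.column_stack([np.ones_like(x), x, d, x * d])
    beta, *_ = np.linalg.lstsq(X, outcome[m], rcond=None)
    return beta[2]

late = rdd_late(score, retained)  # recovers roughly the simulated +0.10
```

In practice the bandwidth would be chosen by a data-driven rule (e.g., Imbens-Kalyanaraman) and standard errors would come from a dedicated RDD package rather than raw least squares.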
A mobile subscription service (meditation app) implemented ChurnGuard—an ML system that triggers personalized push notifications with a 30% discount for users with a predicted churn probability exceeding 0.75 over the next 7 days.
Option 1: Simple comparison of retention between those who received the discount (high risk) and those who did not (low risk)
Pros: Instant computation with existing BI tools; requires no experimental infrastructure. Cons: Strong self-selection bias, since high-risk users naturally churn more often; the comparison underestimates the effect and can even show a negative correlation (treated high-risk users still churn more than untreated low-risk users).
Option 2: Randomized controlled experiment where 50% of high-risk users are randomly deprived of the retention offer
Pros: Unbiased causal estimation; clear interpretation as an average treatment effect (ATE). Cons: Business stakeholders rejected it for fear of losing valuable users; ethical concerns about knowingly withholding an available intervention; the high-risk segment alone may be too small for adequate statistical power.
Option 3: Regression Discontinuity Design using a threshold of 0.75 for the model plus Synthetic Control Method for validating time series
Pros: Ethically acceptable—users just below the threshold receive standard experiences; exploits the existing algorithmic threshold as a natural experiment; can be implemented retrospectively on historical data. Cons: Estimates only the local effect (for users at the threshold); requires thorough verification of continuity assumptions (absence of score manipulation); less precise than RCT due to a smaller effective sample size in the bandwidth.
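The continuity assumption noted in the Cons can be probed with a simple density check at the cutoff, a rough hypothetical stand-in for a formal McCrary density test: if users or the scoring pipeline manipulated scores to land on one side of the threshold, counts just below and just above it would diverge.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical risk scores; uniform here purely for simplicity
scores = rng.uniform(0, 1, 100_000)
cutoff, h = 0.75, 0.02

below = np.sum((scores >= cutoff - h) & (scores < cutoff))
above = np.sum((scores >= cutoff) & (scores < cutoff + h))
# Near 1 if the score density is smooth (unmanipulated) at the cutoff
ratio = above / below
```

A real check would compare the counts against a smooth density fit on each side; a large jump in `ratio` signals score manipulation and invalidates the RDD.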
Chosen Solution and Justification
Option 3 with a bandwidth of 0.05 around the threshold, supplemented with a Cohort Analysis comparing users one week before and after model deployment, adjusted for seasonality via Propensity Score Matching on behavioral features. Rationale: this balanced statistical rigor against business constraints and measured the effect without denying treatment to clearly high-risk users.
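The seasonality adjustment via Propensity Score Matching can be sketched as follows. The cohort sizes, the single `sessions` feature, the hand-rolled logistic fit, and the simulated +3 deployment effect are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cohorts: post = 1 joined after deployment, 0 before;
# a seasonal shift moves the behavioral feature between cohorts
n = 4000
post = rng.integers(0, 2, n)
sessions = rng.normal(10 + 2 * post, 3, n)
# Retention depends on behavior AND on deployment (simulated +3 effect)
retention = 2.0 + 0.5 * sessions + 3.0 * post + rng.normal(0, 1, n)

# Naive cohort difference mixes the deployment effect with seasonality
naive = retention[post == 1].mean() - retention[post == 0].mean()

# Propensity of being in the post cohort given behavior: a tiny logistic
# fit by gradient ascent (a library like sklearn would be used in practice)
z = (sessions - sessions.mean()) / sessions.std()
X = np.column_stack([np.ones(n), z])
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 1.0 * X.T @ (post - p) / n
ps = 1.0 / (1.0 + np.exp(-X @ w))

# 1-NN matching: pair each post-deployment user with the pre-deployment
# user whose propensity score is closest
post_idx = np.where(post == 1)[0]
pre_idx = np.where(post == 0)[0]
matches = pre_idx[np.abs(ps[post_idx, None] - ps[None, pre_idx]).argmin(axis=1)]
matched = retention[post_idx].mean() - retention[matches].mean()
```

The matched estimate strips out the part of the naive difference driven by the seasonal shift in behavior, leaving an estimate closer to the deployment effect alone.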
Final Result
An 18% relative reduction in 7-day churn was found for users at the threshold (risk score 0.75-0.80). However, for users with risk above 0.90, returns diminished due to "anxiety fatigue" from repeated retention pushes, so the frequency cap was tuned to at most 2 pushes per week. The net effect on LTV was +$1.2M over 3 months, a 340% ROI on discount costs.
Why might comparing the retention rate between users who received the retention campaign and those who did not (even within the high-risk segment) overestimate or underestimate the true effect of the intervention?
Even within the high-risk segment, when a user enters the segment matters. Users who cross the risk threshold early in their lifecycle differ fundamentally from those who cross it late. Without accounting for Time-Varying Confounders (e.g., recent app failures or seasonal events that simultaneously raise risk and make discounts more or less effective), simple comparisons suffer from Survivorship Bias and Simpson's Paradox. The correct approach uses Marginal Structural Models (MSM) with inverse probability weighting to handle time-dependent covariates.
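A single-time-step sketch of the inverse-probability-weighting idea (a full MSM would re-estimate weights at every time point); the confounder, the targeting rule, and the effect sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50_000
# Confounder: a recent bad experience raises churn risk AND the chance
# of being targeted by the retention campaign
bad_exp = rng.binomial(1, 0.3, n)
p_treat = 0.2 + 0.6 * bad_exp        # true treatment probability (known here;
treated = rng.binomial(1, p_treat)   # it would be estimated in practice)
# Simulated truth: campaign adds +0.10 retention, bad experience costs -0.30
retained = rng.binomial(1, 0.6 - 0.3 * bad_exp + 0.1 * treated)

# Naive comparison is biased downward: treated users are riskier to begin with
naive = retained[treated == 1].mean() - retained[treated == 0].mean()

# IPW: weight each user by the inverse probability of the treatment
# actually received, given the confounder
w = treated / p_treat + (1 - treated) / (1 - p_treat)
ipw = (np.average(retained, weights=w * treated)
       - np.average(retained, weights=w * (1 - treated)))
```

Here the naive difference comes out negative even though the true effect is +0.10; the weighted contrast recovers the simulated effect because weighting rebalances the confounder across treatment arms.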
How does the problem of "data leakage" in the training dataset for the churn model distort the assessment of the effectiveness of the churn prevention system itself?
If the churn model was trained on historical data where some users had already received retention campaigns, the labels of the target variable are contaminated. The model learns to identify "users saved by previous campaigns," rather than "users who would have naturally churned." This creates a Feedback Loop, where the model artificially performs well on validation (predicting low churn for treated users) but fails to identify truly at-risk users in production. To fix this, it is necessary to use only pre-intervention data for training or to apply Importance Sampling to reweight training data by the inverse probability of receiving past treatments, effectively simulating the absence of campaigns in the past.
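The reweighting fix can be sketched as follows; the risk distribution, the past-campaign targeting rule, and the assumed 40% churn reduction are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100_000
risk = rng.uniform(0, 1, n)  # latent "natural" churn probability
# Past campaigns targeted risky users, so treatment correlates with risk
p_treat = np.clip(1.5 * (risk - 0.4), 0, 0.9)
treated = rng.binomial(1, p_treat)
# Observed labels are contaminated: past treatment cut churn by 40%
churn = rng.binomial(1, risk * np.where(treated == 1, 0.6, 1.0))

# Naive label mean underestimates natural churn (true mean risk is 0.5)
naive_rate = churn.mean()

# Importance-sampling fix: keep only untreated users, reweighted by
# 1 / P(untreated | features) to recover the campaign-free distribution
w = (1 - treated) / (1 - p_treat)
natural_rate = np.average(churn, weights=w)
```

Training labels built from the reweighted, untreated population approximate what the model would have seen had no campaigns run, breaking the feedback loop.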
Why might standard A/B testing with user-level randomization be inapplicable for evaluating churn prevention systems, and what alternative experimental designs should be used?
Standard A/B testing is often inapplicable because withholding treatment from the control group violates equipoise (an intervention believed to help is deliberately denied) and suffers from Spillover Effects (treated users may share promo codes with controls). Instead, use Cluster Randomization (by geographic region, or by time period via Switchback Experiments) or Encouragement Designs, where the randomized instrument is eligibility for the offer rather than the treatment itself. Another approach is Partial Population Experiments, where the model runs in "shadow mode" for the control group (predictions are made but no actions are taken), allowing a comparison of predicted versus actual churn through Calibration Analysis to measure the true lift.
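The shadow-mode comparison can be sketched as below; the calibration assumption and the simulated 15-point push effect are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30_000
pred = rng.uniform(0.5, 1.0, n)    # model's churn predictions, high-risk users
shadow = rng.binomial(1, 0.1, n)   # 10% holdout: prediction logged, no push sent
# Simulated truth: the model is well calibrated, and the push cuts churn
# by 15 points for everyone outside the shadow group
churn = rng.binomial(1, pred - 0.15 * (1 - shadow))

# Calibration check: in the shadow group, predicted churn ≈ actual churn
shadow_gap = pred[shadow == 1].mean() - churn[shadow == 1].mean()
# Lift: among treated users, actual churn falls short of predicted churn
lift = pred[shadow == 0].mean() - churn[shadow == 0].mean()
```

If the shadow-group gap is large, the lift estimate is not trustworthy: the shortfall of actual versus predicted churn only measures the intervention when the untreated predictions are themselves calibrated.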