Product Analytics (IT): Product Analyst

What method should be used to assess the causal effect of implementing uplift modeling for targeting promo codes on incremental conversion, when traditional A/B testing is complicated by the need to train the model on a full sample, and existing user segmentation creates cross-contamination between test groups?

Answer to the question

Historically, marketing campaigns were evaluated through the average treatment effect (ATE), but the rise of causal ML has produced uplift models that predict the individual treatment effect (ITE). A classic A/B test is paradoxical here: training the model requires data from both treated and control users across all segments, yet evaluating the model requires deploying it, which destroys the control group. This is the exploration-exploitation dilemma.
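In potential-outcomes notation, with Y(1) and Y(0) denoting conversion with and without the promo code, the two quantities above can be written as:

```latex
% ATE: population-average effect of the promo code.
% CATE / ITE: the effect conditional on user features x, which uplift models estimate.
\mathrm{ATE} = \mathbb{E}\left[\,Y(1) - Y(0)\,\right],
\qquad
\tau(x) = \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right]
```

Uplift models estimate τ(x); "persuadables" are exactly the users with a large positive τ(x).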

The problem is further complicated by contamination: the behavior of users in the test group affects the control group through network effects or shared resources (e.g., depletion of promo code limits). A method is needed that allows the model to be trained while its incremental effect is isolated relative to uniform targeting or no campaign at all.

The solution is a two-stage approach. The first stage is exploration: randomize 20-30% of traffic to collect unbiased data and train a model (an X-learner or R-learner) that estimates the CATE (Conditional Average Treatment Effect). The second stage is exploitation: gradually shift traffic to the model via Thompson Sampling or contextual bandits, which minimizes regret. To isolate the effect, use cluster-based randomization (by geographic clusters) or switchback testing (temporal randomization), with subsequent evaluation via the Synthetic Control Method (SCM). Model quality is measured by the Qini coefficient or the Area Under the Uplift Curve (AUUC), adjusted with Inverse Propensity Weighting (IPW) to remove selection bias.
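A minimal sketch of the two-stage flow. The segments, conversion rates, traffic volume, and the 3 pp targeting threshold are all hypothetical, and a per-segment difference in means stands in for a real X-/R-learner, which would regress uplift on user features:

```python
import random

random.seed(0)

# Stage 1: exploration -- randomize the promo within each segment.
# Hypothetical true conversion rates per segment: (control, treated).
# Segment "B" is the "persuadable" one; "C" never responds.
TRUE_RATES = {"A": (0.30, 0.31), "B": (0.10, 0.20), "C": (0.02, 0.02)}

def simulate(segment, treated):
    p_ctrl, p_trt = TRUE_RATES[segment]
    return 1 if random.random() < (p_trt if treated else p_ctrl) else 0

counts = {s: {"t": [0, 0], "c": [0, 0]} for s in TRUE_RATES}  # [conversions, n]
for _ in range(20000):
    seg = random.choice(list(TRUE_RATES))
    treated = random.random() < 0.5          # unbiased 50/50 exploration split
    y = simulate(seg, treated)
    arm = counts[seg]["t" if treated else "c"]
    arm[0] += y
    arm[1] += 1

# Difference-in-means CATE per segment (a simplified stand-in for an X-learner).
cate = {}
for seg, d in counts.items():
    cate[seg] = d["t"][0] / d["t"][1] - d["c"][0] / d["c"][1]

# Stage 2: exploitation -- send promo codes only where estimated uplift
# clears a (hypothetical) 3 pp threshold.
targeted = [s for s, u in sorted(cate.items()) if u > 0.03]
print(cate, targeted)
```

In production, stage 2 would shift traffic gradually (via a bandit) rather than switch all at once, so that a residual randomized slice keeps the CATE estimates honest.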

Life Situation

The problem arose at a marketplace during the launch of a campaign with personalized promo codes. The product manager wanted the uplift model to send discounts only to "persuadables" (users who convert only when given a promo code), skipping "sure things" and "lost causes". A standard A/B test was infeasible: the model needed data on users who did not receive a promo code in every segment, yet holding 50% of the audience out of the campaign would have cut revenue unacceptably.

Option one: hold-out randomization, keeping 10% of users in full control for the entire period. Pros: a clean ATE estimate and the treated/control contrast needed to train the model correctly. Cons: significant opportunity cost, ethical concerns (price discrimination without transparent criteria), and slow model convergence due to the small control group.

Option two: Thompson Sampling with a gradually increasing traffic share. Here the "arms" of the bandit are targeting strategies (uplift model vs. random). Pros: a near-optimal exploration/exploitation trade-off, adaptation to seasonality, and minimal economic loss. Cons: hard to interpret in the early stages, risk of getting stuck in a local optimum if contexts are poorly chosen, and the need for large traffic volumes to reach statistical significance.
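A Beta-Bernoulli Thompson Sampling sketch of option two. The two arm names and their per-send conversion rates are hypothetical; each round samples a plausible rate from every arm's posterior and plays the arm with the highest draw:

```python
import random

random.seed(42)

# Hypothetical per-send conversion rates for the two targeting strategies.
TRUE_P = {"uplift_model": 0.12, "random_targeting": 0.08}

# Beta(1, 1) priors over each arm's unknown conversion rate: [alpha, beta].
beta_params = {arm: [1.0, 1.0] for arm in TRUE_P}

pulls = {arm: 0 for arm in TRUE_P}
for _ in range(5000):
    # Sample a rate from each posterior; play the arm whose sample is highest.
    sampled = {arm: random.betavariate(a, b) for arm, (a, b) in beta_params.items()}
    arm = max(sampled, key=sampled.get)
    reward = 1 if random.random() < TRUE_P[arm] else 0
    beta_params[arm][0] += reward          # successes update alpha
    beta_params[arm][1] += 1 - reward      # failures update beta
    pulls[arm] += 1

print(pulls)  # traffic drifts toward the better arm as evidence accumulates
```

The posterior sampling is what gives the "gradual shift": early on both arms are played often, and the losing arm's traffic decays as its posterior concentrates below the winner's.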

Option three: geo-based synthetic control. Randomization was done by region: test regions used the uplift model, control regions kept the old system. For evaluation, SCM built a weighted combination of control regions that mimicked the test regions' pre-launch behavior. Pros: the effect is isolated without individual-level randomization, aggregated data suffices, and there is no cross-contamination between cities. Cons: regions must be stable over time, small geographic units are sensitive to outliers, and the parallel-trends assumption often breaks during periods of high seasonality.
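A toy synthetic-control sketch with one test region and two control regions. All weekly conversion series are hypothetical, and a grid search over a single convex weight stands in for the constrained least-squares fit used with many donor regions:

```python
# Hypothetical pre-launch weekly conversions for one test and two control regions.
test_region = [100, 104, 110, 108, 115]
control_a   = [ 90,  95, 102,  99, 107]
control_b   = [120, 122, 126, 125, 130]

def mse(w):
    """Pre-period fit error of the synthetic region w*control_a + (1-w)*control_b."""
    synth = [w * a + (1 - w) * b for a, b in zip(control_a, control_b)]
    return sum((s - t) ** 2 for s, t in zip(synth, test_region)) / len(test_region)

# Grid-search the convex weight (a stand-in for constrained least squares
# over a full donor pool with non-negative weights summing to 1).
best_w = min((i / 1000 for i in range(1001)), key=mse)

# Post-launch, the gap between actual and synthetic outcomes estimates the effect.
control_a_post, control_b_post = [109, 112], [131, 133]
actual_post = [130, 134]                       # hypothetical post-launch actuals
synthetic_post = [best_w * a + (1 - best_w) * b
                  for a, b in zip(control_a_post, control_b_post)]
effect = [x - s for x, s in zip(actual_post, synthetic_post)]
print(best_w, effect)
```

The quality of the pre-period fit (e.g., via RMSPE, as in the case described below) is what justifies reading the post-period gap as a causal effect.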

A combined solution was chosen: Geo-cluster Randomization with Synthetic Control for offline validation and Thompson Sampling for online optimization within test clusters. Rationale: geographic randomization excluded cross-contamination (users from different cities rarely interact), and Synthetic Control helped avoid a 50/50 split. Thompson Sampling within test regions enabled quick adaptation of the model to local preferences.

Result: the true incremental effect of the uplift model was isolated at a +12% lift in conversion compared to mass mailing, while promo code spend fell by 35%. Synthetic control showed that, absent the model, the trend in test regions would have tracked the synthetic control's dynamics with 94% accuracy (by RMSPE), supporting the validity of the estimate.

What Candidates Often Miss

Why can’t we simply compare the conversion of those who received the promo code by the model with those who did not (observational data), even if using Propensity Score Matching?

Answer: Self-selection bias and unobserved confounders. Users with high uplift scores may systematically differ in unobserved characteristics (e.g., recent paycheck or search for a specific product). Propensity Score Matching (PSM) only adjusts for observed covariates, but if there is a hidden variable affecting both the probability of receiving the promo code and conversion, the estimate will be biased. For example, active users with many sessions may be wrongly classified as "persuadables", but they will buy even without a discount. For a novice specialist, it is critical to understand that the correlation between predicted uplift and actual conversion does not equal the causal effect — randomization or instrumental variables (IV) are needed for isolation.
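The bias described above is easy to demonstrate with a toy simulation. All numbers are hypothetical: a hidden "intent" flag raises both the chance of receiving the promo and the baseline conversion, so the naive treated-vs-untreated gap greatly overstates the true uplift:

```python
import random

random.seed(1)

def simulate(n=50000):
    """Observational data with an unobserved confounder ('intent')."""
    true_effect = 0.02           # assumed genuine uplift of the promo code
    treated_ys, control_ys = [], []
    for _ in range(n):
        intent = random.random() < 0.3
        p_promo = 0.8 if intent else 0.2   # selection: scores favor high intent
        treated = random.random() < p_promo
        # High-intent users convert far more often even without the promo.
        p_conv = (0.25 if intent else 0.05) + (true_effect if treated else 0.0)
        y = 1 if random.random() < p_conv else 0
        (treated_ys if treated else control_ys).append(y)
    naive = sum(treated_ys) / len(treated_ys) - sum(control_ys) / len(control_ys)
    return naive, true_effect

naive_diff, true_effect = simulate()
print(naive_diff, true_effect)   # the naive gap dwarfs the true 2 pp uplift
```

Since "intent" is unobserved, no amount of propensity score matching on the remaining covariates can remove this gap; only randomization (or a valid instrument) breaks the link between promo receipt and the confounder.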

How does temporal dependency (time-varying confounders) affect the estimation of the uplift model over a long training period, and how to address it?

Answer: Long-term training introduces temporal confounding: user behavior changes (seasonality, product updates), and exploration phase data becomes obsolete by the time of exploitation. The classical uplift model assumes stationarity, which is rarely true. The solution is to use adaptive experimentation with decaying weights for old data or online learning algorithms (e.g., Bayesian Updating). Additionally, monitoring for concept drift through Population Stability Index (PSI) for features and model performance is necessary. Novice analysts often train the model on quarterly data and apply it six months later without checking audience behavior drift (e.g., due to a competitor entering the market), leading to negative uplift in production.
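The PSI check mentioned above is straightforward to compute from binned feature (or score) distributions. The bin counts here are hypothetical; the 0.1 / 0.25 cut-offs are the common rule of thumb:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected / actual: per-bin counts from the training and scoring periods.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 act.
    """
    e_total, a_total = sum(expected), sum(actual)
    total = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)   # clamp to avoid log(0) on empty bins
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

# Hypothetical bin counts for one feature, training period vs. scoring period.
stable  = psi([100, 200, 400, 200, 100], [105, 195, 390, 205, 105])
drifted = psi([100, 200, 400, 200, 100], [300, 250, 250, 120, 80])
print(round(stable, 4), round(drifted, 4))
```

Running PSI per feature (and on the model's uplift scores themselves) on a schedule is a cheap early-warning signal before negative uplift shows up in production.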

Why might the AUUC (Area Under Uplift Curve) metric be misleading when comparing two different uplift models, and what alternatives should be used?

Answer: AUUC depends on the distribution of predicted uplift in the population and is not scale-invariant. If one model conservatively predicts a small uplift for everyone while another aggressively predicts high dispersion, their uplift curves can intersect and AUUC gives an ambiguous ranking. Moreover, AUUC ignores business constraints (the promo code budget). Alternatives are a cost-sensitive Qini coefficient or Expected Response under a fixed budget. For a novice specialist, it is essential to understand that a model that looks good by AUUC is not necessarily good for the business metric. Use policy evaluation by simulating the strategy: rank users by predicted uplift, take the top-K% (per the budget), and compare the realized gain with the counterfactual scenario via Doubly Robust Estimation or Inverse Probability Weighting (IPW).
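The top-K policy evaluation above can be sketched with IPW on randomized logged data. Everything here is hypothetical: the 20% persuadable share, the noisy model score, the 50% logging propensity, and the 15 pp true uplift for treated persuadables:

```python
import random

random.seed(7)

P_TREAT = 0.5        # known randomization probability in the logged data
N = 40000

rows = []            # (uplift_score, treated, converted) -- hypothetical log
for _ in range(N):
    persuadable = random.random() < 0.2
    # Noisy model score, correlated with true responsiveness.
    score = random.gauss(0.15 if persuadable else 0.0, 0.05)
    treated = random.random() < P_TREAT
    p_conv = 0.05 + (0.15 if (persuadable and treated) else 0.0)
    rows.append((score, treated, 1 if random.random() < p_conv else 0))

def ipw_policy_gain(rows, top_frac):
    """IPW estimate of per-user incremental conversions from treating
    only the top-K% of users ranked by predicted uplift."""
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    top = ranked[: int(len(ranked) * top_frac)]
    gain = sum(y * t / P_TREAT - y * (1 - t) / (1 - P_TREAT) for _, t, y in top)
    return gain / len(rows)

# Treating the top 20% vs. treating everyone: similar total gain,
# far fewer promo codes spent.
print(ipw_policy_gain(rows, 0.2), ipw_policy_gain(rows, 1.0))
```

Dividing each estimate by the treated fraction gives gain per promo code sent, which is the budget-aware comparison that AUUC alone does not provide; with a learned (non-random) logging policy, the constant P_TREAT would be replaced by per-user propensities, ideally inside a doubly robust estimator.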