Product Analytics (IT): Product Analyst

What approach would you choose to evaluate the incremental effect of personalized push notification campaigns on 7-day user retention in a food delivery mobile app, considering time-of-day effects and cross-contamination between segments during a partial rollout?


Answer

Evaluating the incremental effect of personalized push notifications calls for a rigorous quasi-experimental design, because users self-select based on when they are active. Possible cross-contamination through social ties or shared family accounts further complicates isolating the effect.

The core method is difference-in-differences (DiD) with a synthetic control. The control group is formed via propensity score matching on app-opening times and historical order patterns.
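
A minimal sketch of the matching step. All feature names, effect sizes, and the simulated data are illustrative assumptions, not real product internals:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical user features: share of app opens during lunch hours and
# historical weekly order count.
n = 2000
lunch_share = rng.beta(2, 5, n)
weekly_orders = rng.poisson(3, n).astype(float)
X = np.column_stack([lunch_share, weekly_orders])

# Treatment (received personalized pushes) correlates with activity timing,
# which is exactly the self-selection we must adjust for.
treated = (rng.random(n) < 0.2 + 0.5 * lunch_share).astype(int)

# 1. Fit a propensity model P(treated | features).
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Greedy 1-nearest-neighbour matching on the propensity score,
#    without replacement.
treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
matches, used = {}, set()
for i in treated_idx:
    for j in control_idx[np.argsort(np.abs(ps[control_idx] - ps[i]))]:
        if j not in used:
            matches[i] = j
            used.add(j)
            break

# After matching, the propensity-score gap between groups should shrink.
gap_before = abs(ps[treated == 1].mean() - ps[treated == 0].mean())
matched_controls = np.array(list(matches.values()))
gap_after = abs(ps[treated_idx].mean() - ps[matched_controls].mean())
```

The matched control sample then feeds the DiD comparison; checking covariate balance after matching (as `gap_after` does for the score itself) is a standard sanity check.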

To adjust for time-of-day effects, stratification by time zones is applied. Cross-contamination is detected through analysis of device IDs and IP addresses for shared accounts.
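
A toy sketch of the contamination check (user and device IDs are made up; a real pipeline would read these from event logs):

```python
# Flag potential cross-contamination: device IDs seen in both the test and
# control groups suggest shared accounts or shared devices.
test_events = {"u1": {"dev_a", "dev_b"}, "u2": {"dev_c"}}
control_events = {"u3": {"dev_b"}, "u4": {"dev_d"}}

test_devices = set().union(*test_events.values())
contaminated = {
    user for user, devs in control_events.items() if devs & test_devices
}
# u3 shares dev_b with a test-group user and should be excluded or reassigned.
```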

Retention is modeled with a Cox proportional hazards model and reported as a hazard ratio, which accounts for censored observations and heterogeneity in churn risk.

Real-life situation

At Delivery Club, the team planned to deploy an ML model (Python, CatBoost) to personalize the timing of push notifications. The problem: active users predominantly opened the app during lunch hours, creating a self-selection bias.

A partial rollout to 20% of the audience triggered a "word-of-mouth" effect. Users from the control group learned about promotions from colleagues, creating cross-contamination.

The first solution considered was a classic A/B test with geographic segmentation: city A served as the test group, city B as the control.

Pros of this approach: clean isolation of the groups and results that are simple for the business to interpret. Cons: differences in culinary preferences and incomes between the cities introduced a 12-15% bias in baseline retention.

The second option was to analyze only users with notifications turned on (per-protocol analysis). This allowed focusing on the target audience responding to communications.

Pros: high relevance for the product team. Cons: it ignores opt-out bias; users who turned off notifications had three times the baseline churn, distorting the overall estimate of the intervention's effect.

The third solution was Google's CausalImpact, which constructs a synthetic control using Bayesian structural time series (BSTS) for counterfactual modeling.

Pros: it accounts for temporal trends and seasonality without an explicit control group. Cons: high sensitivity to covariate selection, and fragility of the assumption that the pre-intervention relationship between the treated series and its predictors continues to hold after launch.
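
CausalImpact itself fits a BSTS model; the counterfactual idea can be sketched more simply with a pre-period regression on untouched control series (all series, weights, and the +8 lift below are simulated assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
T, T0 = 120, 90  # 120 days total, intervention launches on day 90

# Control series (e.g., regions without the rollout) share a common weekly
# trend with the treated series.
trend = 100 + 0.2 * np.arange(T) + 5 * np.sin(np.arange(T) * 2 * np.pi / 7)
controls = np.column_stack([trend + rng.normal(0, 1, T) for _ in range(3)])
treated = trend + rng.normal(0, 1, T)
treated[T0:] += 8.0  # true post-intervention lift

# Fit weights on the PRE period only, then project the counterfactual
# forward into the post period.
X_pre = np.column_stack([np.ones(T0), controls[:T0]])
w, *_ = np.linalg.lstsq(X_pre, treated[:T0], rcond=None)
counterfactual = np.column_stack([np.ones(T), controls]) @ w

# The estimated effect is the post-period gap between observed and predicted.
estimated_lift = float((treated[T0:] - counterfactual[T0:]).mean())
```

The fragility mentioned above shows up here directly: if the control-to-treated relationship drifts after day 90 for reasons unrelated to the intervention, the gap is misattributed to the campaign.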

The approach ultimately chosen combined Inverse Probability Weighting (IPW), to adjust for self-selection by activity timing, with difference-in-differences using standard errors clustered at the level of geographic clusters.

This preserved individual variability in send times, which is critical for personalization, while cluster-robust inference limited the impact of inter-group spillovers.

The result was the identification of a true incremental effect of +8.3% on 7-day retention. Naive comparisons indicated +15%. The effect turned out to be statistically significant only for the segment "users with 3+ orders in history."

This allowed the campaign budget to be optimized by excluding cold users from the personalized-campaign audience.

What candidates often miss

How to correctly account for seasonality when calculating LTV forecasts for a subscription product with annual and monthly plans amid cohort heterogeneity?

Novices often use simple averaging of historical retention curves without considering that users arriving during Black Friday have a qualitatively different retention profile. Their churn is 2-3 times higher than that of organic users.

The correct approach is to build separate BG/NBD (plus Gamma-Gamma for monetary value) models per cohort, with seasonal dummy variables. An alternative is cohort-based LTV adjusted with Bayesian hierarchical modeling to borrow strength across cohorts (partial pooling).
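
The partial-pooling idea can be sketched without a full Bayesian model. Below, each cohort's retention estimate is shrunk toward the grand mean in proportion to how little data it has; the cohort names, retention rates, and the prior strength `k` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Month-1 retention per acquisition cohort (toy data). The Black Friday
# cohort churns much faster; averaging all cohorts together hides this.
true_retention = {"2023-09": 0.40, "2023-10": 0.42,
                  "2023-11-BF": 0.18, "2023-12": 0.41}
cohort_sizes = {"2023-09": 500, "2023-10": 480,
                "2023-11-BF": 3000, "2023-12": 450}

observed = {c: rng.binomial(nc, true_retention[c]) / nc
            for c, nc in cohort_sizes.items()}

# Crude partial pooling: shrink each estimate toward the grand mean, with
# k pseudo-observations behind the prior (a stand-in for hierarchical
# Bayesian shrinkage).
grand = float(np.mean(list(observed.values())))
k = 200.0
pooled = {c: (cohort_sizes[c] * observed[c] + k * grand)
             / (cohort_sizes[c] + k)
          for c in observed}
```

Large cohorts keep estimates close to their own data, while small cohorts lean on the pool; either way the Black Friday cohort remains visibly distinct rather than being averaged away.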

What is the difference between intent-to-treat (ITT) and treatment-on-the-treated (TOT) analysis when assessing the onboarding tour effect, and when to apply which approach?

ITT analyzes the effect of offering onboarding to all users in the test group, including those who decline. TOT measures the effect of actually completing the tour (the complier average causal effect, CACE).

ITT is conservative and suited to business decisions about scaling the feature, since it reflects real audience behavior including friction. TOT requires an instrumental variable and answers whether mandatory onboarding would be worthwhile.

Choosing the wrong method can overestimate the effect by 40-60%. For TOT, random bugs in tour delivery can serve as an instrument.
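
A simulated illustration of all three estimates, using the randomized offer as the instrument (the Wald estimator). The 0.10 causal effect, the compliance model, and the hidden "motivation" trait are assumptions of the simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000

# Randomized offer of the onboarding tour (the instrument).
offered = rng.integers(0, 2, n)

# Only some offered users complete the tour; a hidden motivation trait
# drives both completion and baseline retention, which is why naively
# comparing completers to non-completers is biased.
motivation = rng.random(n)
completed = (offered == 1) & (rng.random(n) < 0.3 + 0.5 * motivation)

# True causal effect of completing the tour = 0.10.
retention = (0.2 + 0.3 * motivation + 0.10 * completed
             + rng.normal(0, 0.1, n))

itt = retention[offered == 1].mean() - retention[offered == 0].mean()
compliance = completed[offered == 1].mean()
tot = itt / compliance  # Wald estimator: recovers the complier effect

# The naive completers-vs-rest comparison mixes in the motivation gap.
naive = retention[completed].mean() - retention[~completed].mean()
```

Here `itt` is diluted by non-compliance, `tot` recovers the true 0.10 for compliers, and `naive` overshoots because completers were more motivated to begin with.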

How to diagnose the "peeking" problem when conducting sequential A/B testing, and what statistical adjustments should be applied?

Peeking occurs when a test is stopped early as soon as it reaches significance. It is diagnosed by plotting p-values over time: under peeking, the trajectory wanders and repeatedly crosses the 0.05 threshold.

Solutions include Group Sequential Testing with alpha-spending functions (O'Brien-Fleming). An alternative is Bayesian A/B Testing with the ROPE (Region of Practical Equivalence) approach.

Fixing the sample size in advance and enforcing it with data-quality gates in Apache Airflow is also effective. A critical error is using naive confidence intervals without multiple-comparison adjustment (e.g., Bonferroni), which inflates the false-positive rate to 25-30% across five interim checks.
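
The inflation is easy to demonstrate by simulation: run A/A tests (no true effect) with interim looks and stop at the first p < 0.05. The number of simulations and look sizes below are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

n_sims, looks, n_per_look = 2000, 5, 400
false_positives = 0

for _ in range(n_sims):
    # A/A test: both arms drawn from the same distribution.
    a = rng.normal(0, 1, looks * n_per_look)
    b = rng.normal(0, 1, looks * n_per_look)
    for k in range(1, looks + 1):
        m = k * n_per_look
        _, p = stats.ttest_ind(a[:m], b[:m])
        if p < 0.05:  # peeking: stop at the first "significant" look
            false_positives += 1
            break

# Far above the nominal 5% despite there being no real effect.
peeking_fpr = false_positives / n_sims
```

Group-sequential boundaries such as O'Brien-Fleming spend the 5% alpha budget across the looks (very strict thresholds early, near-nominal at the final look) so that the overall false-positive rate stays at 5%.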