Product Analytics (IT): Product Analyst

Develop an approach to evaluate the causal effect of a forced product tour on 7-day retention and depth of feature adoption in a mobile application. The design must account for user self-selection by initial motivation level (motivated users complete onboarding faster and show higher baseline retention) and for a phased feature rollout across regions with differing baseline conversion rates.


Answer to the question

Historical context

Traditionally, product teams evaluated the effectiveness of onboarding by comparing the retention of users who completed the tutorial with that of users who skipped it. This approach led to significant misinterpretations: the observed correlation between tutorial completion and retention reflected not a causal effect of learning but the selection of highly motivated users. As causal inference methods matured, it became industry standard to distinguish intention-to-treat (ITT) from treatment-on-the-treated (TOT) effects and to exploit natural experiments when classical randomization is not feasible.

Problem Statement

The key complexity lies in the endogeneity of self-selection: the decision to undergo onboarding correlates with unobserved user characteristics (motivation, patience) that simultaneously affect future retention. A simple comparison of groups leads to survivorship bias and an overestimated effect. Furthermore, the phased rollout across regions creates an opportunity for a quasi-experiment, but regions differ in cultural factors and baseline metrics, which necessitates controlling for confounding variables.

Detailed Solution

Apply Two-Stage Least Squares (2SLS) using the regional rollout flag as an instrumental variable (IV). In the first stage, the probability of completing onboarding (compliance) is modeled as a function of whether the feature has launched in the user's region. In the second stage, the predicted completion values are used to estimate the effect on retention. To account for regional heterogeneity, Difference-in-Differences (DiD) with fixed effects for region and time is applied. Additionally, a Causal Forest is fit to estimate the Conditional Average Treatment Effect (CATE) and to identify the segments where onboarding yields the most benefit. It is crucial to verify the parallel-trends assumption on pre-rollout data and to check the exclusion restriction for the instrument.
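A minimal sketch of the 2SLS logic on simulated data (all numbers here are invented for illustration): motivation is an unobserved confounder that inflates the naive comparison, while instrumenting actual completion with the regional rollout flag recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated data: 'motivation' is the unobserved confounder;
# the phased regional rollout flag is the instrument Z.
motivation = rng.normal(size=n)
region_launched = rng.integers(0, 2, size=n)

# Completion depends on both the rollout and motivation (self-selection).
completed = (0.8 * region_launched + 0.5 * motivation
             + rng.normal(size=n) > 0).astype(float)

# True causal effect of completion on retention is 0.08;
# motivation independently raises retention, confounding the naive estimate.
retention = 0.08 * completed + 0.3 * motivation + rng.normal(scale=0.5, size=n)

# Naive comparison: inflated, because completers are more motivated.
naive = retention[completed == 1].mean() - retention[completed == 0].mean()

# Stage 1: regress completion on the instrument (plus intercept).
Z = np.column_stack([np.ones(n), region_launched])
stage1 = np.linalg.lstsq(Z, completed, rcond=None)[0]
completed_hat = Z @ stage1

# Stage 2: regress retention on *predicted* completion.
X = np.column_stack([np.ones(n), completed_hat])
stage2 = np.linalg.lstsq(X, retention, rcond=None)[0]
iv_estimate = stage2[1]
print(f"naive: {naive:.3f}, 2SLS: {iv_estimate:.3f}")
```

Note that second-stage standard errors from this manual version are wrong (they ignore the first-stage estimation); a production analysis would use a dedicated IV routine that corrects them.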

Real-life Situation

A mobile language learning app team implemented a mandatory 3-minute interactive tutorial before granting access to free content. The pilot launch showed that users who completed onboarding had a 35% higher 7-day retention than those who closed the app at the tutorial stage. The business wanted to scale the feature to all users, but the analyst suspected a survivorship bias.

Option 1: Simple comparison (naive approach). Compare retention between users who completed onboarding and those who skipped it. Pros: instant calculation, clear uplift metric. Cons: critical selection bias, since users willing to spend 3 minutes upfront are already more engaged; the estimate is overstated by 3-4 times; regional differences in tolerance to friction are not accounted for.

Option 2: A/B test with mandatory onboarding. Randomize at the user level: group A sees the mandatory tutorial, group B gets content right away. Pros: clean randomization eliminates selection. Cons: non-compliance in group A (some users close the app and never return) creates asymmetric attrition; ITT analysis gives a conservative estimate but does not answer what the effect is for those who actually completed the training; potential negative spillover through social networks.

Option 3: Regression Discontinuity Design (RDD) by time. Using the exact moment the feature was launched in a region as a cutoff. Pros: High internal validity for users "on the margin"; does not require a control group within the region. Cons: Local effect (LATE) cannot be generalized to all users; requires a high density of data near the cutoff; seasonality and the day of the launch can distort results.
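The time-cutoff RDD from Option 3 can be sketched as a local linear fit on each side of the launch date; the jump between the two intercepts at day 0 is the estimated discontinuity. The data below are simulated purely for illustration, with a planted jump of +5 p.p.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated daily cohorts indexed by days relative to the regional launch.
days = np.arange(-30, 31)
# Retention trends smoothly over time; the launch adds a +0.05 jump.
retention = (0.40 + 0.001 * days + 0.05 * (days >= 0)
             + rng.normal(scale=0.005, size=days.size))

bandwidth = 15  # keep only cohorts close to the cutoff
left = (days >= -bandwidth) & (days < 0)
right = (days >= 0) & (days <= bandwidth)

def intercept_at_cutoff(x, y):
    # Local linear fit y = a + b*x; its value at x = 0 is the cutoff level.
    coef = np.polyfit(x, y, 1)
    return np.polyval(coef, 0.0)

jump = (intercept_at_cutoff(days[right], retention[right])
        - intercept_at_cutoff(days[left], retention[left]))
print(f"estimated discontinuity: {jump:.3f}")
```

The bandwidth choice embodies the trade-off named above: a narrower window raises internal validity but needs high data density near the cutoff.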

Chosen solution: Combination of IV-approach with regional rollout and Doubly Robust Estimation.

Regions where onboarding had launched served as an instrument for actual tutorial completion (the relevance condition was verified: correlation of 0.82). 2SLS estimated the effect specifically for compliers (users who complete the tutorial only when the feature has rolled out in their region). Additionally, a Synthetic Control was built for each treated region as a weighted combination of control regions with similar pre-trends.
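The synthetic-control step reduces to a constrained least-squares problem: find non-negative weights summing to one that make the weighted control regions reproduce the treated region's pre-period trajectory. A sketch on made-up data (the true mixing weights 0.5/0.3/0.2 are planted for the demonstration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
# Simulated pre-period retention: 20 weeks x 3 control regions,
# all sharing a mild upward trend plus idiosyncratic noise.
controls = rng.normal(0.4, 0.02, size=(20, 3)) + np.linspace(0, 0.05, 20)[:, None]
true_w = np.array([0.5, 0.3, 0.2])
treated_pre = controls @ true_w + rng.normal(scale=0.002, size=20)

# Non-negative weights summing to 1 that best match the treated pre-trend.
loss = lambda w: np.sum((treated_pre - controls @ w) ** 2)
res = minimize(loss, x0=np.full(3, 1 / 3), method="SLSQP",
               bounds=[(0, 1)] * 3,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
weights = res.x
synthetic_pre = controls @ weights  # the region's synthetic counterpart
print("weights:", np.round(weights, 2))
```

After rollout, the post-period gap between the treated region and `controls @ weights` serves as the region-level effect estimate.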

Final Result: the true causal effect was +8% on 7-day retention, versus +35% in the raw data. Onboarding turned out to be effective only for users with low initial engagement (CATE = +15%) while creating friction for power users (CATE = -3%). An adaptive system was implemented: onboarding was shown only to users with a low predicted engagement score based on the first 10 seconds of the session. This yielded +12% global retention without losing power users.

What candidates often overlook

Why does the A/B test with mandatory onboarding give a biased estimate even with randomization, and how to correctly interpret the results?

Answer: the problem is non-compliance and differential attrition. Even with random assignment to the test group with mandatory onboarding, a portion of users leaves for good (never-takers), while the control group faces no such "penalty" for refusal. This creates asymmetric survivorship bias. For a correct estimate, compute the Intent-to-Treat (ITT) effect as the difference between groups by actual assignment, then apply the Wald estimator to obtain the Complier Average Causal Effect (CACE): CACE = ITT / (share of compliers). It is important to check that the share of compliers is sufficient (>20%); otherwise the estimate will be unstable (weak instrument problem).
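The Wald calculation above is simple arithmetic; a sketch with made-up group statistics (all numbers hypothetical):

```python
# Hypothetical A/B test summary statistics.
p_treat_completed = 0.55   # share completing onboarding in the test group
p_ctrl_completed = 0.05    # share completing in control (optional entry point)
itt = 0.021                # observed retention difference by *assignment*

# Share of compliers = first-stage difference in completion rates.
compliers = p_treat_completed - p_ctrl_completed
assert compliers > 0.20, "weak instrument: CACE would be unstable"

# Wald estimator: rescale ITT by the complier share.
cace = itt / compliers
print(f"compliers: {compliers:.2f}, CACE: {cace:.3f}")  # 0.50 and 0.042
```

The rescaling makes explicit why a small complier share is dangerous: the denominator amplifies both the ITT estimate and its sampling noise.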

How to diagnose and correct negative spillover effects when users from control regions learn about the new onboarding and change behavior before the actual launch?

Answer: this violates SUTVA (the Stable Unit Treatment Value Assumption). To diagnose it, build event-study plots of installs in control regions and look for abnormal decreases (a chilling effect) before the rollout. If spillover is confirmed, apply spatial Difference-in-Differences, using only geographically distant regions without social ties as controls, or run a partial-population experiment that treats a random subsample of users within the region. Alternatively, use two-way fixed effects with distance to the nearest treated region included as a control via an interaction term.
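A minimal sketch of the event-study diagnostic on simulated data (the anticipation dip is planted for illustration): compare each pre-rollout week with a clean reference window and flag deviations beyond a noise threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated weekly installs in one control region, weeks -12..+8 relative
# to the neighbouring region's rollout; a chilling-effect dip starts at -2.
weeks = np.arange(-12, 9)
installs = 1000 + rng.normal(scale=5, size=weeks.size)
installs[weeks >= -2] -= 60  # anticipation spillover violating SUTVA

# Reference window: weeks well before the launch, assumed uncontaminated.
reference = weeks < -4
baseline = installs[reference].mean()
ref_sd = installs[reference].std(ddof=1)

# Event-study coefficients: deviation of each week from the baseline.
coefs = installs - baseline

# Flag abnormal pre-rollout movement in the last weeks before launch.
pre_launch = (weeks >= -4) & (weeks < 0)
anticipation = bool(np.any(np.abs(coefs[pre_launch]) > 3 * ref_sd))
print("anticipation detected:", anticipation)
```

If the flag fires, the region should be dropped from the control pool or handled with the spatial DiD design described above.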

Why is it important to distinguish between short-term friction and long-term value accumulation when choosing an observation horizon, and what methods allow evaluating the long-term effect with limited data?

Answer: onboarding creates short-term friction, mechanically lowering day-0 retention, but accumulates long-term value through better product understanding. Evaluating over a short window (1-3 days) may show a negative effect driven by churn of poorly motivated users who would have had low LTV anyway. To assess long-term effects with limited data, use a Surrogate Index: on historical pre-launch data, fit a model linking short-term metrics (first-session depth, number of features viewed) to the long-term outcome (30-day retention); then estimate the treatment effect on the surrogate, which proxies the long-term effect. It is essential to check the unconfoundedness of the surrogate through sensitivity analysis.
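The surrogate-index procedure can be sketched in two steps on simulated data (all coefficients and effect sizes are invented): fit the surrogate model on historical data, then score both experiment arms and difference the predictions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Simulated historical (pre-launch) data: short-term surrogates and the
# long-run outcome they must proxy for.
depth = rng.normal(3.0, 1.0, size=n)              # depth of first session
features = rng.poisson(4, size=n).astype(float)   # features viewed
retention_30d = (0.1 + 0.05 * depth + 0.02 * features
                 + rng.normal(scale=0.05, size=n))

# Step 1: surrogate index = model mapping short-term metrics to 30-day retention.
X = np.column_stack([np.ones(n), depth, features])
beta = np.linalg.lstsq(X, retention_30d, rcond=None)[0]

# Step 2: experiment observed only over a short window; onboarding shifts
# the surrogates (+0.4 session depth, +1 feature viewed, on average).
m = 2_000
t_depth, t_feat = rng.normal(3.4, 1.0, m), rng.poisson(5, m).astype(float)
c_depth, c_feat = rng.normal(3.0, 1.0, m), rng.poisson(4, m).astype(float)

def surrogate_index(d, f):
    return beta[0] + beta[1] * d + beta[2] * f

# Estimated long-term effect = difference in predicted 30-day retention.
effect = surrogate_index(t_depth, t_feat).mean() - surrogate_index(c_depth, c_feat).mean()
print(f"surrogate-index effect on 30-day retention: {effect:+.3f}")
```

The sketch also makes the key assumption visible: the mapping `beta` learned before launch must still hold for treated users, which is exactly what the sensitivity analysis mentioned above probes.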