Product Analytics (IT): Product Analyst

Describe the approach to isolating the causal effect of a referral program on long-term LTV in the presence of endogeneity from self-selection and delays in conversion of invited users.


Answer to the question

The key issue in evaluating referral programs is endogeneity from self-selection: highly engaged users already have higher LTV and invite friends more often, creating the illusion of an efficient channel. To assess the program properly, we apply causal inference methods: Propensity Score Matching (PSM) to remove bias from observed characteristics, or Instrumental Variables (IV) when a randomized instrument is available (e.g., random display of a banner).

To account for the time lag between sending an invitation and the referral's conversion, we use Survival Analysis (the Kaplan-Meier estimator or Cox Proportional Hazards) instead of simple cohort analysis. This properly handles right-censored data, where some users have not yet completed their life cycle. LTV is then calculated by integrating the retention curve with discounting, or by using a BTYD model (e.g., Pareto/NBD) to predict future transactions.
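As a minimal sketch of this pipeline, the snippet below implements the Kaplan-Meier estimator by hand on simulated right-censored data and integrates the survival curve over a 90-day window to get a restricted expected lifetime. The exponential churn rate, the observation cutoffs, and the daily contribution margin are all invented for illustration; in practice a library such as lifelines would replace the hand-rolled estimator.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimator: S(t) evaluated at each distinct event time."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])  # event (churn) times only
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)            # still under observation at t
        events = np.sum((durations == t) & observed)
        s *= 1.0 - events / at_risk                 # product-limit update
        surv.append(s)
    return times, np.array(surv)

rng = np.random.default_rng(0)
true_life = rng.exponential(scale=60.0, size=5000)  # assumed days until churn
cutoff = rng.uniform(0, 90, size=5000)              # how long we observed each user
durations = np.minimum(true_life, cutoff)
observed = true_life <= cutoff                      # False = right-censored

times, surv = kaplan_meier(durations, observed)

# Restricted mean lifetime on [0, 90] = integral of the step function S(t)
grid = np.append([0.0], times)
s_step = np.append([1.0], surv)
rmst = np.sum(np.diff(np.append(grid, 90.0)) * s_step)

daily_margin = 1.2          # hypothetical contribution margin per active day
ltv_90 = rmst * daily_margin
```

Integrating the discrete survival curve this way is the "integrate the retention curve" step from the answer; discounting would multiply each interval's width by a discount factor before summing.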

Real-life situation

Context: In a food delivery mobile app, we launched a referral program with two-sided bonuses. After a quarter, reporting in Tableau showed that users who activated the referral link had an LTV 40% higher than the average on the platform. The Product Manager demanded budget scaling, but the analytics team suspected that the difference was caused not by the program but by the underlying characteristics of super-users.

Problem: It was impossible to separate the true incremental effect from correlation with engagement. Using simple SQL queries to compare groups resulted in a biased estimate due to confounders (order frequency, time in product). Without a proper assessment, the business risked overpaying for a channel with negative or near-zero margin.

Solution 1: Direct cohort comparison through SQL

We compared the "Invited" cohort (treatment) against the "Not Invited" cohort (control) using aggregation in BigQuery, calculating ARPU and day-90 retention.

Pros: Instant implementation, clear visualization for stakeholders, low resource requirements.

Cons: Critical self-selection bias and survivor bias. Users who were already planning to stay in the product are more likely to use referrals. The result is overstated and unsuitable for decision-making.

Solution 2: Propensity Score Matching on historical data

In Python (scikit-learn), we built a logistic regression model to assess the propensity score — the probability of participating in the program based on pre-treatment characteristics (account age, order history, average check). We then applied Nearest Neighbors for 1:1 matching and compared LTV only in comparable subgroups.

Pros: Eliminates bias from observed variables (observable confounders), works on retrospective data without the need for experimentation. Quickly provides an estimate of ATT (Average Treatment Effect on the Treated).

Cons: Does not eliminate unobserved characteristics (unobserved confounders), such as extraversion or social capital. With unbalanced data (few inviters), common support issues arise, and part of the sample is discarded, reducing power.
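A hedged sketch of the matching step described above, on simulated data (the covariate coefficients, the selection mechanism, and the +10 true effect are all invented for illustration): the naive comparison overstates the effect because engaged users self-select into treatment, while 1:1 nearest-neighbor matching on the propensity score recovers an ATT close to the truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 4000
# Pre-treatment covariates: account age, order count, average check (simulated)
X = rng.normal(size=(n, 3))
# Engaged users self-select into the referral program
p_treat = 1 / (1 + np.exp(-(X @ np.array([1.0, 1.0, 0.5]))))
T = rng.uniform(size=n) < p_treat
# Outcome: LTV depends on the same covariates plus a true effect of +10
ltv = 50 + X @ np.array([8.0, 8.0, 4.0]) + 10.0 * T + rng.normal(0, 5, n)

naive = ltv[T].mean() - ltv[~T].mean()   # biased upward by self-selection

# Propensity score from pre-treatment features, then 1:1 nearest-neighbor match
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
nn = NearestNeighbors(n_neighbors=1).fit(ps[~T].reshape(-1, 1))
_, idx = nn.kneighbors(ps[T].reshape(-1, 1))
att = (ltv[T] - ltv[~T][idx.ravel()]).mean()
```

A caliper on the matching distance and a covariate-balance check (standardized mean differences before vs. after matching) would normally accompany this; they are omitted to keep the sketch short.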

Solution 3: Instrumental variables and Survival Analysis

We found a natural experiment: 50% of users were randomly shown a banner for the referral program on the main screen (instrument Z), which influenced the likelihood of participation (X) but did not affect LTV (Y) directly. We estimated the effect with 2SLS (Two-Stage Least Squares) in the linearmodels library for Python, obtaining the LATE (Local Average Treatment Effect). To account for lags, we applied Survival Analysis: we modeled the hazard of time to a referral's first order and adjusted LTV for the conversion probability at each point in time.

Pros: The IV method eliminates both observed and unobserved confounders, providing a causal estimate. Survival analysis properly handles incomplete data and allows modeling temporal dynamics.

Cons: Requires a valid instrument (relevance and exogeneity), which is hard to prove. Reduced statistical power of IV estimates (wide confidence intervals). The interpretation of LATE differs from ATE (average effect only for "compliers").
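The 2SLS logic can be sketched in plain NumPy on synthetic data (the answer uses linearmodels; the hand-rolled two-stage regression below is equivalent for this simple one-instrument case). The instrument strength, the +12 true effect, and the unobserved confounder are assumed numbers. OLS absorbs the confounder and overstates the effect; 2SLS recovers something close to the true LATE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
u = rng.normal(size=n)                        # unobserved "sociability" confounder
z = rng.integers(0, 2, size=n).astype(float)  # randomized banner (instrument Z)
# Participation D is driven by the banner and by the confounder
d = (0.8 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(float)
# LTV: true causal effect of participation is +12, but U also raises LTV
y = 100 + 12.0 * d + 20.0 * u + rng.normal(0, 10, n)

ols = np.polyfit(d, y, 1)[0]   # biased: absorbs the confounder

# 2SLS by hand: stage 1 predicts D from Z, stage 2 regresses Y on D-hat
Z1 = np.column_stack([np.ones(n), z])
d_hat = Z1 @ np.linalg.lstsq(Z1, d, rcond=None)[0]
D1 = np.column_stack([np.ones(n), d_hat])
late = np.linalg.lstsq(D1, y, rcond=None)[0][1]
```

With a single binary instrument this reduces to the Wald estimator: the difference in mean LTV between banner groups divided by the difference in participation rates. Note the cons above apply literally here: the 2SLS estimate has a much wider confidence interval than the (wrong) OLS one.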

Chosen solution:

We chose a hybrid approach: banner randomization for an IV estimate of the pure effect of participation, followed by a semiparametric survival model (Cox with time-varying covariates) to compute expected LTV given referral conversion times. This allowed us to separate the program effect from the self-selection effect.

Result:

The true incremental effect was +12% to LTV for the complier group, not +40% as in the initial report. Lag analysis showed that 85% of referral conversions occur within 14 days of the click, which allowed us to shorten the evaluation horizon from 90 to 30 days. The business revised its unit economics, cutting customer acquisition cost (CAC) by 18% because budget decisions no longer had to wait out the full 90-day retention window.

Commonly overlooked by candidates

Question 1: How to verify the SUTVA assumption (no interference between units) in a referral program where there are network effects among inviters?

SUTVA is violated if the density of invitations within a social circle affects conversion probability (e.g., oversaturation or a viral effect). To check this, we use clustering: splitting users into geographic clusters, or into segments derived from the social graph via graph analysis (e.g., community detection in NetworkX).

Then, we apply Difference-in-Differences, comparing clusters with high and low penetration of referral links. If the effect in dense clusters significantly differs (lower due to oversaturation or higher due to social proof), SUTVA is violated, and we need to use models with inter-group interactions (spatial models) or limit the analysis to isolated segments.
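The DiD check can be illustrated with simulated cluster-level conversion rates (all rates, the common trend, and the spillover size are hypothetical): if high-penetration clusters improve by more than low-penetration ones beyond the shared trend, there is interference between units.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 200  # hypothetical number of clusters per group
# Pre/post weekly conversion rates; both groups share a +0.010 common trend,
# high-penetration clusters get an extra +0.012 spillover (social proof)
pre_low   = rng.normal(0.050, 0.004, k)
post_low  = pre_low + 0.010 + rng.normal(0, 0.004, k)
pre_high  = rng.normal(0.050, 0.004, k)
post_high = pre_high + 0.010 + 0.012 + rng.normal(0, 0.004, k)

# Difference-in-Differences: change in high-penetration clusters
# minus change in low-penetration clusters
did = (post_high.mean() - pre_high.mean()) - (post_low.mean() - pre_low.mean())
# A did estimate materially different from zero signals a SUTVA violation
```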

Question 2: Why can't we use ordinary least squares regression (OLS) for predicting LTV in conditions of censored data, when some users have not yet churned?

OLS ignores right-censoring and treats current LTV as final, which systematically underestimates LTV for "young" users. Instead, we apply Survival Analysis to estimate the retention curve S(t), then integrate it to obtain expected lifetime.

Alternatively, we use probabilistic repeat purchase models (BTYD), such as Pareto/NBD or Gamma-Gamma, implemented in the lifetimes library for Python. These models account for unseen transactions through probability distributions of frequency and time between purchases, providing an unbiased estimate of future LTV even for active users.
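The censoring bias that rules out naive OLS can be demonstrated in a few lines of NumPy (exponential lifetimes and uniform observation ages are assumed purely for illustration; the lifetimes library itself is not exercised here): averaging observed-so-far durations as if they were final substantially understates the true mean lifetime.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000
lifetime = rng.exponential(60.0, n)        # true days until churn (unknown in practice)
age = rng.uniform(0, 120, n)               # how long each user has been observed
observed_life = np.minimum(lifetime, age)  # right-censored durations
is_censored = lifetime > age               # these users have not churned yet

naive_mean = observed_life.mean()   # what treating current LTV as final would use
true_mean = lifetime.mean()         # the quantity we actually want to estimate
# naive_mean badly underestimates true_mean, and the gap is worst for
# "young" users (small age), exactly as warned above
```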

Question 3: How to distinguish incremental invites (invitations that occurred only because of the program) from organic invites (which would have occurred without stimulation) when assessing the effect?

We use the Principal Stratification framework, dividing the population into four strata: Always-takers (would invite regardless), Compliers (invite only because of the program), Never-takers (never invite), and Defiers (do the opposite of the encouragement). Through IV analysis with a binary instrument (e.g., saw / did not see the banner), we estimate the LATE, i.e., the effect specifically for Compliers.

For more detailed segmentation, we use Causal Machine Learning methods (EconML, CausalML in Python), such as Causal Forest or Meta-learners (S-Learner, T-Learner), to estimate the Conditional Average Treatment Effect (CATE) for different segments. This allows us to understand which users (e.g., low/high check) generate truly incremental invites from the program and which simply capture organic sharing.
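A minimal T-Learner sketch with scikit-learn on simulated data (the segment split at avg_check = 0.5 and the effect sizes of 15 vs. 2 are invented; treatment is randomized here for simplicity, whereas the real case would first require the IV or matching adjustments discussed above). One model is fit on treated users, one on controls, and their prediction gap is the CATE per user.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 8000
avg_check = rng.uniform(0, 1, n)   # normalized average check (simulated)
T = rng.integers(0, 2, n)          # treatment, randomized for this sketch
# Low-check users respond strongly; high-check users would share organically
tau = np.where(avg_check < 0.5, 15.0, 2.0)
y = 40 + 30 * avg_check + tau * T + rng.normal(0, 3, n)

# T-Learner: separate outcome models for treated and control
X = avg_check.reshape(-1, 1)
m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], y[T == 0])
cate = m1.predict(X) - m0.predict(X)

low_seg = cate[avg_check < 0.5].mean()    # truly incremental invites
high_seg = cate[avg_check >= 0.5].mean()  # mostly organic sharing captured
```

The same per-user CATE scores would then drive targeting: pay bonuses where the estimated uplift is high and stop subsidizing segments that would share organically anyway.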