Product Analytics (IT): Product Analyst

How would you assess the causal effect of implementing a gamification system (badges for course reviews) on user engagement depth in an edtech application, using a quasi-experimental approach when a classic A/B test cannot be conducted?


Answer to the question

To estimate the effect without randomization, construct a matched control group using Propensity Score Matching (PSM) and then apply the Difference-in-Differences (DiD) method. First, estimate each user's probability of receiving a badge (via logistic regression) from pre-treatment data (activity, demographics, retention) and match 'treated' users to similar 'control' users. Then compare the dynamics of the target metric (engagement depth) between the two groups, which separates the badge effect from general growth trends.
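As a rough illustration of the matching step, here is a minimal numpy sketch on simulated data (the covariates, coefficients, and sample sizes are all invented for the example): a hand-rolled logistic regression produces propensity scores, and each treated user is greedily matched to the nearest control, which should shrink the covariate gap between groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Pre-launch covariates (simulated): weekly logins and completed courses.
logins = rng.normal(5.0, 2.0, n)
courses = rng.poisson(2, n).astype(float)
# Self-selection: more active users are likelier to leave reviews and get a badge.
p_true = 1 / (1 + np.exp(-(-3.0 + 0.4 * logins + 0.3 * courses)))
treated = rng.random(n) < p_true

# Standardize covariates, then fit a logistic regression by gradient ascent.
X = np.column_stack([logins, courses])
X = (X - X.mean(0)) / X.std(0)
Xd = np.column_stack([np.ones(n), X])
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xd @ w))
    w += 0.5 * Xd.T @ (treated - p) / n
score = 1 / (1 + np.exp(-Xd @ w))

# Greedy 1:1 nearest-neighbour matching (with replacement) on the score.
t_idx = np.flatnonzero(treated)
c_idx = np.flatnonzero(~treated)
match = c_idx[np.abs(score[c_idx][None, :] - score[t_idx][:, None]).argmin(1)]

# Covariate balance on logins: the raw gap should shrink after matching.
gap_before = logins[treated].mean() - logins[~treated].mean()
gap_after = logins[t_idx].mean() - logins[match].mean()
```

In production this would typically run through sklearn or directly in BigQuery SQL, but the logic is the same: score, match, then check balance before trusting any outcome comparison.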

It is critical to test the parallel-trends assumption with an event-study analysis: fit a regression with leads and lags of treatment and verify that the pre-treatment coefficients are statistically insignificant. To increase sensitivity, apply CUPED (in Python or R), using pre-experiment covariates to reduce variance. The resulting ATT (Average Treatment Effect on the Treated) then measures the effect of gamification, unbiased to the extent that the identifying assumptions hold.
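The CUPED adjustment itself is nearly a one-liner once a pre-period covariate is chosen. A minimal sketch on simulated data (all numbers are illustrative): theta is the regression coefficient of the metric on the pre-period covariate, and subtracting the predictable part lowers variance without shifting the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Pre-experiment engagement (covariate) and the post-period metric it predicts.
x_pre = rng.normal(10.0, 3.0, n)
y = 2.0 + 0.8 * x_pre + rng.normal(0.0, 2.0, n)

# CUPED: theta = cov(Y, X_pre) / var(X_pre), then remove the predictable part.
theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
y_cuped = y - theta * (x_pre - x_pre.mean())
```

The stronger the correlation between the metric and its pre-period covariate, the larger the variance reduction, and hence the smaller the detectable effect size.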

Real-Life Situation

The company "EduTech" launched a motivation program: users received digital badges for leaving course reviews. Technical limitations of the legacy backend ruled out random audience splitting, so the analyst had to measure the impact on the metric "engagement depth" (average number of lessons viewed per week) under strong self-selection: the most active students were the ones leaving reviews, creating an obvious bias.

Four approaches to solving the problem were considered.

Simple comparison of post-launch means between those who received badges and those who did not. The main advantage is speed: it is a single SQL query with no complex data preparation. The critical drawback is that it completely ignores self-selection: active users grow faster anyway (a maturation effect), so the comparison overstates the effect and leads to false conclusions about effectiveness.

Before-after analysis on the badge group alone. Its advantages are that intergroup differences disappear (the same users are compared to themselves) and a paired t-test is easy to run. However, it cannot separate the badge effect from a general seasonal rise in activity (the start of the academic year) or from simultaneous changes to recommendation algorithms, which makes the conclusions unreliable.
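The pitfall is easy to demonstrate with simulated data (the numbers below are invented): every user's activity rises by a purely seasonal lift with no badge effect at all, yet the paired t-test comes back highly "significant".

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
before = rng.normal(4.0, 1.5, n)      # lessons per week before the launch
seasonal_lift = 0.5                   # start-of-semester effect, not badges
after = before + seasonal_lift + rng.normal(0.0, 0.8, n)

# Paired t-test on the same users: a large t-statistic, but the "effect"
# is entirely seasonal; attributing it to badges would be wrong.
diff = after - before
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
```

The test correctly detects that activity changed; it simply cannot say why, which is the whole problem with before-after designs.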

OLS regression with control for covariates, adding variables for past activity. It is quick to implement in statsmodels and yields interpretable coefficients. But the method assumes the relationships are linear, is sensitive to outliers, and ignores individual user trends over time, any of which can distort the estimate.
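A toy numpy sketch of this approach on simulated data (statsmodels' sm.OLS would give the same point estimates; all coefficients here are invented): the naive difference in means is inflated by self-selection, while the regression that controls for past activity recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000
past_activity = rng.normal(5.0, 2.0, n)
# Badge receipt depends on past activity: classic self-selection.
badge = (past_activity + rng.normal(0.0, 1.0, n) > 6).astype(float)
true_effect = 1.5
y = 1.0 + true_effect * badge + 0.7 * past_activity + rng.normal(0.0, 1.0, n)

# Naive comparison is biased upward by selection on past activity.
naive = y[badge == 1].mean() - y[badge == 0].mean()

# OLS with the covariate: intercept, treatment dummy, past activity.
X = np.column_stack([np.ones(n), badge, past_activity])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = beta[1]
```

Note that this only works because selection here runs entirely through an observed covariate; with unobserved drivers of selection the adjusted estimate would still be biased, which is exactly the limitation discussed below.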

PSM + Difference-in-Differences (the chosen solution). We performed Propensity Score Matching in BigQuery, fitting a logistic regression on pre-launch predictors (login frequency, completed courses), and then applied DiD with fixed effects for users and weeks. Advantages: selection bias on observable characteristics is minimized, and common time trends are differenced out as long as the parallel-trends assumption holds. Drawbacks: higher computational complexity and the criticality of that assumption, which must be verified with event-study graphs.
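To make the DiD logic concrete, here is a simplified numpy sketch on a simulated, balanced panel (the production version would fit the full two-way fixed-effects regression; user counts, trends, and the effect size below are invented). With a common time trend and user-level heterogeneity, the 2x2 difference of differences recovers the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_weeks, launch = 400, 10, 5
user_fe = rng.normal(0.0, 1.0, n_users)       # persistent user differences
week_fe = np.linspace(0.0, 2.0, n_weeks)      # common upward trend
treated = rng.random(n_users) < 0.5
true_att = 0.8

# lessons[i, t]: user effect + week effect + badge effect after launch + noise.
lessons = (user_fe[:, None] + week_fe[None, :]
           + true_att * treated[:, None] * (np.arange(n_weeks) >= launch)
           + rng.normal(0.0, 0.5, (n_users, n_weeks)))

# 2x2 DiD: (treated post - treated pre) - (control post - control pre).
pre, post = slice(0, launch), slice(launch, n_weeks)
did = ((lessons[treated][:, post].mean() - lessons[treated][:, pre].mean())
       - (lessons[~treated][:, post].mean() - lessons[~treated][:, pre].mean()))
```

The first difference removes the user fixed effects; the second removes the common week trend. This is exactly why the estimate survives both self-selection on levels and seasonal growth, but not diverging pre-trends.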

This approach was chosen because it yields the least biased estimate achievable with purely observational data. The analysis showed that badges increased engagement by 12%, but only for users with less than three months of tenure. For 'veterans' the effect was statistically insignificant, which prompted the product team to revise the badge allocation rules and focus on onboarding.

What candidates often miss

How to check that the parallel trends assumption in DiD is not violated if we have no experiment?

Candidates often limit themselves to a visual comparison of the graphs and skip formal testing. Build an event-study regression with dummy variables for each period before and after treatment. If the coefficients on the "before" periods are statistically significant (p-value < 0.05), the assumption is violated. In that case, switch to the Synthetic Control Method, which constructs a control group whose pre-intervention trend closely tracks the treated group; CUPED-style adjustment on pre-period covariates can further reduce variance, but it does not by itself repair a pre-trend violation.
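A compact numpy sketch of such an event-study regression on simulated data (panel size, trend, and effect are invented; parallel trends hold by construction). Treated-by-week interaction dummies are included for every period except the one just before launch, which serves as the baseline; pre-launch coefficients should be near zero and post-launch ones near the true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_weeks, launch = 300, 8, 4
treated = np.repeat(rng.random(n_users) < 0.5, n_weeks)
week = np.tile(np.arange(n_weeks), n_users)
# Common trend for everyone; effect of 1.0 for treated from launch onward.
y = (0.3 * week + 1.0 * treated * (week >= launch)
     + rng.normal(0.0, 0.5, n_users * n_weeks))

# Design matrix: constant, treated dummy, week dummies, event-time dummies.
names, cols = ["const", "treated"], [np.ones_like(y), treated.astype(float)]
for t in range(1, n_weeks):
    cols.append((week == t).astype(float))
    names.append(f"week_{t}")
for t in range(n_weeks):
    if t == launch - 1:
        continue  # omitted baseline period
    cols.append((treated & (week == t)).astype(float))
    names.append(f"evt_{t}")

beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
evt = {n: b for n, b in zip(names, beta) if n.startswith("evt_")}
```

With real data one would also compute standard errors (statsmodels makes this easy) and plot the coefficients with confidence intervals; a flat pre-period line is the visual signature of parallel trends.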

Why does Propensity Score Matching not solve the endogeneity problem from unobserved characteristics (selection on unobservables)?

PSM balances only observable covariates (age, activity); if a hidden motivator exists (say, "love of learning") that is hard to quantify, bias remains. One remedy is an instrumental variable (IV), such as geographic distance to the nearest offline center, which correlates with the likelihood of receiving a badge but does not directly affect engagement. An alternative is Regression Discontinuity Design (RDD) when the badge threshold is sharp (e.g., exactly 3 reviews), creating exogenous variation around the cutoff.
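A toy sharp-RDD sketch for the "exactly 3 reviews" cutoff (all numbers simulated and illustrative): engagement rises smoothly with review activity, plus a discontinuous jump at the threshold. Fitting a local linear regression on each side and comparing the two intercepts at the cutoff recovers the jump.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
reviews = rng.integers(0, 11, n).astype(float)   # running variable, 0..10
badge = reviews >= 3                             # sharp assignment rule
jump = 0.6
# Engagement trends smoothly in activity; the badge adds a discontinuity.
y = 1.0 + 0.2 * reviews + jump * badge + rng.normal(0.0, 0.4, n)

def side_fit(mask):
    """Local linear fit; returns the predicted outcome at the cutoff."""
    X = np.column_stack([np.ones(mask.sum()), reviews[mask] - 3])
    b, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return b[0]

left = side_fit((reviews >= 0) & (reviews < 3))   # bandwidth: 3 counts below
right = side_fit((reviews >= 3) & (reviews <= 5)) # bandwidth: 3 counts above
rdd_estimate = right - left
```

The estimate is local to users near the cutoff, which is also its honest interpretation: it says little about users far from the threshold. With a discrete running variable like review counts, standard errors should additionally be clustered by the running-variable value.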

How to handle violations of SUTVA (Stable Unit Treatment Value Assumption) in gamification when the effect is "contagious" through the social graph?

If friends see badges and also start leaving reviews, standard DiD gives a biased estimate that mixes direct and indirect effects. Remedies include clustering standard errors by friend group, or a two-stage sampling design in which users connected to "treated" users are excluded from the control group. Spillovers can also be assessed explicitly via mediation analysis in Python (the causalml or mediation libraries), decomposing the total effect into a direct component (on the user themselves) and an indirect one (through friends) so the true effect is not underestimated.
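The clustering point can be shown with a small numpy sketch (group counts, shock sizes, and the effect are invented): when treatment is assigned at the friend-group level and outcomes share a group shock, the naive iid standard error badly understates uncertainty, while the cluster-robust sandwich estimator accounts for it.

```python
import numpy as np

rng = np.random.default_rng(6)
n_groups, group_size = 60, 20
g = np.repeat(np.arange(n_groups), group_size)

# Treatment assigned per friend group; outcomes share a group-level shock,
# so observations within a group are not independent.
treat_group = (rng.random(n_groups) < 0.5).astype(float)
d = treat_group[g]
y = 0.5 * d + rng.normal(0.0, 1.0, n_groups)[g] + rng.normal(0.0, 0.5, len(g))

X = np.column_stack([np.ones_like(y), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Naive iid standard error for the treatment coefficient.
naive_se = np.sqrt(resid @ resid / (len(y) - 2) * XtX_inv[1, 1])

# Cluster-robust (sandwich) SE: sum score outer products within each group.
meat = np.zeros((2, 2))
for k in range(n_groups):
    s = X[g == k].T @ resid[g == k]
    meat += np.outer(s, s)
clustered_se = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])
```

In practice statsmodels provides the same computation via `cov_type="cluster"`; the sketch just makes explicit why ignoring the group structure inflates apparent precision.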