Product Analyst (IT)

What method should be used to evaluate the causal effect of the phased implementation of mandatory two-factor authentication (2FA) on 30-day retention and transaction frequency, if there is self-selection of users based on their level of technical literacy, and the data is subject to seasonal fluctuations in activity?


Answer to the question

Historically, evaluation of frictional security measures such as 2FA has evolved from naive before/after comparisons to quasi-experimental methods. When a classical A/B test is impossible due to technical constraints in the authentication architecture or ethical security considerations, analysts turn to difference-in-differences (DiD) methods, which separate the effect of the intervention from temporal trends. The main challenge is that users willing to accept the additional friction of 2FA systematically differ from others: they tend to be highly engaged or unusually security-conscious, which creates self-selection endogeneity and distorts simple correlational estimates.

The problem requires isolating the true effect of forced authentication from confounders: seasonal peaks in activity (e.g., pre-holiday sales), natural degradation of retention in new cohorts, and differences in the baseline characteristics of users adopting security measures. Without a proper identification strategy, a business might mistakenly attribute a natural seasonal drop in activity to the negative effect of 2FA, or conversely, attribute the self-selection effect to the success of the feature, leading to unjustified extensions of frictional measures to the entire audience.

A detailed solution uses Staggered Difference-in-Differences (DiD) with a cohort-based approach, where different user groups (cohorts) receive mandatory 2FA at different points in time. For each cohort, the control group consists of users registered just before the intervention (a regression-discontinuity-style threshold) or of cohorts that have not yet been treated. To adjust for self-selection, Inverse Probability Weighting (IPW) is employed: weights are constructed from prior behavior (history of biometric use, frequency of password changes) to balance the characteristics of the groups. Seasonality is accounted for through fixed time effects (weekly or monthly dummy variables). Robustness checks employ the Synthetic Control Method (which weights untreated cohorts to mimic the trend of treated ones) and an Event Study (to visualize the dynamics of the effect before and after implementation and to test the parallel-trends assumption).
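A minimal numpy sketch of the IPW step on simulated data (all covariates, rates, and effect sizes here are invented for illustration, not taken from the case): self-selection makes the naive comparison flip sign, while inverse-probability weights recover the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# x = 1: prior biometric use (an invented proxy for technical literacy)
x = rng.integers(0, 2, n)
# self-selection: tech-savvy users opt in to 2FA far more often
p_true = np.where(x == 1, 0.7, 0.2)
d = rng.random(n) < p_true                     # 2FA adoption
true_effect = -0.08                            # assumed effect on 30-day retention
y = 0.40 + 0.20 * x + true_effect * d + rng.normal(0, 0.05, n)

# naive contrast: biased by self-selection (here it even flips sign)
naive = y[d].mean() - y[~d].mean()

# IPW: estimate the propensity within covariate cells, weight by its inverse
p_hat = np.array([d[x == v].mean() for v in (0, 1)])[x]
w = np.where(d, 1.0 / p_hat, 1.0 / (1.0 - p_hat))
ate = np.average(y, weights=w * d) - np.average(y, weights=w * ~d)
```

With these invented numbers, `naive` comes out positive (2FA looks beneficial purely through selection on retention-prone users), while `ate` lands close to the simulated -0.08.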

Real-life situation

In a mobile bank, it was decided to make 2FA via SMS and TOTP apps mandatory for all logins, abandoning optionality after a rise in fraud. The rollout was organized by registration-date cohorts: users registered before March 1 were left unchanged (control), while each subsequent week of new registrations received mandatory 2FA (treatment). Two weeks after the start, metrics showed a catastrophic 25% drop in 30-day retention among the treated cohorts, causing panic in the product department and suggestions to roll back the change.

The first option considered was a simple comparison of the retention rate between users with 2FA and those without, over the same observation period. The pro of this approach is that it is immediately computable and easy to explain; the con is a fatal methodological error: users who voluntarily enabled 2FA before the mandatory rollout were hyper-engaged or unusually security-conscious, and their natural retention was 40% higher, making the comparison invalid.

The second option was to analyze cohort retention curves without controlling for time, simply comparing the curves of "March" and "February" users visually. The pro is that it accounts for different starting points in the lifecycle; the cons are that it ignores seasonality (March is a tax-payment period with peak activity followed by a natural decline) and cannot separate the effect from the overall decline in traffic quality from new advertising channels launched in March.

The third option was Staggered DiD using the Callaway-Sant'Anna method to estimate group-time effects (group-time ATT), with propensity score matching within each cohort. The pros are correct handling of different treatment times, exclusion of "already treated" units as controls for "just treated" ones, and control of seasonality through fixed effects; the cons are complexity of interpretation, the need to verify parallel trends, and sensitivity to outliers in small cohorts.
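The parallel-trends check from the third option can be sketched as an event study on a numpy toy panel (adoption date, magnitudes, and the -0.08 effect are all invented): one dummy per event time relative to adoption, omitting the period just before treatment as the reference; flat pre-treatment coefficients support the identifying assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, T, g = 300, 10, 5            # all treated units adopt 2FA at period 5
is_treated = rng.integers(0, 2, n_units).astype(bool)
unit = np.repeat(np.arange(n_units), T)
period = np.tile(np.arange(T), n_units)
rel = period - g                      # event time relative to adoption
treated_obs = is_treated[unit]

true_att = -0.08
y = (0.5 + 0.02 * period                       # common trend / seasonality
     + rng.normal(0, 0.1, n_units)[unit]      # user fixed effects
     + np.where(treated_obs & (rel >= 0), true_att, 0.0)
     + rng.normal(0, 0.02, n_units * T))

# one dummy per event time, omitting rel = -1 as the reference period
event_times = [k for k in range(-g, T - g) if k != -1]
D = np.column_stack([(treated_obs & (rel == k)).astype(float) for k in event_times])
X = np.column_stack([D, np.eye(n_units)[unit], np.eye(T)[period][:, 1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
coefs = dict(zip(event_times, beta[:len(event_times)]))
```

Here the pre-treatment coefficients (`coefs[k]` for k < 0) hover near zero and the post-treatment ones near the simulated -0.08; in a real analysis, the pre-period pattern is what validates or falsifies the design.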

The third solution was chosen, as the first two produced either an overly optimistic (self-selection) or a catastrophically pessimistic (seasonality) picture. Analysis showed that the true causal effect on 30-day retention was -8% (not -25%), partially offset by a +20% increase in average transaction value due to increased trust in secured accounts. As a result, the product team retained mandatory 2FA but added a "Trusted device for 30 days" option, which reduced friction and returned retention to baseline within 60 days while maintaining a 60% reduction in fraudulent transactions.

What candidates often overlook

Why can the standard two-way fixed effects (TWFE) estimator in a linear regression with user and time fixed effects produce biased or even opposite-sign estimates in a staggered 2FA rollout design, and which modern estimator should be used instead?

In the standard TWFE approach, users who have already received treatment (2FA) in an early cohort are automatically used as controls for users from later cohorts who have not yet been treated. If the effect of 2FA changes over time (e.g., users adapt and friction decreases) or varies between cohorts (early adopters vs. late), previously treated units serve as "bad" counterfactuals, producing the well-known "negative weights" problem and biased estimates. Instead of TWFE, the Callaway-Sant'Anna estimator should be applied: it calculates the average treatment effect (ATT) separately for each group and time period, using only never-treated or not-yet-treated units as controls and excluding already-treated units from the control pool, which ensures correct identification. For a beginner, imagine evaluating a new rule for a class that received it in October by using as control a class that already received it in September. If by October the September class has already adapted while the October class is just experiencing the shock, you will get a distorted picture; modern methods compare only against classes that have not yet received the rule.
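A toy numpy simulation of the "bad counterfactual" problem (cohort dates, the effect path, and all magnitudes are invented): when the effect of 2FA fades as users adapt, a hand-rolled Callaway-Sant'Anna-style estimate using only never-treated units as controls recovers the average group-time ATT, while the single TWFE coefficient mixes comparisons, including already-treated units serving as controls, and need not equal it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, T = 300, 8
cohorts = np.repeat([2, 5, 0], n_per_group)   # first treated at t=2, t=5, never (0)
n = len(cohorts)
unit = np.repeat(np.arange(n), T)
period = np.tile(np.arange(T), n)
g = cohorts[unit]
treated = (g > 0) & (period >= g)

# dynamic effect: starts at -0.10 and fades by 0.02 per period as users adapt
event_time = np.where(treated, period - g, 0)
effect = np.where(treated, -0.10 + 0.02 * event_time, 0.0)
y = 0.5 + 0.03 * period + effect + rng.normal(0, 0.02, n * T)

# naive TWFE: one treatment dummy plus unit and time fixed effects
X = np.column_stack([treated.astype(float),
                     np.eye(n)[unit],
                     np.eye(T)[period][:, 1:]])
twfe_att = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Callaway-Sant'Anna-style ATT(g, t): DiD of cohort g against the never-treated,
# with period g-1 as baseline, then averaged over cohorts and post periods
def att_gt(gg, t):
    mean_y = lambda mask, p: y[(period == p) & mask].mean()
    tr, ctrl = cohorts[unit] == gg, cohorts[unit] == 0
    return (mean_y(tr, t) - mean_y(tr, gg - 1)) - (mean_y(ctrl, t) - mean_y(ctrl, gg - 1))

atts = [att_gt(gg, t) for gg in (2, 5) for t in range(gg, T)]
cs_att = float(np.mean(atts))   # ~ -0.06: the true average of the simulated ATT(g, t)
```

Under this effect path the nine true group-time effects average to -0.06, which `cs_att` recovers; `twfe_att` will generally drift away from that average because of its implicit (possibly negative) weighting of the same comparisons.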

How to properly handle the situation of "contamination" or "leakage" of treatment when users subjected to mandatory 2FA on a mobile device start actively using the web version of the app (where 2FA has not yet been implemented) to bypass restrictions, and why does simply excluding such users from the sample create bias?

Simply excluding "switchers" creates truncation/selection bias: the users remaining in the sample are those who are either less motivated to avoid friction or less technically adept, which distorts the estimate of the effect on the target population. The correct approach is Intent-to-Treat (ITT) analysis, where all users are analyzed in the group to which they were originally assigned (the mobile app with 2FA), regardless of actual behavior (switching to the web). To estimate the effect of the mechanism itself (Treatment-on-the-Treated, TOT), Two-Stage Least Squares (2SLS) is employed, with actual use of 2FA instrumented by cohort assignment, which purges the estimate of non-compliance bias. For a beginner: this is similar to a clinical trial in which patients in the medication group stop taking it. If you remove them, you lose the information that the medication "pushes away" a certain type of patient, and you will overestimate effectiveness. ITT analyzes "assignment" rather than "actual intake", preserving randomization.
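A numpy sketch of these estimators on simulated non-compliance (the 30% switcher share, effect size, and retention levels are invented): the per-protocol contrast that drops switchers overstates the harm, ITT gives the diluted assignment effect, and the Wald ratio (2SLS with a single binary instrument) recovers the effect among compliers.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.integers(0, 2, n)              # assignment: cohort with mandatory 2FA
savvy = rng.random(n) < 0.3            # tech-savvy users dodge 2FA via the web client
d = (z == 1) & ~savvy                  # actual exposure to 2FA (non-compliance!)
true_effect = -0.08                    # assumed effect of actually facing 2FA
# savviness confounds exposure and outcome: savvy users retain better anyway
y = 0.60 + 0.10 * savvy + true_effect * d + rng.normal(0, 0.05, n)

per_protocol = y[d].mean() - y[~d].mean()     # drops switchers: overstates the harm
itt = y[z == 1].mean() - y[z == 0].mean()     # diluted but unbiased assignment effect
compliance = d[z == 1].mean() - d[z == 0].mean()
late = itt / compliance                       # Wald ratio = 2SLS, one instrument
```

With these numbers `per_protocol` is roughly -0.13, `itt` about -0.056 (the -0.08 effect diluted by 70% compliance), and `late` recovers the simulated -0.08.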

How to distinguish the net effect of friction (the necessity to enter a code) from the "signaling" or "signposting" effect (the sense of heightened security created by the mere fact of having 2FA), and why is it important to conduct mediation analysis when assessing the impact on monetization?

Separation matters because these effects push behavior in opposite directions: friction reduces conversion and login frequency, while the security signal increases willingness to make large transactions and trust in the platform. To separate them, Causal Mediation Analysis is used (e.g., the Imai-Keele-Tingley approach), in which the total effect is decomposed into a direct component (friction) and an indirect one acting through perceived security (the mediator). Alternatively, a placebo group is created that receives a banner about "enhanced security" and a 2FA icon but no actual requirement to enter a code; comparing [Full 2FA] vs [Banner without 2FA] vs [Control] isolates the components. If an increase in average transaction value is observed in the placebo group as well, the signaling effect predominates; if it appears only in the full group, the effect comes from the authentication process itself. For a beginner: imagine a security guard appears at a restaurant's door. People might spend more because they feel safe (signal), but some might not enter because they don't want to be screened (friction). To decide whether to keep the guard, you must separate these effects; otherwise you won't know whether to hire a friendlier guard or whether a "Guarded" sign alone is enough.
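The three-arm placebo design can be sketched directly (a numpy toy; the 15% signal lift and 5% friction drag on average transaction value are invented numbers): banner-minus-control isolates the signaling component, full-minus-banner isolates the pure friction component.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60_000
arm = rng.integers(0, 3, n)     # 0 = control, 1 = banner only (signal), 2 = full 2FA
signal_lift, friction_drag = 0.15, -0.05      # multiplicative effects, invented
base = rng.normal(100.0, 20.0, n)             # baseline transaction value
y = base * (1 + signal_lift * (arm >= 1) + friction_drag * (arm == 2))

signal_est = y[arm == 1].mean() - y[arm == 0].mean()    # banner vs control: signal
friction_est = y[arm == 2].mean() - y[arm == 1].mean()  # full vs banner: friction
```

If `signal_est` dominates, a cheaper "Guarded" sign may capture most of the monetization upside; a large negative `friction_est` argues for softening the code-entry step itself.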