Product Analytics (IT): Product Analyst / Data Analyst

What method should be used to assess the causal effect of the complete removal of a legacy feature on user retention and engagement, when the disconnection occurs gradually across cohorts, there is heterogeneity in usage patterns between segments, and the observed decrease in activity may be due to both the negative effect of removal and the natural churn of inactive users?


Answer to the question

Historically, product teams focused on growth metrics and shipping new features; with the saturation of digital products and the accumulation of technical debt, however, justified feature deprecation has become critically important. The difficulty is that users who actively used the removed feature differ systematically from the rest of the audience in engagement and loyalty, creating selection bias, while the phased disconnection across cohorts distorts time series that already carry seasonality and natural churn.

To isolate the true causal effect, use Difference-in-Differences (DiD) with cohort analysis or CausalImpact based on Bayesian Structural Time Series, treating unaffected cohorts as a synthetic control. A key step is propensity score matching (PSM) within each cohort: each user who lost the feature (treatment) is paired with a user who never used it (control) but has a similar activity profile, tenure, and conversion history. If there is a clear threshold of feature-usage intensity (e.g., >5 uses per month), Regression Discontinuity Design (RDD) is effective, comparing users just on either side of the disconnection threshold.
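The core DiD logic can be sketched on synthetic data. Everything below is invented for illustration: the cohort sizes, the retention levels, the shared downward drift, and the built-in true effect of −0.04.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weekly retention rates: a cohort that lost the feature
# (treatment) vs. a matched unaffected cohort (control).
# The true causal effect of removal built into the data is -0.04.
n = 5000
pre_ctrl  = rng.normal(0.60, 0.05, n)                # control, before removal
post_ctrl = rng.normal(0.58, 0.05, n)                # control drifts down (seasonality/churn)
pre_trt   = rng.normal(0.65, 0.05, n)                # treated users are more engaged (selection)
post_trt  = rng.normal(0.65 - 0.02 - 0.04, 0.05, n)  # same drift plus the true -0.04 effect

# A naive before/after comparison on the treated group conflates the
# shared drift with the removal effect:
naive = post_trt.mean() - pre_trt.mean()

# DiD subtracts the control group's change, removing the shared time trend:
did = (post_trt.mean() - pre_trt.mean()) - (post_ctrl.mean() - pre_ctrl.mean())

print(f"naive before/after: {naive:+.3f}")
print(f"DiD estimate:       {did:+.3f}  (true effect: -0.040)")
```

The naive estimate absorbs the natural decline that would have happened anyway; DiD recovers only the removal effect, which is exactly why the unaffected cohorts matter.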

It is also important to control for survivorship bias: if the feature is being removed because of low usage, the analysis should include only users who were still active when the decision was made, excluding those who had churned before the observation window began. For long-term effects, staggered DiD with dynamic effects (an event study) is applied: it tracks changes in three-day and seven-day retention relative to the moment of disconnection and tests the Parallel Trends Assumption through placebo estimates on pre-treatment periods.
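A minimal event-study sketch under a staggered rollout, with the pre-treatment periods serving as the placebo test. The cohorts, periods, trend, and the −0.05 effect are all invented; real data would replace the simulated `outcome` function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical staggered rollout: three cohorts lose the feature at
# different periods; a never-treated group provides the control baseline.
periods = range(8)
treat_times = [3, 4, 5]        # period at which each cohort is disconnected
true_effect = -0.05            # effect after removal, built into the data

def outcome(t, treated_at):
    base = 0.6 - 0.01 * t      # shared downward trend (churn/seasonality)
    eff = true_effect if (treated_at is not None and t >= treated_at) else 0.0
    return base + eff + rng.normal(0, 0.002)

# Never-treated control mean per calendar period
ctrl = {t: np.mean([outcome(t, None) for _ in range(500)]) for t in periods}

# Event-study coefficients by relative time k (k < 0 are placebo pre-periods)
rel = {}
for t0 in treat_times:
    for t in periods:
        k = t - t0
        y = np.mean([outcome(t, t0) for _ in range(500)])
        rel.setdefault(k, []).append(y - ctrl[t])

coefs = {k: np.mean(v) for k, v in sorted(rel.items())}
base = coefs[-1]               # normalize to the period just before removal
for k, v in coefs.items():
    print(f"k={k:+d}: {v - base:+.4f}")
```

Pre-period coefficients close to zero support parallel trends; post-period coefficients settle near the true effect, and their evolution over k shows whether the damage fades or compounds.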

Real-life situation

In a large edtech product, a decision was made to remove an outdated text chat with a mentor in favor of video consultations since the chat was used by less than 3% of the audience, but its support accounted for 20% of the team's resources. The release was planned in phases: first, disconnection for new users, then for cohorts with low activity, and finally for power users. The business feared that the removal would trigger a wave of negativity and churn among high-value users who had historically used the chat intensively to clarify assignments.

The first option considered was a simple before/after comparison of retention for each cohort. Its advantages were quick implementation and clarity for stakeholders, but it could not separate the effect of removal from the natural aging of the cohort and from seasonal swings in student activity during the summer, when the final phase of disconnection was planned. The second option was a classic A/B test with a feature flag hiding the chat from 50% of users; it was rejected because of the technical cost of supporting two versions of the UI and for ethical reasons: the team could not promise chat support to some users while denying it to others whenever bugs arose.

The third option, ultimately selected, was Difference-in-Differences with a synthetic control. For each cohort losing access to the chat, analysts used Propensity Score Matching to find, in the previous cohort, users who had never opened the chat but had an identical lesson-viewing pattern, homework-submission history, and geography. This made it possible to compare the retention trajectories of the treatment group (who lost the chat) and the control group (who had never used it), isolating the pure effect of losing the feature from overall trends.
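The matching step can be sketched as nearest-neighbour matching on a single activity score. This is a simplification: a full PSM pipeline would fit a logistic model of treatment on the covariates (lesson views, submissions, geography) and match on the predicted propensity. All numbers here are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical scores: treated users (lost the chat) skew more active
# than the pool of controls (never opened the chat) -- the selection bias.
n_t, n_c = 200, 1000
treated_score = rng.normal(0.7, 0.1, n_t)
control_score = rng.normal(0.5, 0.2, n_c)

# Nearest-neighbour match (with replacement) via binary search on the
# sorted control scores.
sorted_ctrl = np.sort(control_score)
idx = np.searchsorted(sorted_ctrl, treated_score)
idx = np.clip(idx, 1, n_c - 1)
left, right = sorted_ctrl[idx - 1], sorted_ctrl[idx]
nearest = np.where(np.abs(treated_score - left) <= np.abs(treated_score - right),
                   idx - 1, idx)
matched = sorted_ctrl[nearest]

# Matching should close the covariate gap between the groups:
gap_before = treated_score.mean() - control_score.mean()
gap_after = treated_score.mean() - matched.mean()
print(f"gap before matching: {gap_before:+.3f}")
print(f"gap after matching:  {gap_after:+.3f}")
```

In practice the balance check is repeated for every covariate (standardized mean differences), and treated users with no acceptable match are reported rather than silently forced into a pair.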

The final result showed that for power users (top 10% by frequency of chat usage), the removal indeed decreased 30-day retention by 8%, but this was compensated by a 15% increase in conversion to video consultations and improvements in app performance metrics (a 12% reduction in crash rate due to the removal of legacy code). For the average segment, the effect was statistically insignificant, which allowed the business to justify the complete disconnection of the feature, focusing on migrating power users to the new communication channel through personalized offers.

What candidates often overlook

How to distinguish the effect of removing a feature from the simplification effect, when reduced cognitive load may mask the negative impact of the lost functionality?

The answer lies in metric decomposition: track not only retention but also task completion time, error rate, and feature discovery rate for the remaining functionality. If, after removing the chat, time-to-homework-submission decreases (users submit work faster) while retention stays flat, this indicates a positive simplification effect compensating for the loss of the communication channel. For a quantitative assessment, mediation analysis is used: it estimates the direct causal link "removal → retention" and the indirect one through "removal → UI simplification → retention," separating the pure negative effect from the structural improvement in UX.
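A Baron-Kenny-style linear mediation sketch on simulated data. The path coefficients (direct effect −0.06, mediator paths 0.8 and 0.05) are invented; with real data, `removal`, `simplification`, and `retention` would be observed columns, and this linear decomposition is only valid under the usual no-unmeasured-confounding assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated truth: removal hurts retention directly (-0.06) but also
# simplifies the UI (a = 0.8), and simplification helps retention (b = 0.05),
# so the indirect path contributes a*b = +0.04.
n = 20000
removal = rng.integers(0, 2, n).astype(float)
simplification = 0.8 * removal + rng.normal(0, 0.1, n)
retention = -0.06 * removal + 0.05 * simplification + rng.normal(0, 0.1, n)

# Stage 1: removal -> mediator (coefficient a)
X1 = np.column_stack([np.ones(n), removal])
a = np.linalg.lstsq(X1, simplification, rcond=None)[0][1]

# Stage 2: retention on removal and mediator (direct effect, coefficient b)
X2 = np.column_stack([np.ones(n), removal, simplification])
coef = np.linalg.lstsq(X2, retention, rcond=None)[0]
direct, b = coef[1], coef[2]

indirect = a * b
total = direct + indirect
print(f"direct: {direct:+.3f}, indirect: {indirect:+.3f}, total: {total:+.3f}")
```

The decomposition makes the argument in the text concrete: a near-zero total effect can hide a real negative direct effect that simplification happens to offset.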

How to correctly calculate statistical power for a non-inferiority test when removing a feature, where the goal is to prove that the damage does not exceed an acceptable threshold?

Candidates often apply classic power calculations for superiority testing, leading to unfounded conclusions about the "safety" of removal. In non-inferiority testing, the null hypothesis is that the effect is worse than the threshold, and power depends on the margin of indifference (δ), which the business must fix in advance (e.g., −2% to retention). The power formula requires specifying the expected true effect (usually 0 or a small positive value) and the variance; as the expected effect approaches δ, the required sample grows quadratically, since the (δ − d)² term in the denominator shrinks. Specialized power calculators for paired proportions should be used, with adjustments for clustering by cohorts, since users within one cohort share the timing of disconnection.

How to account for spillover effects when the removal of a feature from one user impacts the behavior of others through the disruption of communication ties?

In social products or B2B SaaS, the removal of a feature from one actor (e.g., disabling an old API for an admin) affects the experience of end users (employees), creating interference between treatment and control. To isolate this effect, cluster-based randomization or analysis via exposure mapping is used: instead of individual treatment status, the share of users in the social graph (team, family) who have lost the feature is employed. If the correlation between the individual fact of disconnection and the share of disconnections in the cluster is high (>0.8), classic OLS gives biased estimates. The solution is to use IV regression (instrumental variables), where the fact of belonging to the disconnection cohort serves as the instrument, while the actual loss of the feature is the endogenous variable, or to apply causal inference methods for interference, such as Fisher's randomization test with correction for cluster size.
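The IV step can be sketched as two-stage least squares (2SLS) on simulated data: cohort assignment is the instrument, actual loss of the feature is the endogenous treatment, and an unobserved engagement confounder drives both the treatment and retention. All coefficients, including the true −0.05 effect, are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Z: assigned to a disconnection cohort (the instrument).
# U: unobserved engagement -- engaged users keep the feature longer
#    AND retain better, which biases naive OLS.
# D: actually lost the feature (endogenous). Y: retention.
n = 50000
Z = rng.integers(0, 2, n).astype(float)
U = rng.normal(0, 1, n)
D = ((0.6 * Z - 0.5 * U + rng.normal(0, 1, n)) > 0).astype(float)
Y = -0.05 * D + 0.10 * U + rng.normal(0, 0.1, n)   # true effect: -0.05

# Naive OLS of Y on D is biased because U drives both D and Y:
X = np.column_stack([np.ones(n), D])
ols = np.linalg.lstsq(X, Y, rcond=None)[0][1]

# 2SLS: first stage predicts D from Z; second stage regresses Y
# on the predicted treatment, which is clean of U by construction.
Zmat = np.column_stack([np.ones(n), Z])
d_hat = Zmat @ np.linalg.lstsq(Zmat, D, rcond=None)[0]
iv = np.linalg.lstsq(np.column_stack([np.ones(n), d_hat]), Y, rcond=None)[0][1]

print(f"OLS:  {ols:+.3f}  (biased away from the truth)")
print(f"2SLS: {iv:+.3f}  (true effect: -0.050)")
```

The instrument works because cohort assignment shifts who loses the feature but, unlike actual loss, is unrelated to the engagement confounder; the same logic extends to cluster-level exposure shares in the spillover setting.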