The historical context traces back to the evolution of freemium models from static limits (e.g., a fixed 5 GB cloud storage quota) to dynamic, adaptive restrictions driven by machine learning. Classic approaches to evaluating the effectiveness of such interventions face a fundamental endogeneity problem: the system intentionally restricts users with a high predicted propensity to convert, creating strong selection bias. Early correlational analyses produced biased estimates because they ignored confounding by indication, overestimating the effect by 200-300%.
The problem requires measuring Local Average Treatment Effect (LATE) under conditions where assignment of limits correlates with latent user motivation. The model predicts the probability of conversion $P(conv|X)$ and assigns a limit when $P > \tau$, making groups incomparable in terms of observed and unobserved characteristics. Direct comparison of users with a limit versus those without leads to overestimation, as the treated group is inherently more "motivated" and willing to pay.
A detailed solution is based on Regression Discontinuity Design (RDD) at the scoring model's threshold $\tau$. Within a bandwidth $h$ around the threshold, the assignment of limits is quasi-random: users with $P = \tau - \epsilon$ and $P = \tau + \epsilon$ are statistically indistinguishable. A local regression of the outcome on the score is fit on each side of $\tau$, and the causal effect is the estimated jump at the cutoff. For finer analysis, a Causal Forest estimates effect heterogeneity, while during a phased rollout Difference-in-Discontinuities controls for temporal trends. Alternatively, Inverse Propensity Weighting (IPW) with a propensity score estimated by a Random Forest can be applied, but this requires the unconfoundedness assumption, which is rarely fully satisfied.
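The jump-at-the-cutoff idea above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the production pipeline: the score distribution, the true +0.12 jump, and the bandwidth `h=0.05` are all invented for the example, and a real analysis would add robust standard errors and data-driven bandwidth selection.

```python
import numpy as np

def rdd_jump(score, outcome, tau, h):
    """Sharp-RDD effect estimate: fit a local linear regression of the
    outcome on (score - tau) separately on each side of the cutoff
    within bandwidth h, and return the difference in intercepts,
    i.e. the estimated jump at tau."""
    left = (score >= tau - h) & (score < tau)
    right = (score >= tau) & (score <= tau + h)
    # np.polyfit(deg=1) returns [slope, intercept]
    _, a_left = np.polyfit(score[left] - tau, outcome[left], 1)
    _, a_right = np.polyfit(score[right] - tau, outcome[right], 1)
    return a_right - a_left

# Synthetic data: smooth trend in the score plus a true +0.12 jump at 0.75.
rng = np.random.default_rng(0)
score = rng.uniform(0.5, 1.0, 20_000)
treated = score >= 0.75
outcome = 0.2 + 0.4 * score + 0.12 * treated + rng.normal(0, 0.1, score.size)

print(round(rdd_jump(score, outcome, tau=0.75, h=0.05), 3))
```

Because the trend term is smooth through the cutoff, the naive treated-vs-untreated mean difference would absorb the trend, while the local-linear jump recovers something close to the simulated +0.12.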
Problem
In a B2B SaaS task management product, a dynamic limit on the number of active projects for free accounts was implemented. The ML model analyzed 50+ behavioral features and blocked creation of new projects for users whose predicted conversion probability exceeded 0.75. The product team observed 40% higher conversion among "limited" users, but could not separate the effect of the limitation from self-selection of motivated users. Moreover, fully disabling limits for a test was impossible, as it would mean losing $200K of MRR per month of the experiment.
Option 1: Naive Comparison with Historical Data
Compare the conversion of current limited users against a cohort from two months before the feature launched. Pros: minimal infrastructure cost, quick assessment with no technical changes. Cons: completely ignores seasonality (the post-New Year activity drop), the underlying upward trend in conversion (the product was maturing), and the novelty effect; the selection bias inflates the estimate by 35-40%.
Option 2: Classic A/B Test with ML Model Turned Off
Randomly disable limit assignment for 15% of users, letting them use the product without restrictions regardless of their score. Pros: the gold standard of causal inference, directly measuring the Average Treatment Effect (ATE). Cons: categorically rejected by the C-level because of the risk of losing "hot" users, who would never receive the conversion trigger in the control group; it also creates significant opportunity costs and an ethical conflict (why is everything allowed for some users but not others?).
Option 3: Regression Discontinuity Design with Hybrid Approach
Use the natural scoring threshold (0.75) as the cutoff, comparing users with predicted conversion probabilities of 0.74 and 0.76 as locally randomized groups (~5,000 users in the ±0.05 window). Supplement with the Synthetic Control Method for regions where the rollout is postponed by a month. Pros: preserves the business logic for 95% of users; yields an unbiased estimate of the local effect (LATE) for "marginal" users; exploits natural variation without harming revenue. Cons: requires a large sample around the threshold (>2,000 observations); the estimate applies only to the subgroup with $P(conv) \approx 0.75$, not the whole population; sensitive to threshold manipulation (a McCrary density test is required).
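The manipulation check mentioned in the cons can be approximated with a crude density-balance test: if no one is gaming the score, observations in a narrow window should split roughly 50/50 across the cutoff. This is a simplified stand-in for the full McCrary test (which fits local polynomials to the density on each side); the data here are synthetic and unmanipulated by construction.

```python
import math
import numpy as np

def density_balance_test(score, tau, h):
    """Crude manipulation check: count observations just below and just
    above the cutoff. Without manipulation of the running variable,
    each observation in the symmetric window lands above tau with
    probability ~0.5; a tiny p-value suggests bunching."""
    n_below = int(((score >= tau - h) & (score < tau)).sum())
    n_above = int(((score >= tau) & (score < tau + h)).sum())
    n = n_below + n_above
    # Normal approximation to the two-sided binomial test of p = 0.5
    z = (n_above - n / 2) / math.sqrt(n / 4)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return n_below, n_above, p_value

rng = np.random.default_rng(1)
score = rng.uniform(0.5, 1.0, 50_000)  # no bunching at the threshold
n_below, n_above, p = density_balance_test(score, tau=0.75, h=0.05)
print(n_below, n_above, round(p, 3))
```

On unmanipulated data the p-value stays large; a user-visible score (or leaked threshold) would show up as an excess of mass just below or above 0.75.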
Chosen Solution and Result
RDD with the optimal bandwidth selected via the Calonico-Cattaneo-Titiunik (CCT) procedure was chosen, supplemented with a Causal Forest to identify subpopulations with negative effects. The analysis revealed that a strict limit yields +12% conversion for "average" users near the threshold, but -8% retention for power users (high engagement, score just below the threshold). Based on this, a hybrid mode was implemented: soft limits (warning only) for power users, hard caps for average users. Final result: an 8% lift in conversion while holding 30-day retention at 96% of baseline, bringing an additional $450K ARR over the quarter without losing key users.
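The hybrid mode can be expressed as a simple assignment rule. The function name and the engagement cutoff below are hypothetical illustrations; only the 0.75 scoring threshold and the soft-limit/hard-cap split come from the case itself.

```python
def limit_policy(conv_score: float, engagement_pct: float) -> str:
    """Illustrative hybrid limit policy (thresholds other than the 0.75
    scoring cutoff are hypothetical).

    Power users (high engagement percentile) get a soft limit, a warning
    only, to protect retention; other users above the scoring cutoff get
    a hard cap, which drives conversion in the 'average' segment."""
    if conv_score < 0.75:
        return "no_limit"
    if engagement_pct >= 0.90:  # assumed power-user definition
        return "soft_limit"
    return "hard_cap"

print(limit_policy(0.80, 0.95))  # engaged user above the cutoff
print(limit_policy(0.80, 0.50))  # average user above the cutoff
```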
How to distinguish the effect of the limitation itself from the "reminder effect" of the paid version?
Candidates often interpret the conversion lift as the result of the financial limitation alone, ignoring that the limit notification itself acts as a marketing touchpoint. To isolate it, either add a control group that receives a "soft" notification (information about premium without blocking the feature) or analyze the time between the limit being shown and conversion: if conversion happens almost immediately (within an hour), it is likely a reminder effect; if it happens 3-7 days later, after several attempts to exceed the limit, it is the actual effect of the restriction. Random technical latency in displaying the notification could also serve as an instrumental variable for reminder intensity, estimated with 2SLS regression.
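The latency-as-instrument idea can be sketched with a hand-rolled 2SLS on simulated data. Everything here is synthetic: the latency distribution, the unobserved "motivation" confounder, and the true +0.10 reminder effect are assumptions chosen to show that the instrument recovers the effect despite confounding.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Instrument: random technical latency (seconds) before the limit
# notification renders; assumed independent of user motivation.
latency = rng.exponential(2.0, n)

# Unobserved motivation confounds both exposure and conversion.
motivation = rng.normal(0, 1, n)

# Endogenous exposure: whether the user actually noticed the reminder.
# Low latency and high motivation both raise the chance of noticing it.
saw_reminder = (rng.normal(0, 1, n) + motivation - 0.5 * latency > -1).astype(float)

# Outcome: true reminder effect is +0.10 on conversion probability.
conversion = (rng.uniform(0, 1, n)
              < 0.2 + 0.10 * saw_reminder + 0.05 * motivation).astype(float)

def two_sls(y, x, z):
    """2SLS with one endogenous regressor and one instrument."""
    Z = np.column_stack([np.ones_like(z), z])
    # Stage 1: project the exposure on the instrument.
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    # Stage 2: regress the outcome on the fitted exposure.
    X = np.column_stack([np.ones_like(x_hat), x_hat])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(round(two_sls(conversion, saw_reminder, latency), 3))
```

A naive comparison of conversion by `saw_reminder` would absorb the motivation confounder; the instrumented estimate lands near the simulated +0.10 because latency affects conversion only through exposure.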
How to account for network effects in team products (Notion, Figma), where the limitation of one user affects the collaboration of colleagues?
In B2B SaaS, limiting one team member creates spillover effects: colleagues may consolidate resources into a single account or migrate to a competitor. Classic RDD ignores these externalities, violating SUTVA (the Stable Unit Treatment Value Assumption). The solution is a cluster-level RDD at the team/workspace level, where treatment is defined as the share of "limited" users on the team, or two-stage least squares (2SLS) with the number of limited neighbors in the network graph as an instrument. It is also important to measure the degree of SUTVA violation via network activity analysis (the network adjacency matrix) across users with different limit statuses, testing for homophily within teams.
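Constructing the cluster-level exposure is the key step. A minimal sketch on synthetic data, assuming team assignments, an individual limit flag, and a team-level churn outcome with a simulated +0.10-per-unit-share spillover; real data would come from the workspace tables.

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_teams = 20_000, 2_000
team = rng.integers(0, n_teams, n_users)         # workspace id per user
limited = rng.uniform(0, 1, n_users) < 0.3       # individual limit flag

# Cluster-level treatment: share of limited users in each workspace.
# Under SUTVA violations this share, not the individual flag, is the
# relevant exposure for team outcomes such as churn or migration.
counts = np.bincount(team, minlength=n_teams)
limited_counts = np.bincount(team, weights=limited.astype(float),
                             minlength=n_teams)
treat_share = limited_counts / np.maximum(counts, 1)

# Simulated team outcome with a true +0.10 spillover per unit of share.
team_churn = 0.05 + 0.10 * treat_share + rng.normal(0, 0.02, n_teams)
slope, _ = np.polyfit(treat_share, team_churn, 1)
print(round(float(slope), 3))
```

A full cluster-RDD would then apply the discontinuity design to these team-level aggregates; the sketch only shows the exposure construction and a team-level regression recovering the simulated spillover.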
How to separate the true effect of restricting a specific function from the shift in use to less valuable functions (substitution bias)?
Users facing a limit on feature A may migrate to feature B (e.g., from tables to text documents), creating an illusion of healthy retention while actually degrading product stickiness and feature-adoption depth. To measure this, analyze the Shannon entropy of feature usage (a diversity measure) or apply compositional data analysis (CoDA). If entropy decreases after the limit is introduced, it indicates cannibalization within the product. The optimal policy should maximize expected LTV, accounting for shifts in usage patterns, not just conversion; this calls for modeling via a Markov Decision Process (MDP) or a contextual bandit whose reward function incorporates feature-adoption depth and engagement velocity, not just the conversion event.
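The entropy diagnostic is straightforward to compute. A small sketch with invented before/after usage counts; the three-feature split is purely illustrative.

```python
import numpy as np

def usage_entropy(feature_counts):
    """Shannon entropy (in bits) of a user's feature-usage distribution.

    A drop in entropy after a limit is introduced signals that usage is
    collapsing onto fewer features, i.e. possible substitution or
    cannibalization rather than genuine engagement."""
    counts = np.asarray(feature_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                      # 0 * log(0) contributes nothing
    return float(-(p * np.log2(p)).sum())

before = [40, 35, 25]   # balanced usage across three features
after = [70, 25, 5]     # usage concentrated after limiting a feature
print(round(usage_entropy(before), 3), round(usage_entropy(after), 3))
```

Entropy for the concentrated "after" profile is markedly lower than for the balanced "before" profile, flagging the substitution pattern even though raw activity counts look similar.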