Historical Context. Traditional edtech platforms have long relied on static learning trajectories with fixed material difficulty for all users. With advances in machine learning and real-time data processing, adaptive systems have emerged that dynamically adjust content to each student's cognitive abilities. Evaluating the effectiveness of such systems, however, runs into a fundamental methodological problem: the same user cannot be shown both the adaptive and the static version of a course for a clean comparison without disrupting the user experience.
Problem Statement. Classic A/B testing is not applicable here in its pure form: the adaptation algorithm operates in real time on streaming interaction data, and pinning a user to a static group breaks the product logic and creates the ethical risk of knowingly providing a suboptimal educational experience. There is also significant endogeneity: users with different initial knowledge levels respond asymmetrically to adaptation (some need simplification, others need harder material), which calls for methods that evaluate heterogeneous treatment effects.
Detailed Solution. The optimal approach combines Regression Discontinuity Design (RDD) at the adaptation activation threshold with Difference-in-Differences (DiD) across cohorts with different rollout dates. First, if the algorithm is activated once a certain error level is reached (for example, an error rate above 30% on recent attempts), Sharp RDD can compare users just below and just above the activation threshold. Second, to assess the long-term effect on retention, the Synthetic Control Method is applied: a weighted combination of users from historical cohorts without access to the adaptive system is constructed to closely mimic the pre-rollout behavior of the current test group. Additionally, Causal Forest or meta-learners are used to quantify effect heterogeneity across segments of initial preparation. Data is aggregated in SQL with window functions to track sessions, and statistical analysis is conducted in Python: the causalml and pymc libraries for Bayesian uncertainty estimation, sklearn for building proxy variables.
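The Sharp RDD step can be illustrated with a minimal sketch on synthetic data (all numbers below are simulated; the 30% threshold is the illustrative value from the text): fit a local linear regression on each side of the activation cutoff, and read the treatment effect off the jump between the two intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: error_rate is the running variable; adaptation
# activates above the (illustrative) 30% threshold.
THRESHOLD = 0.30
n = 50_000
error_rate = rng.uniform(0.0, 0.6, n)
treated = error_rate > THRESHOLD
# Completion probability: smooth downward trend plus a +0.08 jump
# at the threshold (the "true" effect in this toy example).
completed = rng.binomial(1, 0.7 - 0.5 * error_rate + 0.08 * treated)

def sharp_rdd(x, y, cutoff, bandwidth=0.05):
    """Local linear fit on each side of the cutoff; the difference
    of the intercepts at the cutoff estimates the discontinuity."""
    left = (x > cutoff - bandwidth) & (x <= cutoff)
    right = (x > cutoff) & (x < cutoff + bandwidth)
    # Center the running variable so each intercept is the fitted
    # value exactly at the cutoff.
    _, icept_left = np.polyfit(x[left] - cutoff, y[left], 1)
    _, icept_right = np.polyfit(x[right] - cutoff, y[right], 1)
    return icept_right - icept_left

effect = sharp_rdd(error_rate, completed, THRESHOLD)
print(f"estimated jump at the threshold: {effect:.3f}")
```

In practice the bandwidth is chosen data-dependently (e.g. by the Imbens-Kalyanaraman procedure), and one would check that covariates and the density of the running variable are smooth at the cutoff before trusting the estimate.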
At the online programming school "CodeStart," an adaptive track algorithm was implemented that automatically simplified or complicated Python tasks based on the speed of solving previous assignments and on error patterns. The product manager asked whether this raised course completion from the current 45% toward the 60% target, but the analytics team hit a problem: disabling the algorithm for a control group caused mass dropout on the second day of training, rendering the comparison invalid.
Three possible solutions were considered.
Option 1: A classic A/B test with the algorithm fully disabled for 50% of traffic. Pros: easy interpretation of results and direct comparability of metrics between groups. Cons: a high risk of losing users in the control group through frustration with excessive difficulty or, conversely, boredom with overly simple tasks, which creates survivorship bias and violates ethical norms of equal access to quality education.
Option 2: Analysis of historical data from before the rollout (pre-post analysis) without a control group. Pros: no part of the audience is deprived of the improvement, and results can be obtained quickly. Cons: the algorithm's effect cannot be separated from external factors such as seasonality (the start of the academic year), changes in the quality of traffic from advertising channels, and macroeconomic events, making the effect estimate unreliable and subjective.
Option 3: Regression Discontinuity Design at the adaptation activation threshold, combined with instrumental variables. This option was chosen because the algorithm was activated strictly and automatically once the error rate in a module exceeded 25%, creating a natural experiment. Users with 24% and 26% errors were compared — practically identical groups in terms of observed characteristics, but with different adaptation statuses. For the long-term evaluation, a synthetic control was built from last year's cohorts with a similar distribution of initial skills, using Propensity Score Matching.
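The Propensity Score Matching step can be sketched as follows, on synthetic data with hypothetical covariates (baseline skill score and weekly active hours): fit a propensity model for membership in the current cohort, then match each current user to the nearest historical user by propensity score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Hypothetical covariates for the current (treated) cohort and a
# historical donor pool with a somewhat different composition.
n_cur, n_hist = 300, 2000
X_cur = np.column_stack([rng.normal(0.60, 0.10, n_cur),
                         rng.normal(5.0, 1.0, n_cur)])
X_hist = np.column_stack([rng.normal(0.50, 0.15, n_hist),
                          rng.normal(4.5, 1.2, n_hist)])

# Step 1: propensity score = P(user belongs to the current cohort | X).
X = np.vstack([X_cur, X_hist])
z = np.concatenate([np.ones(n_cur), np.zeros(n_hist)])
ps = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
ps_cur, ps_hist = ps[:n_cur], ps[n_cur:]

# Step 2: 1-NN matching with replacement on the propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(ps_hist.reshape(-1, 1))
_, idx = nn.kneighbors(ps_cur.reshape(-1, 1))
matched = X_hist[idx.ravel()]

# Covariate balance check: matched historical means should sit closer
# to the current cohort's means than the raw historical means do.
print("current cohort means:    ", X_cur.mean(axis=0).round(3))
print("raw historical means:    ", X_hist.mean(axis=0).round(3))
print("matched historical means:", matched.mean(axis=0).round(3))
```

The matched historical users then serve as the donor pool for the synthetic control; outcome trajectories of the matched group proxy what the current cohort would have looked like without adaptation.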
The final result showed that the adaptive algorithm increases course completion by 8 percentage points (from 45% to 53%) for users with an average initial preparation level, but has a negative effect (about -5 percentage points) for advanced students, for whom the system mistakenly simplified material because of their atypical solving patterns. Based on these findings, a corrective factor for the difficulty threshold was introduced for advanced users, bringing overall conversion to 58%.
How to handle the situation when the adaptation algorithm is constantly learning (online learning), and its predictions change over time, making static effect evaluation invalid?
Answer. Use Thompson sampling or contextual bandits as part of the experimental design from the implementation stage onward. Instead of a fixed treatment, a probability distribution over the effect is modeled and updated with each new observation. For evaluation, off-policy evaluation methods such as inverse propensity weighting (IPW) or doubly robust estimators are applied to correct the bias that arises because the algorithm's policy changed while the historical data were being collected. It is critical to log the model version and its parameters for every decision made, in ClickHouse or a similar storage system, so that the analysis can later be stratified by algorithm version and account for its evolution.
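A minimal IPW sketch on a simulated decision log. It assumes the pipeline stored, for every decision, the probability the then-current policy assigned to the action it took — exactly the logging discipline recommended above; all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated log: action taken, the logging policy's probability of
# that action, and the observed reward (e.g. module completion).
n = 10_000
logged_p = rng.uniform(0.2, 0.8, n)            # P(adapt | logging policy)
action = rng.binomial(1, logged_p)             # 1 = adapted content shown
reward = rng.binomial(1, 0.40 + 0.10 * action) # true value of "adapt" = 0.5

# Target policy to evaluate offline: "always adapt" (deterministic),
# so it assigns probability 1 to action 1 and 0 to action 0.
target_p = (action == 1).astype(float)

# IPW: reweight logged rewards by the ratio of policy probabilities.
ipw_value = np.mean(target_p / logged_p * reward)
print(f"IPW estimate of the 'always adapt' policy value: {ipw_value:.3f}")
```

Doubly robust estimators add a learned reward model to this weighting, which keeps the estimate consistent if either the propensities or the reward model are correct, and typically reduces variance when logged propensities are small.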
Why does standard mean comparison (t-test) between groups with the algorithm enabled and disabled provide a biased estimate even with randomization, and how can this be fixed?
Answer. The problem lies in spillover effects and violation of SUTVA (the Stable Unit Treatment Value Assumption). If users interact with each other through forums, group projects, or chats, the treatment "leaks" into the control group via social learning and shared experience. Corrections include cluster randomization (randomizing at the class/stream level rather than the individual level) or exposure mapping — modeling each user's probability of contact with the adaptive version of the course. Alternatively, two-stage least squares (2SLS) with an instrumental variable (e.g., the error threshold that activates adaptation) can be used to isolate the local average treatment effect (LATE).
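Cluster randomization can be sketched on simulated data as follows: treatment is assigned to whole streams, and the effect is estimated from cluster-level means, which keeps inference valid even when students within a stream influence one another.

```python
import numpy as np

rng = np.random.default_rng(3)

# 40 streams (classes) of 40-60 students; whole streams are randomized
# so that classmates who chat with each other share treatment status.
n_clusters = 40
cluster_treated = rng.binomial(1, 0.5, n_clusters)
sizes = rng.integers(40, 61, n_clusters)

cluster_means = []
for treated, size in zip(cluster_treated, sizes):
    # Shared cluster-level environment (same teacher, same chat).
    base = 0.45 + rng.normal(0, 0.05)
    p = np.clip(base + 0.08 * treated, 0, 1)   # true effect: +0.08
    cluster_means.append((treated, rng.binomial(1, p, size).mean()))

# Inference at the cluster level: compare means of cluster means,
# so standard errors reflect the number of clusters, not of students.
treat = [m for t, m in cluster_means if t == 1]
ctrl = [m for t, m in cluster_means if t == 0]
ate = np.mean(treat) - np.mean(ctrl)
print(f"cluster-level treatment effect estimate: {ate:.3f}")
```

The price of this design is power: with 40 clusters the effective sample size is far below the raw student count, so the number of streams, not students, drives the minimum detectable effect.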
How can the true adaptation effect be distinguished from a novelty effect, where users engage more simply because the interface has changed rather than because tasks are matched better?
Answer. Analyze cohorts with different rollout dates and track how the effect evolves over time. If engagement metrics return to baseline within 2-3 weeks of first use, this is a classic novelty effect. To separate the two, use segmented regression with a break point (interrupted time series) or compare against a holdout group that sees an interface that looks adaptive but actually serves random or fixed content (a placebo test). It is also vital to analyze not only proxy metrics (time on the platform) but hard metrics (final exam results, outcomes of practical projects), which are less subject to short-term motivational fluctuations and better reflect actual mastery of the material.
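The interrupted-time-series check can be sketched on a simulated weekly engagement series: segmented regression estimates both the level shift at launch and the change in slope afterwards; a jump that is eaten away by a negative post-launch slope is the signature of a novelty effect.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic weekly engagement: 20 weeks before launch, 20 after, with
# a genuine +5 level shift at launch and no post-launch decay.
weeks = np.arange(40)
launch = 20
post = (weeks >= launch).astype(float)
weeks_since = np.where(post == 1, weeks - launch, 0.0)
engagement = 50 + 0.2 * weeks + 5 * post + rng.normal(0, 1, 40)

# Segmented regression: intercept, pre-trend, level shift at the
# break point, and change in slope after the break.
X = np.column_stack([np.ones(40), weeks, post, weeks_since])
beta, *_ = np.linalg.lstsq(X, engagement, rcond=None)
level_shift, slope_change = beta[2], beta[3]
print(f"level shift at launch: {level_shift:.2f}")
print(f"post-launch slope change: {slope_change:.2f}")
# A level shift that is quickly erased by a negative slope_change
# would indicate a novelty effect rather than a lasting improvement.
```

In a real analysis the residuals of such a weekly series are usually autocorrelated, so standard errors should come from Newey-West or a time-series model rather than from plain OLS.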