Answer to the question
Historically, customer support has evolved from relying exclusively on human operators to automation through rule-based chatbots, which often frustrated users with their rigid scripts. The modern stage is characterized by the adoption of Large Language Models (LLMs) such as GPT-4 or Claude, capable of holding contextual dialogues and solving complex tasks without hard-coded logic. Evaluating the effectiveness of such systems is complicated by the fact that traditional metrics (resolution time, cost-per-ticket) correlate with service quality non-linearly: cutting costs can depress CSAT, and increased automation may heighten frustration during unsuccessful escalations.
The task requires isolating the pure effect of the AI assistant from seasonality (holiday sales change the profile of inquiries), the novelty effect (users experiment with the bot more actively in the first weeks), and the endogeneity of self-selection (simple requests go to the bot, complex ones straight to humans). Classical randomization is impossible: turning off support for a control group during peak hours creates ethical and business risks, and escalation of a dialogue from bot to human contaminates the treatment effect.
The optimal solution is Regression Discontinuity Design (RDD) on a queue-length threshold. When the number of waiting users exceeds a threshold N (for example, 5 people), the system automatically offers the AI assistant as an alternative to waiting for an operator. This creates a natural experiment: users just on either side of the threshold are statistically identical in both observed and unobserved characteristics. To account for the model's learning effect, Difference-in-Differences is applied with a proxy group: for example, night-time users, for whom the bot operates continuously, are compared to a similar time window before the rollout. To analyze effect heterogeneity (different impacts across inquiry categories), Causal Forests are used to estimate conditional average treatment effects (CATE).
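The RDD idea can be sketched on simulated data: fit a local linear regression on each side of the queue-length cutoff and read the treatment effect off the jump at the threshold. The threshold value, effect sizes, and bandwidth below are illustrative assumptions, not production data.

```python
import numpy as np

# Sharp RDD sketch (simulated data): the running variable is the queue
# length at arrival; users above the threshold N are offered the bot.
rng = np.random.default_rng(0)
N = 5                                   # queue-length threshold (assumed)
queue = rng.integers(1, 10, size=4000).astype(float)
treated = (queue > N).astype(float)     # offered the AI assistant
# Simulated outcome: resolution time drops by ~3 min when the bot is offered
resolution_min = 8.0 + 0.2 * (queue - N) - 3.0 * treated + rng.normal(0, 1, 4000)

# Local linear regression within a bandwidth around the cutoff,
# with separate intercepts and slopes on each side
h = 3.0
mask = np.abs(queue - N) <= h
x = queue[mask] - N
d = treated[mask]
X = np.column_stack([np.ones(x.size), d, x, d * x])
beta, *_ = np.linalg.lstsq(X, resolution_min[mask], rcond=None)
print(f"Estimated jump at the threshold: {beta[1]:.2f} min")
```

The coefficient on the treatment indicator (beta[1]) is the discontinuity at the cutoff, i.e. the local causal effect of offering the bot for users near the threshold.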
Real-life situation
In a large e-commerce project with 500K inquiries per month, the team decided to implement an LLM assistant to handle inquiries like "where is my order" and "change delivery address." The problem was that the pilot coincided with the pre-New Year season when traffic increased threefold, and historical data showed a seasonal decline in CSAT due to logistics delays regardless of support quality.
The first option considered was a direct comparison of metrics one month before and one month after implementation. Pros: simplicity of implementation, no changes in infrastructure required. Cons: complete lack of control for seasonality, impossible to separate the effect of AI from the effect of increased overall traffic and changes in the assortment (New Year products have a different return profile). This approach was immediately rejected.
The second option was a geo-split A/B test, where the bot was enabled in some regions and not in others. Pros: clean randomization, straightforward interpretation. Cons: network effects (a user might live in region A but place an order in region B for a friend), different logistical infrastructures affect the nature of inquiries, and during peak hours, overload in one region would create the risk of losing customers. An alternative was sought instead.
The chosen solution was RDD with a queue-length threshold of 3 people. When the queue exceeded 3 waiting users, the system offered the AI assistant with the option to stay in the queue for an operator. To adjust for the escalation effect, Intent-to-Treat (ITT) analysis was used: comparing everyone who was offered the bot regardless of actual usage, which avoids self-selection bias driven by technical literacy. Additionally, a Synthetic Control was built from historical data on similar inquiry categories where the bot was not used (for example, complex complaints) to filter out seasonal fluctuations.
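A minimal ITT sketch on simulated data (the uptake rate and effect sizes are invented for illustration): the comparison is by assignment, offered vs. not offered, so non-compliers dilute the estimate but do not bias it through self-selection.

```python
import numpy as np

# Intent-to-Treat sketch (simulated). Everyone offered the bot counts as
# "treated" whether or not they accepted it, sidestepping self-selection
# on technical literacy.
rng = np.random.default_rng(1)
n = 10_000
offered = rng.integers(0, 2, n)                 # queue crossed the threshold
accepted = offered * (rng.random(n) < 0.6)      # only ~60% actually use the bot
# True effect of *using* the bot: -6 min on an 8-minute baseline
resolution = 8.0 - 6.0 * accepted + rng.normal(0, 2, n)

itt = resolution[offered == 1].mean() - resolution[offered == 0].mean()
print(f"ITT estimate: {itt:.2f} min")
```

Note that the ITT estimate is roughly the usage effect times the compliance rate, which is exactly what matters for the business decision: the effect of rolling out the offer, not of forcing usage.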
The final result: the AI assistant reduced the average resolution time for simple inquiries from 8 to 2 minutes without a statistically significant drop in CSAT (the 0.1-point difference fell within the confidence interval). However, a negative effect was discovered in the “returns” segment: after escalation from the bot to a human, CSAT was 15% lower than for a direct approach to an operator, which led to a separate fast-track route for such inquiries. Operational costs fell by 30% thanks to offloading the first line.
What candidates often miss
How to correctly handle the endogeneity of escalation when a user, disappointed with the bot, turns to a human with increased frustration?
Candidates often suggest comparing only successful bot dialogues against human dialogues, ignoring survivorship bias. The correct approach is to estimate the Local Average Treatment Effect (LATE) via instrumental variables: random technical outages of the bot (when it is temporarily unavailable) serve as an instrument, identifying the effect specifically for users who would have been served by the bot had it been available. This separates the effect of the technology itself from selection on inquiry type.
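A sketch of this instrumented comparison on simulated data (the outage rate, preference share, and effect sizes are invented for illustration): the naive bot-vs-human difference is confounded because users who prefer the bot tend to have easier tickets, while the Wald/IV estimator based on random outages recovers the effect for compliers.

```python
import numpy as np

# IV sketch: random bot outages as an instrument for actual bot usage.
# The Wald estimator recovers LATE for "compliers" -- users who would
# use the bot whenever it is up. All quantities are simulated assumptions.
rng = np.random.default_rng(2)
n = 20_000
bot_up = (rng.random(n) < 0.9).astype(float)   # instrument: no outage
wants_bot = rng.random(n) < 0.5                # latent preference (confounded)
used_bot = (bot_up * wants_bot).astype(float)
# Bot-preferring users have easier tickets, so naive comparison is biased;
# the true causal effect of usage on CSAT is +0.2
csat = 4.0 + 0.3 * wants_bot + 0.2 * used_bot + rng.normal(0, 0.5, n)

naive = csat[used_bot == 1].mean() - csat[used_bot == 0].mean()
wald = ((csat[bot_up == 1].mean() - csat[bot_up == 0].mean())
        / (used_bot[bot_up == 1].mean() - used_bot[bot_up == 0].mean()))
print(f"Naive difference: {naive:.2f}, IV (LATE): {wald:.2f}")
```

The naive difference absorbs the confounding from ticket difficulty; the IV estimate does not, because outages are (by assumption) independent of inquiry type.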
Why are standard bot accuracy metrics (F1-score, BLEU) incorrect for product evaluation of causal impact?
Analysts often focus on the quality of response generation, forgetting that the product goal is to move business metrics, not to achieve technical perfection. An LLM can generate correct but irrelevant responses or, conversely, technically inaccurate but effective instructions for solving a user's problem (for example, "try restarting the application"). The correct approach is to estimate uplift at the user-session level, using Propensity Score Matching to match sessions on inquiry complexity rather than on the accuracy of text generation.
How to account for the non-stationarity of the effect when continuously retraining the model on new data?
Candidates overlook that an LLM in production undergoes continual learning: the model is retrained daily on labeled dialogues, so the effect in week 1 is not comparable to the effect in week 4. It is necessary to use time-varying treatment effect models with rolling-window estimation, or Bayesian Structural Time Series (BSTS) for dynamic adjustment of the baseline. Ignoring this leads to underestimating the long-term effect as the bot "learns" the product's specifics, or to overestimating it due to the novelty effect.
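A rolling-window sketch on simulated data (the linear "learning curve" and effect sizes are illustrative assumptions): re-estimating the ITT over sliding windows exposes drift that a single pooled estimate would average away.

```python
import numpy as np

# Rolling-window treatment-effect sketch (simulated): the bot improves as
# it is retrained, so the week-1 and week-8 effects differ substantially.
rng = np.random.default_rng(4)
days = np.repeat(np.arange(56), 200)        # 8 weeks, 200 sessions per day
offered = rng.integers(0, 2, days.size)
effect = -1.0 - 3.0 * days / 55             # effect drifts from -1 to -4 min
resolution = 8.0 + effect * offered + rng.normal(0, 2, days.size)

window = 7                                  # one-week estimation window
itts = []
for start in (0, 49):                       # first week vs. last week
    m = (days >= start) & (days < start + window)
    diff = (resolution[m & (offered == 1)].mean()
            - resolution[m & (offered == 0)].mean())
    itts.append(diff)
    print(f"days {start}-{start + window - 1}: ITT = {diff:.2f} min")
```

A single estimate pooled over all 8 weeks would land between the two window estimates, understating the mature model's effect and overstating the early one.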