Product Analytics (IT): Product Analyst

What method should be used to evaluate the causal effect of launching a voice search feature on catalog browsing depth and purchase conversion in a mobile application, given that the rollout proceeds gradually across regions with different language profiles, there is self-selection driven by device capabilities, and the metrics are subject to seasonal demand fluctuations?


Answer to the question

Voice interfaces have evolved from simple command systems into full-fledged transformer-based NLP solutions, yet evaluating them remains non-trivial because technology adoption is heterogeneous. The core problem is that the feature is available only on devices meeting certain technical requirements, which creates systematic selection bias, while the staged geographic rollout violates random assignment. To isolate the true effect, combine Difference-in-Differences with region and time fixed effects, supplement it with the Synthetic Control Method for regions with unique linguistic patterns, and use Instrumental Variables to correct for the endogeneity of feature usage.
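The core of the Difference-in-Differences logic can be shown with a minimal pure-Python sketch. The region names and numbers below are purely illustrative; a production analysis would use a panel regression with full fixed effects rather than this 2x2 comparison:

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    # Effect = change in the treated region minus change in the control region;
    # the control's change absorbs the shared seasonal trend (e.g. the
    # pre-New Year demand surge), so it is differenced out.
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )

# Hypothetical average daily browsing-depth values per region.
moscow_pre, moscow_post = [9.8, 10.1, 10.1], [13.9, 14.0, 14.1]  # got the feature
kazan_pre, kazan_post = [9.9, 10.0, 10.1], [10.9, 11.0, 11.1]    # did not

effect = did_estimate(moscow_pre, moscow_post, kazan_pre, kazan_post)
# Moscow grew by 4.0, Kazan by 1.0, so the estimated feature effect is 3.0,
# even though a naive before-after comparison in Moscow would report 4.0.
```

The key identifying assumption, as the text notes later, is parallel trends: absent the feature, Moscow and Kazan would have moved together.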

Real-life situation

In an electronics marketplace, voice search was launched first in Moscow and St. Petersburg, with a gradual rollout to other regions planned. The catch was that the feature worked only on iPhone XS or newer running iOS 15+, and on flagship Android devices with on-device ML support, which skewed the eligible audience toward higher income and greater technological awareness. On top of that, there was clear seasonality: the launch coincided with the pre-New Year demand surge, distorting any direct before-and-after comparison. The team considered three evaluation approaches.

The first option involved a simple comparison of average metrics in regions with and without the feature over the same time period. The pros of this approach are its simplicity and speed of results. The cons are the critical neglect of systematic differences between regions (Moscow historically shows higher conversion rates) and the inability to separate the feature's effect from seasonal trends. This option was rejected due to a high risk of false positive conclusions.

The second option used Propensity Score Matching to create a control group of users without voice search but with similar device characteristics and behaviors. The pros are the attempt to eliminate bias based on observed characteristics. The cons are the inability to account for unobserved factors (e.g., propensity for early technology adoption) that simultaneously influence both ownership of modern devices and willingness to make purchases. Additionally, matching loses effectiveness in the presence of fixed regional effects.
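The matching step of this second option can be sketched as greedy 1:1 nearest-neighbour matching on propensity scores. This assumes the scores have already been estimated (in practice via a logistic regression on device and behavior covariates); all user IDs and score values here are hypothetical:

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    `treated` and `controls` are lists of (user_id, propensity_score)
    tuples; pairs further apart than `caliper` are left unmatched.
    """
    available = dict(controls)  # control_id -> score, shrinks as we match
    pairs = []
    for uid, score in treated:
        if not available:
            break
        # Pick the control user whose score is closest to this treated user.
        best = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best] - score) <= caliper:
            pairs.append((uid, best))
            del available[best]  # matching without replacement
    return pairs

pairs = match_nearest(
    [("t1", 0.80), ("t2", 0.30)],
    [("c1", 0.78), ("c2", 0.31), ("c3", 0.55)],
)
```

Note that this mechanic only balances *observed* covariates, which is exactly the weakness the text points out: an unobserved early-adopter trait remains unbalanced no matter how tight the caliper.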

The third option combined Difference-in-Differences at the regional level with Instrumental Variables at the user level. The flag of the feature's technical availability on a device (which depends on the smartphone model and OS version but not directly on user preferences) served as an instrument for predicting actual usage via Two-Stage Least Squares. For regions with unique dialects (Kazan, Novosibirsk), Synthetic Control was applied, weighting control regions by their past conversion trends. The pros are that the availability effect is separated from user self-selection and regional trends are controlled for. The cons are the difficulty of interpreting the Local Average Treatment Effect (LATE) and the reliance on the parallel trends assumption. This option was chosen as the most robust.
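The Two-Stage Least Squares mechanics can be demonstrated with a single-regressor pure-Python sketch. The data are synthetic and crafted so the true usage effect is exactly 3: an unobserved tech-affinity trait `u` both drives adoption and lifts the outcome, so naive OLS is biased upward, while instrumenting usage with device eligibility `z` recovers the truth:

```python
def ols_slope(x, y):
    # Simple-regression slope: cov(x, y) / var(x).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

def two_stage_least_squares(z, d, y):
    # Stage 1: predict actual usage d from the instrument z
    # (device eligibility, which users cannot self-select into directly).
    b1 = ols_slope(z, d)
    a1 = sum(d) / len(d) - b1 * sum(z) / len(z)
    d_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome on the predicted (exogenous) usage.
    return ols_slope(d_hat, y)

# Synthetic users: z = device eligibility, u = unobserved tech affinity.
z = [0, 0, 0, 0, 1, 1, 1, 1]
u = [0.2, 0.8, 0.2, 0.8, 0.2, 0.8, 0.2, 0.8]
d = [1 if zi == 1 and ui > 0.5 else 0 for zi, ui in zip(z, u)]  # only affine users adopt
y = [2 * ui + 3 * di for ui, di in zip(u, d)]  # true causal effect of usage = 3

naive = ols_slope(d, y)               # biased: adopters have high u anyway
iv = two_stage_least_squares(z, d, y) # recovers 3.0 (the LATE for compliers)
```

With one binary instrument this collapses to the Wald estimator, which is why the resulting number is a LATE: it describes compliers (eligible users who actually adopt), not the whole population.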

The analysis showed that voice search delivered an incremental 18% increase in browsing depth among users with compatible devices, but no statistically significant effect on purchase conversion. Moreover, in categories rich in technical terms (computer components), conversion actually declined because domain-specific vocabulary was misrecognized. This let the team adjust the roadmap: improve recognition of technical terms before scaling, and focus marketing on categories of "simple" products (home appliances), where voice search performed best.

What candidates often overlook

How to separate the short-term novelty effect from sustained behavioral change when evaluating voice interfaces?

Candidates often ignore the temporal dynamics of adaptation. Run a cohort analysis from each user's first day of using the feature and track usage retention over a 3-4 week horizon. If usage intensity decays exponentially back to baseline, the effect is mere novelty. For an accurate estimate, use only the steady-state period or weight observations by cohort age. It is also important to check effect heterogeneity by usage frequency: power users may show sustained behavior change, while casual users are more prone to the novelty effect.
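One simple way to quantify the decay is a log-linear fit of daily usage on day index: if the slope is negative, the implied half-life tells how fast novelty wears off. The sketch below uses a synthetic cohort whose usage halves every 7 days; real data would be noisy, so the fitted half-life would only be approximate:

```python
import math

def novelty_half_life(daily_usage):
    """Fit log(usage) = a + b*day by least squares.

    If b < 0 the series decays and the half-life is ln(2)/|b|;
    a flat or growing series returns infinity (no novelty decay).
    """
    days = range(len(daily_usage))
    logs = [math.log(u) for u in daily_usage]
    n = len(daily_usage)
    mx, my = sum(days) / n, sum(logs) / n
    b = sum((d - mx) * (l - my) for d, l in zip(days, logs)) / sum(
        (d - mx) ** 2 for d in days
    )
    return math.log(2) / abs(b) if b < 0 else float("inf")

# Hypothetical cohort: sessions halve every 7 days over a 4-week window.
usage = [100 * 0.5 ** (day / 7) for day in range(28)]
half_life = novelty_half_life(usage)  # ~7 days: pure novelty, no steady state
```

A short half-life relative to the observation window is the signature of a novelty effect; a slope near zero after the first weeks suggests the cohort has reached steady state.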

How to properly handle zero values in the data when a user activated voice search but received no results due to recognition errors?

Standard linear regression or logistic models are incorrect here due to the mixed distribution: a mass of zeros (failed attempts) and a continuous distribution of positive outcomes. It is necessary to apply a Two-part model (hurdle model) or Zero-Inflated Negative Binomial for count metrics (number of views). The first part of the model assesses the probability of a successful search (selection equation), while the second assesses the intensity of use given success (outcome equation). Ignoring this structure leads to an underestimation of the effect, as failed attempts are mistakenly classified as a lack of interest rather than a technical barrier.
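The identity that the two-part model estimates with covariates can be shown descriptively: the overall mean decomposes into P(success) times the mean intensity given success. The numbers below are hypothetical; a real analysis would replace the two raw averages with a logistic regression and a (zero-truncated) count regression:

```python
def two_part_effect(views_per_attempt):
    """Hurdle-style decomposition of mean views per voice-search attempt.

    Returns (p_success, intensity_given_success, overall_mean), where
    overall_mean == p_success * intensity_given_success.
    """
    n = len(views_per_attempt)
    positives = [v for v in views_per_attempt if v > 0]
    p_success = len(positives) / n                      # hurdle: got any results
    intensity = sum(positives) / len(positives) if positives else 0.0
    return p_success, intensity, p_success * intensity

# Hypothetical attempts: zeros are failed recognitions, not lack of interest.
attempts = [0, 0, 0, 0, 4, 6, 8, 2]
p, intensity, overall = two_part_effect(attempts)
```

Separating the two parts matters precisely because a fix to the recognizer moves `p` (the technical barrier) while product changes move `intensity`; a single pooled model conflates them.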

Why can't a simple Intent-to-Treat (ITT) comparison of all users in the implementation region against a control region be used in this case?

ITT analysis mixes the effect of feature availability with the effect of its actual usage, diluting the estimate. If only 10% of the audience has compatible devices and only 20% of them try the feature, ITT will show a 2% effect even when the feature is 100% effective for actual users. For business decisions, the Treatment-on-the-Treated (TOT) effect, or the Local Average Treatment Effect (LATE) obtained via instrumental variables, is what matters. Candidates overlook that compliance with assignment is far from 100% here, so the ITT estimate must be scaled up by dividing it by the share of compliers to recover the effect on those who actually use the feature.
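The scaling step is one line of arithmetic, sketched here with the text's own numbers (10% eligible devices, 20% adoption among them):

```python
def late_from_itt(itt_effect, share_eligible, share_adopting):
    # Wald scaling: LATE = ITT / compliance rate, where compliance is the
    # share of the whole audience that actually uses the feature.
    compliance = share_eligible * share_adopting
    return itt_effect / compliance

# 10% have compatible devices, 20% of those try voice search -> 2% compliance.
# A 2-point ITT lift therefore implies a 100% effect on actual users.
late = late_from_itt(0.02, 0.10, 0.20)
```

This division is valid only under the usual IV assumptions (no effect of availability on never-takers, monotonicity); with a tiny compliance rate the LATE estimate also inherits a very wide confidence interval, so the denominator should never be treated as noise-free.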