Product Analytics (IT) — Senior Product Analyst / Lead Product Analyst

What method should be used to isolate the causal effect of implementing AI visual search (image-based search) on shelf depth and conversion in the fashion marketplace, given that the adoption of the feature correlates with users' technical awareness, text search is cannibalized with different intent complexities of queries, and the quality of image recognition varies based on the device's camera characteristics and lighting conditions?


Answer to the question

The evolution of e-commerce from text search to a multimodal interface began with the emergence of Convolutional Neural Networks (CNN) in mobile applications in the mid-2010s. Classical approaches to A/B testing here encounter hardware fragmentation: the same visual search algorithm demonstrates varying accuracy on flagship devices and budget smartphones.

Early studies showed that users with low-end devices exhibit systematically different browsing patterns, threatening to violate the assumption that errors are independent of covariates in standard econometric models. This makes simple group comparisons via t-tests or basic regression methodologically invalid.

Fundamental endogeneity arises from self-selection at the adoption level: technically savvy users (early adopters) are simultaneously inclined to try the new feature and have a high baseline conversion. Additionally, structural cannibalization is observed: visual search "takes away" queries from text search but simultaneously transforms less informative text queries into highly informative visual embeddings.

The technical heterogeneity of camera quality introduces an additional layer of measurement error, correlating with the user's SES profile. Standard methods for controlling selection bias, such as Propensity Score Matching, are insufficient here due to the presence of unobserved heterogeneity in users' visual literacy.

The optimal strategy is Two-Stage Least Squares (2SLS) using camera capabilities (availability of a telephoto lens, support for Night Mode) as instrumental variables (IV). The exclusion restriction holds only if camera specifications affect conversion solely through the ability to use visual search, and not through correlated income characteristics.
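The 2SLS logic can be sketched on synthetic data. All variable names, effect sizes, and noise levels below are illustrative assumptions, not marketplace figures; the point is that the instrument recovers the causal effect that naive OLS overstates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

has_telephoto = rng.binomial(1, 0.4, n).astype(float)   # instrument Z
tech_savvy = rng.normal(0, 1, n)                         # unobserved confounder
# Adoption is driven by the instrument AND the confounder (endogenous)
uses_visual = (0.8 * has_telephoto + 0.6 * tech_savvy
               + rng.normal(0, 1, n) > 0.5).astype(float)
# Conversion: the true causal effect of the feature is assumed to be 0.04
conversion = (0.04 * uses_visual + 0.05 * tech_savvy
              + rng.normal(0, 0.2, n))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
# Stage 1: project endogenous adoption onto the instrument
Z = np.column_stack([const, has_telephoto])
adoption_hat = Z @ ols(Z, uses_visual)
# Stage 2: regress conversion on predicted (exogenous) adoption
beta_iv = ols(np.column_stack([const, adoption_hat]), conversion)[1]
# Naive OLS for comparison (biased upward by tech_savvy)
beta_ols = ols(np.column_stack([const, uses_visual]), conversion)[1]
print(f"naive OLS: {beta_ols:.3f}, 2SLS: {beta_iv:.3f}")
```

In production one would use a dedicated library (e.g. `linearmodels.IV2SLS`) to get correct IV standard errors rather than the manual two-stage regression shown here.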

The validity of the instrument is checked through an Overidentification Test using exogenous variation in camera batches. For cannibalization, Principal Stratification is applied: dividing users into strata based on a latent class model, where classes are defined by the likelihood of switching from text search.
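With two instruments and one endogenous regressor, the Sargan overidentification statistic has one degree of freedom and can be computed directly. A minimal sketch with two hypothetical instruments (telephoto lens and Night Mode) on synthetic data where both instruments are valid by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Two hypothetical instruments: telephoto lens and Night Mode support
z1 = rng.binomial(1, 0.4, n).astype(float)
z2 = rng.binomial(1, 0.3, n).astype(float)
confounder = rng.normal(0, 1, n)
adoption = (0.6 * z1 + 0.5 * z2 + 0.5 * confounder
            + rng.normal(0, 1, n) > 0.5).astype(float)
conversion = 0.04 * adoption + 0.05 * confounder + rng.normal(0, 0.2, n)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
Z = np.column_stack([const, z1, z2])
# 2SLS with both instruments, then structural residuals
adoption_hat = Z @ fit(Z, adoption)
beta = fit(np.column_stack([const, adoption_hat]), conversion)
resid = conversion - np.column_stack([const, adoption]) @ beta

# Sargan J = n * R^2 from regressing residuals on all instruments;
# under valid instruments J ~ chi2(#instruments - #endogenous) = chi2(1)
resid_hat = Z @ fit(Z, resid)
r2 = ((resid_hat - resid.mean()) ** 2).sum() / ((resid - resid.mean()) ** 2).sum()
sargan = n * r2
print(f"Sargan J = {sargan:.2f} (chi2(1) 5% critical value = 3.84)")
```

A J statistic above the critical value would signal that at least one instrument likely violates the exclusion restriction.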

Heterogeneous Treatment Effects are assessed through Causal Forests with clustering at the device-type level to account for error correlation within hardware classes. Additionally, shooting metadata (EXIF data on exposure) is controlled to isolate the effect specifically from recognition rather than external conditions.
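A real Causal Forest requires a dedicated library (e.g. econml or grf); the role of device-level clustering, however, can be illustrated with a simpler per-segment effect estimate plus a cluster bootstrap. The data, segment split, and effect sizes below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic device clusters with correlated within-cluster errors
n_clusters, users_per = 40, 250
cluster_id = np.repeat(np.arange(n_clusters), users_per)
treated = rng.binomial(1, 0.5, cluster_id.size)
high_end = cluster_id < n_clusters // 2            # first half = premium devices
cluster_shift = rng.normal(0, 0.02, n_clusters)[cluster_id]
y = (0.02 * treated + 0.04 * treated * high_end    # assumed heterogeneous effect
     + cluster_shift + rng.normal(0, 0.1, cluster_id.size))

def per_cluster_effects(segment_mask):
    # Within-cluster treated-vs-control differences for one device segment
    effs = []
    for c in np.unique(cluster_id[segment_mask]):
        m = cluster_id == c
        effs.append(y[m & (treated == 1)].mean() - y[m & (treated == 0)].mean())
    return np.array(effs)

def bootstrap_ci(effs, n_boot=2000):
    # Resample whole clusters to respect within-cluster error correlation
    means = [rng.choice(effs, size=effs.size, replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

eff_hi = per_cluster_effects(high_end)
eff_lo = per_cluster_effects(~high_end)
print("premium:", eff_hi.mean().round(3), bootstrap_ci(eff_hi).round(3))
print("budget: ", eff_lo.mean().round(3), bootstrap_ci(eff_lo).round(3))
```

Resampling at the cluster level, rather than the user level, is what keeps the confidence intervals honest when errors are correlated within hardware classes.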

Real-life situation

The team at the marketplace "FashionHub" launched visual search on 20% of traffic, observing an 18% conversion increase among adopters. However, an audit revealed that 70% of users with iPhone 12+ (high-quality camera) ended up in the test group, while the Android budget segment remained in the control, creating hardware-based confounding. The key metric — the average number of viewed product cards before purchase — increased disproportionately in the premium devices segment.

A crude comparison of adopters vs. non-adopters would give a +18% conversion estimate but would carry survivorship bias: users who photographed products had already demonstrated high purchase intent and tolerance for UX friction. The plus of the approach is simplicity of interpretation and speed of results. The downside is the inability to separate the causal effect of the feature from the self-selection of a technically savvy audience with high baseline conversion.

A geographic rollout with Difference-in-Differences proposed launching first in Moscow (high penetration of premium smartphones), then in the regions a month later. The plus is the ability to account for temporal trends and seasonality in fashion. The downside is that the regions differed in disposable income and attitudes toward fashion, which violated the parallel-trends assumption; the Moscow audience had systematically different elasticity to novelty in digital features.

Instrumental Variables combined with Propensity Score Matching used the technical impossibility of running visual search on devices without autofocus and OIS (Optical Image Stabilization) as a natural experiment. Users with compatible devices were matched to users with similar demographics and text-search history but unsupported devices. The plus is the exogeneity of the instrument (hardware precedes the purchase decision); the relevance requirement was confirmed by a first-stage F-statistic of 45, well above the conventional threshold of 10. The downside is that the exclusion restriction still required confidence that the camera affects purchases only through search.
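The first-stage F-statistic is just the F-test for adding the instrument to the first-stage regression. A sketch on synthetic data (the compatibility flag and effect size are assumptions, so the resulting F is illustrative, not the 45 from the case):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000

# Hypothetical instrument: hardware supports autofocus + OIS
compatible = rng.binomial(1, 0.5, n).astype(float)
adoption = (0.5 * compatible + rng.normal(0, 1, n) > 0.5).astype(float)

def ssr(X, y):
    # Sum of squared residuals from an OLS fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

const = np.ones(n)
ssr_restricted = ssr(const[:, None], adoption)                        # no instrument
ssr_unrestricted = ssr(np.column_stack([const, compatible]), adoption)  # with instrument
# F-test for one added regressor: ((SSR_r - SSR_u) / 1) / (SSR_u / (n - k))
f_stat = (ssr_restricted - ssr_unrestricted) / (ssr_unrestricted / (n - 2))
print(f"first-stage F = {f_stat:.1f} (rule of thumb: F > 10)")
```

An F below 10 would flag a weak instrument, under which 2SLS estimates become badly biased toward OLS.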

An IV solution was chosen, with additional control for lighting conditions via an API determining time of day and analysis of EXIF metadata from photos (ISO, exposure time). The final result: the true Local Average Treatment Effect (LATE) was +4.2% to conversion (the rest of the naive +18% was selection bias), with the effect concentrated in the footwear category (where color matching is critical) and absent in accessories (where brand dominates over visual characteristics).

What candidates often overlook

Why can't we just do an A/B test at the user level, if the infrastructure allows?

Candidates ignore network effects in training the Visual Embeddings Model: when users take photos, that data enters the training sample of a Siamese Network, improving search quality for all users, including the control group (spillover effects). Additionally, SUTVA (Stable Unit Treatment Value Assumption) is violated through ranking contamination: if visual search raises relevant products in the overall recommendation feed, it affects the behavior of the control group.

The solution is Cluster Randomization at the device-type level or using Exposure Mapping with adjustment for feature usage intensity in the cluster through Inverse Probability Weighting.
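A minimal sketch of cluster-level randomization with a Horvitz-Thompson / IPW estimator (full exposure mapping is considerably more involved; cluster counts, effect size, and noise here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Clusters = device types; randomizing whole clusters keeps spillovers
# through the shared embedding model inside each arm
n_clusters, users_per, p_treat = 60, 200, 0.5
cluster_assign = rng.binomial(1, p_treat, n_clusters)
cluster_id = np.repeat(np.arange(n_clusters), users_per)
treated = cluster_assign[cluster_id]
cluster_shift = rng.normal(0, 0.01, n_clusters)[cluster_id]
y = 0.10 + 0.03 * treated + cluster_shift + rng.normal(0, 0.05, cluster_id.size)

# Horvitz-Thompson / inverse probability weighting: weight each unit by
# the inverse of the probability of the exposure it actually received
w_t = treated / p_treat
w_c = (1 - treated) / (1 - p_treat)
ate = (w_t * y).sum() / w_t.sum() - (w_c * y).sum() / w_c.sum()
print(f"cluster-randomized IPW estimate: {ate:.4f}")
```

With equal assignment probabilities this reduces to a mean difference; the weighting becomes essential once exposure probabilities vary across clusters (e.g. with intensity-based exposure mapping).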

How do we separate text search cannibalization from the creation of new demand when intent is unobservable?

The standard total queries comparison ignores quality-adjusted volume. The Principal Stratification Framework needs to be applied: define four strata (Compliers, Never-takers, Always-takers, Defiers) based on the potential outcomes of using text search with/without visual search.
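When availability of the feature is randomized and monotonicity holds (no Defiers), the Complier Average Causal Effect reduces to ITT divided by the complier share. A synthetic sketch with assumed strata proportions (in practice the strata are latent; only this ratio is identified):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Latent strata with assumed shares; Defiers are ruled out by monotonicity
strata = rng.choice(["complier", "never", "always"], size=n, p=[0.3, 0.6, 0.1])
offered = rng.binomial(1, 0.5, n)           # randomized feature availability
uses = (strata == "always") | ((strata == "complier") & (offered == 1))
# Assumed: the feature lifts complier conversion by 5 percentage points
p_convert = 0.10 + 0.05 * (uses & (strata == "complier"))
converted = rng.binomial(1, p_convert)

itt = converted[offered == 1].mean() - converted[offered == 0].mean()
complier_share = uses[offered == 1].mean() - uses[offered == 0].mean()
cace = itt / complier_share                  # Complier Average Causal Effect
print(f"ITT = {itt:.4f}, complier share = {complier_share:.3f}, CACE = {cace:.4f}")
```

Note that the estimator never needs to know which individual user is a complier; the share is identified from the difference in usage rates between arms.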

Then estimate Complier Average Causal Effect (CACE) for those who would switch from text to visual only if available. Additionally, use Embedding Space Distance between user text queries and product categories: if visual search reduces the semantic distance between query and purchase, this is an incremental effect, not substitution.
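The embedding-distance diagnostic amounts to a cosine-distance comparison. The 128-dimensional vectors below are random stand-ins for real encoder outputs, constructed so that the visual query carries a sharper intent signal than the vague text query:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; lower = semantically closer
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(6)
category = rng.normal(0, 1, 128)                       # category embedding
vague_text_query = category + rng.normal(0, 2.0, 128)  # noisy intent signal
visual_query = category + rng.normal(0, 0.5, 128)      # sharper intent signal

d_text = cosine_distance(vague_text_query, category)
d_visual = cosine_distance(visual_query, category)
print(f"text-query distance: {d_text:.3f}, visual-query distance: {d_visual:.3f}")
```

If the distribution of query-to-purchase distances shifts downward for visual sessions without a matching drop in text volume for the same intents, that pattern is evidence of incrementality rather than substitution.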

What is the danger of conditioning on the number of successful recognitions when analyzing retention?

This is classic Collider Bias (an M-structure): conditioning on "recognition success" (which depends on both camera quality and query complexity) opens spurious paths between hardware and retention. Candidates often filter out "failed uploads," creating selection on the dependent variable.

The correct approach is a Heckman two-step correction or a Tobit model for zero-inflated outcomes, where the decision to use the feature and the outcome conditional on usage are modeled jointly, incorporating the Inverse Mills Ratio from a first-stage probit with selection predictors (lighting, time of day, product category).
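The Inverse Mills Ratio itself is cheap to compute from the first-stage probit index. A self-contained sketch using only standard-library math (the z-scores below are hypothetical probit indices, not fitted values):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def inverse_mills_ratio(z):
    # lambda(z) = phi(z) / Phi(z), evaluated at the first-stage probit index
    return norm_pdf(z) / norm_cdf(z)

# Hypothetical probit indices for "user uploads a recognizable photo"
z_scores = np.array([-1.5, 0.0, 1.5])
imr = np.array([inverse_mills_ratio(z) for z in z_scores])
print(dict(zip(z_scores.tolist(), imr.round(3).tolist())))
# In Heckman's second step, the IMR enters the outcome regression
# (estimated on selected users only) as an extra regressor; a significant
# IMR coefficient indicates that selection bias was indeed present.
```

Low probit indices (users unlikely to be selected) yield large IMR values, which is exactly where the uncorrected regression would be most distorted.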