Product Analytics (IT): Product Analyst / Analytics Lead

What method should be used to assess the causal effect of implementing the virtual try-on feature (AR try-on) on the reduction of returns and the increase in conversion in the accessories category, considering that the availability of the feature is constrained by device specifications (TrueDepth camera/ARKit), which creates a systematic selection bias based on users' income, and that the rollout occurs gradually across product categories?


Answer to the question

Historically, approaches to evaluating AR features in product analytics relied on correlational analysis or simple mean comparisons between users with and without hardware support. This framework dominated until 2018, when retail researchers began to account for systematic audience differences tied to device price tiers. Owners of flagship smartphones with ARKit or ARCore differ significantly in income, technological adaptability, and propensity for impulse purchases of high-margin products.

Direct comparisons therefore carry a self-selection bias of up to 40%, making it impossible to separate the feature's effect from pre-existing differences between groups. Classic A/B testing is not feasible either: forcing AR onto incompatible devices leads to technical failures, app crashes, and a distorted user experience, which violates SUTVA (the Stable Unit Treatment Value Assumption, since the delivered treatment then varies by device) and provokes a negative engagement reaction.

The optimal solution is a Regression Discontinuity Design (RDD) around a device-specification threshold: for example, comparing iPhone 8 Plus and iPhone X users, who are similar in secondary-market price and demographics but differ critically in the availability of the TrueDepth camera required for AR. To account for the phased rollout across product categories, we augment the analysis with Difference-in-Differences (DiD) using two-way fixed effects for category and time, controlling for seasonality and assortment differences. Finally, we apply Propensity Score Matching (PSM) on device price segment and purchase history to adjust for residual heterogeneity within the local RDD zone, and use Inverse Probability Weighting to extrapolate the local average treatment effect (LATE) to the population.
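A minimal sketch of the two-way fixed effects DiD piece on synthetic data. Everything here is invented for illustration: 20 categories, 30 weeks, a staggered rollout week per category, and an assumed true effect of +0.08 on log-conversion; category and week dummies absorb level and seasonal shifts.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical panel: 20 categories x 30 weeks with a staggered AR rollout;
# the true effect on log-conversion is set to +0.08 for illustration.
rows = []
for cat in range(20):
    rollout_week = rng.integers(5, 25)          # category-specific launch week
    cat_effect = rng.normal(0, 0.5)             # category fixed effect
    for week in range(30):
        treated = int(week >= rollout_week)
        y = (1.0 + cat_effect + 0.02 * week     # common time trend
             + 0.08 * treated + rng.normal(0, 0.05))
        rows.append({"cat": cat, "week": week, "treated": treated, "log_conv": y})
df = pd.DataFrame(rows)

# Two-way fixed effects: category and week dummies absorb level differences
# and seasonality; the `treated` coefficient is the DiD estimate.
m = smf.ols("log_conv ~ treated + C(cat) + C(week)", data=df).fit()
print(round(m.params["treated"], 3))
```

Note that with a homogeneous effect, as simulated here, plain TWFE recovers it; under staggered adoption with heterogeneous effects, TWFE can be biased, and group-time estimators (e.g. Callaway and Sant'Anna) are the safer choice.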

Real-life situation

At a major fashion marketplace, AR try-on for sunglasses launched in the fall of 2023 using face-tracking technology. The feature worked exclusively on the iPhone X and newer, plus flagship Android devices with Google ARCore, automatically excluding 60% of the audience on budget devices. Preliminary analytical reports showed that users with AR access converted to purchase 3.5 times more often and returned products 30% less often, but the team suspected strong selection bias: owners of expensive phones historically showed higher average baskets and loyalty regardless of new features.

The first option considered was a direct comparison of means with a t-test or Mann-Whitney U test between the AR-accessible and AR-inaccessible groups, with no adjustments. Its advantages were immediate calculation, minimal data requirements, and a result intuitive to business stakeholders. The downsides were critical: severe endogeneity with respect to income and technological awareness made it impossible to separate the feature's effect from pre-existing differences between user segments.

The second option was a cohort before-after analysis of users who upgraded from an incompatible to a compatible device during the observation period. Its advantage was controlling for individual heterogeneity through within-subject comparison, eliminating bias from unmeasured user characteristics. The downsides were a strong novelty effect, seasonality (phone-upgrade peaks in December and September correlate with different purchasing patterns), and self-selection in upgrade timing (motivated users change phones more often).

The third option was Regression Discontinuity Design around the iPhone X threshold (the A11 Bionic chip), comparing iPhone 8 Plus and iPhone X users, who are statistically indistinguishable in socio-demographic characteristics and secondary-market price but differ in the presence of the TrueDepth camera. The advantage of this method is quasi-random assignment in the local zone around the threshold, yielding a valid causal estimate (LATE) without randomization. The downsides are limited external validity (the result applies only to "marginal" users choosing between the older and newer flagship) and the need to verify the continuity assumption and the absence of manipulation at the cutoff.

A combined solution was chosen: RDD to estimate the feature's pure effect on marginal users at the device threshold, integrated with Difference-in-Differences with staggered adoption to account for the gradual rollout across product categories (premium brands first, then the mass market). To extrapolate from the threshold to the full population, Inverse Probability Weighting (IPW) was applied over the distribution of device prices and demographics. The final result showed a true effect of +8% on conversion and -12% on returns, while the naive, unadjusted analysis showed a distorted +35% and -28%, which materially changed the business decision on scaling the feature and prevented inflated investment expectations.

What candidates often miss

How to correctly handle network effects (spillover effects) when AR users share photos of virtual try-ons on social media or messengers, influencing purchase decisions of their contacts who do not have compatible devices and formally belong to the control group?

Candidates often ignore SUTVA violations through the social graph, assuming the groups are isolated. In practice, if a friend sees an eyewear try-on in Instagram Stories and makes a purchase, the control group is contaminated. The correct approach is Two-Stage Least Squares (2SLS) with an instrumental variable (the release date of a specific phone model in a specific region) that affects the "sender's" access to AR but not the "receiver's" purchase directly. Alternatively, exposure mapping is used: we model the intensity of social ties between users and add a treatment × exposure interaction to the model, which lets us separate the direct effect of AR from the indirect effect of virality.

Why is the Intent-to-Treat (ITT) methodology with subsequent calculation of Local Average Treatment Effect (LATE) preferable to attempts to conduct a "forced" A/B test by forcibly turning on the AR feature for a random half of the audience, even if technically possible through cloud rendering?

This question tests understanding of experimental ethics and compliance constraints. Forcibly enabling AR through cloud rendering on incompatible devices creates an artificial UX with high latency and low resolution, producing a degraded experience and mass churn, and violating the "no harm" principle. It also creates selection into non-compliance: users quickly disable the feature or uninstall the app, making the effect unidentifiable and introducing compliance bias. The right approach is an encouragement design: instead of forced activation, we randomly show a banner inviting users to try AR (only to owners of compatible devices) and run an ITT analysis where the treatment is the offer, not actual usage. Then, through IV regression (with the randomized offer as the instrument), we obtain LATE, the effect for those who actually used the feature (compliers), a conservative but causally clean estimate without the risk of technically sabotaging the product.
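The ITT-to-LATE step can be sketched with the Wald estimator on simulated data. The take-up rates (40% with the banner, 5% without) and the +0.08 conversion effect are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical encouragement design (all rates invented): a randomized
# banner invites compatible-device users to try AR; only some comply.
n = 50000
banner = rng.binomial(1, 0.5, n)
# 40% try AR when offered, 5% discover it organically without the banner.
used_ar = rng.binomial(1, np.where(banner == 1, 0.40, 0.05))
# True effect of actually using AR on conversion probability: +0.08.
conv = rng.binomial(1, 0.10 + 0.08 * used_ar)

itt = conv[banner == 1].mean() - conv[banner == 0].mean()     # effect of the offer
take_up = used_ar[banner == 1].mean() - used_ar[banner == 0].mean()
late = itt / take_up   # Wald estimator: effect for compliers
print(round(itt, 3), round(late, 2))
```

The ITT is deliberately diluted (only ~35% of the offered group actually changes behavior), and dividing by the take-up difference rescales it back to the per-complier effect.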

How to account for catalog coverage bias when AR models are created only for 30% of products, primarily from the premium segment, creating bias in evaluating average baskets and LTV, if analyzing only available SKUs?

Candidates forget the problem of generalizability and truncation bias when comparing the premium segment (where AR is available) with the mass market (where it is not). Without adjusting the sample, we mistakenly attribute the higher basket to AR when we are really measuring the difference between price segments. The solution is Inverse Probability Weighting (IPW) or doubly robust estimation: first, model the propensity score, the probability that an AR model exists for a product given its observable characteristics (price, brand, category, seasonality); then weight observations inversely to this probability, making the AR sample representative of the full catalog. Additionally, synthetic control methods can be used for categories without AR, building a weighted combination of AR categories that mimics the counterfactual behavior of the missing categories, which allows assessing the effect at the business level rather than only on the premium subset.
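A small IPW sketch on a simulated catalog, with all numbers invented: AR availability is driven by price, basket size is driven by price alone (the true AR effect is zero), so the naive AR-vs-no-AR gap is pure price-segment bias that reweighting removes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Hypothetical catalog: AR models mostly exist for premium (pricier) SKUs.
n = 20000
log_price = rng.normal(4.0, 0.6, n)
p_ar = 1 / (1 + np.exp(-2 * (log_price - 4.5)))   # pricier SKUs more likely to get AR
has_ar = rng.binomial(1, p_ar)
# Basket size depends on price only; the true AR effect is 0 in this simulation.
basket = 10 * np.exp(0.5 * log_price) + rng.normal(0, 5, n)
df = pd.DataFrame({"log_price": log_price, "has_ar": has_ar, "basket": basket})

# Naive gap: AR SKUs look far better only because they are pricier.
naive_gap = (df.loc[df.has_ar == 1, "basket"].mean()
             - df.loc[df.has_ar == 0, "basket"].mean())

# IPW: model AR availability, then weight each group to mimic the full catalog.
ps = smf.logit("has_ar ~ log_price", data=df).fit(disp=0).predict(df)
w = np.where(df.has_ar == 1, 1 / ps, 1 / (1 - ps))
ar = (df.has_ar == 1).to_numpy()
ipw_gap = (np.average(df.basket[ar], weights=w[ar])
           - np.average(df.basket[~ar], weights=w[~ar]))
print(round(naive_gap, 1), round(ipw_gap, 1))
```

In practice the propensity model would include brand, category, and seasonality as well, and extreme weights should be inspected or trimmed before trusting the reweighted estimate.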