Product Analytics (IT): Product Analyst

What method should be used to quantitatively assess the causal effect of implementing the Stories format (ephemeral brand content) in the marketplace feed on app return frequency and average order value, given that the implementation occurs in stages across product categories, there are network effects among subscribers of the same brand, and the perception of the format is heterogeneous across age cohorts?


Answer to the question

Historical context

The evolution of e-commerce over the last decade has shifted from static catalogs to interactive formats borrowed from social media. The Stories format, originally popularized by Snapchat and Instagram, has been adapted by marketplaces as a tool to reduce cognitive load in product choice through short visual narratives. However, unlike classic A/B tests of UI elements, evaluating the effect of ephemeral content faces the problem of cross-contamination, where a user assigned to the control group still sees a friend's Stories from the test group.

Problem statement

Isolating the pure effect is complicated by three endogenous factors. First, brands self-select based on their ability to produce high-quality video content (larger players launch first), creating a selection bias. Second, network effects within the subscription graph lead to a spillover effect, where the impact "leaks" from the test group to the control group through social ties. Third, Gen Z users exhibit 3-4 times higher engagement with Stories than the 45+ audience, necessitating stratified analysis.

Detailed solution

The optimal methodology is staggered Difference-in-Differences (DiD) with spatio-temporal variation, where product categories, rolled out at different times, serve as treatment clusters. To control for network contamination, a leave-out strategy is employed: users with overlapping subscriptions to brands from different categories (treatment and control) are excluded. To correct for brand self-selection bias, Propensity Score Matching (PSM) is applied on historical engagement metrics and audience size prior to rollout. Variance is reduced through CUPED (Controlled-experiment Using Pre-Experiment Data), and effect heterogeneity is assessed with a Causal Forest, which identifies conditional average treatment effects (CATE) for different age segments.
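To make the CUPED step concrete, here is a minimal sketch on simulated data (all variable names and numbers are illustrative assumptions, not figures from the real experiment): a pre-period covariate is used to strip the explainable variance out of the experiment-period metric while leaving its mean unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): a pre-experiment metric x (e.g. historical
# D7 retention) correlates strongly with the experiment-period metric y.
n = 10_000
x = rng.normal(0.30, 0.05, n)        # pre-period covariate
y = x + rng.normal(0.02, 0.03, n)    # experiment-period metric

# CUPED adjustment: subtract the part of y explained by the pre-period
# covariate; theta = cov(x, y) / var(x) minimizes the residual variance.
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

# The adjusted metric keeps the same mean but has a much smaller variance,
# which tightens confidence intervals on the treatment effect.
print(f"var(y)       = {np.var(y):.6f}")
print(f"var(y_cuped) = {np.var(y_cuped):.6f}")
```

The same adjustment is applied separately to test and control before the DiD estimate; since the subtracted term has zero mean, the effect estimate is unbiased while its standard error shrinks.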

Real-life situation

A large fashion marketplace planned to roll out Stories for brands in the "Sportswear" category (test group) while keeping the classic product card in the "Business Clothing" category (control). The issue was that Nike and Adidas (test) had significantly more subscribers than traditional brands (control), and 40% of users were subscribed to brands in both categories, creating strong contamination. The task was to assess the effect on 7-day retention (D7 retention) and purchase conversion within 48 hours of viewing Stories.

Option 1: Simple before-after comparison for the test category

Analysts suggested comparing metrics of the sports category for a month before and after the launch of Stories. The advantages of this approach included immediate results and no need for complex infrastructure. The downsides were critical: the inability to separate the effect of the format from the seasonal increase in demand for sportswear in January (New Year Resolution effect) and from the marketing campaigns of brands launched simultaneously with the new functionality.

Option 2: Classic A/B test at the user level with a 50/50 split

This option proposed randomly splitting users by Stories visibility regardless of category. The benefits included a clean experimental design and ease of interpretation. The downsides included technical infeasibility (the content is created by brands, not the platform) and ethical constraints: hiding content from part of a brand's followers undermined the monetization model and led to complaints from advertisers.

Option 3: Staggered DiD with synthetic control matching and filtering of network ties

It was decided to utilize the temporal variation of the rollout (sports category — week 1, streetwear — week 3, classic — week 6) and to build a Synthetic Control from a weighted combination of categories that had not yet received the feature. To eliminate contamination, users whose subscription overlap exceeded 15% of their total subscriptions were excluded (a threshold determined through analysis of the social graph). CUPED was applied to correct for historical D7 retention.
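The leave-out filter can be sketched as follows; brand names, the user set, and the `overlap_share` helper are hypothetical illustrations of the 15% rule, not the marketplace's actual code:

```python
from typing import Dict, Set

# Hypothetical brand-to-category assignment (illustrative names only).
TREATMENT_BRANDS = {"nike", "adidas"}           # sportswear (test)
CONTROL_BRANDS = {"hugo", "zegna", "brioni"}    # business clothing (control)

def overlap_share(subs: Set[str]) -> float:
    """Share of a user's in-scope subscriptions in the minority category.

    A user subscribed only to treatment (or only to control) brands has
    zero overlap; a 50/50 split across categories has overlap 0.5.
    """
    t = len(subs & TREATMENT_BRANDS)
    c = len(subs & CONTROL_BRANDS)
    total = t + c
    return 0.0 if total == 0 else min(t, c) / total

def eligible(subs: Set[str], threshold: float = 0.15) -> bool:
    """Keep the user only if cross-category overlap is within the threshold."""
    return overlap_share(subs) <= threshold

# Toy user base: u3 subscribes across both categories and is dropped.
users: Dict[str, Set[str]] = {
    "u1": {"nike"},
    "u2": {"hugo", "zegna"},
    "u3": {"nike", "hugo"},
}
kept = {uid for uid, subs in users.items() if eligible(subs)}
print(kept)  # u3 is excluded: overlap_share = 0.5 > 0.15
```

In production the same rule would run over the full subscription graph; the point of the sketch is that contamination is removed at the user level before any DiD estimation, at the cost of a smaller (and slightly less representative) sample.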

Chosen solution:

The team chose Option 3, supplemented with a Causal Forest for age segmentation. This made it possible not only to isolate the pure effect but also to understand for whom Stories work best. A key factor in the choice was that business processes were preserved (all subscribers see the content) while a valid causal estimate was still obtained.

Final result:

The analysis revealed a statistically significant incremental increase in D7 retention of 8.4% (p < 0.01) for the 18-25 age segment, with no effect for 45+. However, a negative spillover was found: users who saw more than 5 Stories in a session showed a 3% decrease in purchase conversion (an oversaturation effect). Based on these findings, the product team implemented an adaptive algorithm regulating Stories display frequency by age, which led to 4.2% GMV growth in the test category without harming the user experience of older cohorts.

What candidates often overlook

How to correctly account for the negative spillover effect when an excess of one brand's Stories reduces receptiveness to content from other brands in the same session?

Candidates often focus only on positive network effects, ignoring oversaturation. A correct approach requires session-level rather than user-level analysis: split sessions into "high Stories density" (>3 unique brands) and "low density", then estimate the interaction effect between treatment and content density. If the coefficient is negative and significant, this indicates cannibalization of attention within the format. It is also necessary to check the temporal dynamics: whether users develop fatigue (wear-out) with the format over time, by decomposing the effect across rollout weeks.
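A minimal sketch of the session-level interaction test, on simulated data (the effect sizes and the linear probability model are illustrative assumptions): the coefficient on `treat * dense` is the oversaturation signal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Simulated session-level data: a treatment flag and a high-density flag
# (session contains Stories from >3 unique brands). Flags are independent
# here for simplicity; in real data density would need its own controls.
treat = rng.integers(0, 2, n)
dense = rng.integers(0, 2, n)

# Assumed ground truth: Stories lift conversion in low-density sessions,
# but the lift turns negative when sessions are oversaturated.
p = 0.10 + 0.03 * treat + 0.01 * dense - 0.05 * treat * dense
conv = rng.binomial(1, p)

# Linear probability model with a treatment x density interaction term.
X = np.column_stack([np.ones(n), treat, dense, treat * dense])
beta, *_ = np.linalg.lstsq(X, conv, rcond=None)
b0, b_treat, b_dense, b_interact = beta

# A negative, significant b_interact indicates attention cannibalization.
print(f"treatment effect (low density): {b_treat:+.3f}")
print(f"interaction (oversaturation):   {b_interact:+.3f}")
```

In practice one would add session covariates and cluster standard errors by user, since sessions of the same user are not independent; the interaction coefficient itself is read the same way.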

How to separate the effect of the Stories format from the effect of content quality, if brands with high production value self-select in the first waves of implementation?

Standard DiD will not solve the problem, as brand characteristics correlate with the baseline level of the metrics. A quasi-experimental design is required: the follower-count threshold at which the Stories feature becomes available (e.g., >100k followers) serves as an instrument in a regression discontinuity design (RDD). This creates as-good-as-random variation around the threshold, allowing comparison of brands with 99k and 101k subscribers, which are statistically identical in content quality but differ in access to the tool. Thus the pure effect of the format, rather than of creative quality, is isolated.
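The RDD estimate can be sketched as a local linear fit on each side of the cutoff; the 100k threshold, the +2pp simulated jump, and the bandwidth are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated brands near an assumed 100k-follower eligibility cutoff.
followers = rng.uniform(80_000, 120_000, n)
has_stories = (followers >= 100_000).astype(float)

# Outcome: retention rises smoothly with audience size, plus a simulated
# +0.02 jump at the cutoff attributable to Stories access.
retention = 0.20 + 1e-7 * followers + 0.02 * has_stories + rng.normal(0, 0.01, n)

# Local linear regression within a bandwidth around the cutoff, with
# separate slopes on each side (the x*d term).
cutoff, bandwidth = 100_000, 10_000
win = np.abs(followers - cutoff) <= bandwidth
x = followers[win] - cutoff
d = has_stories[win]
X = np.column_stack([np.ones(x.size), x, d, x * d])
beta, *_ = np.linalg.lstsq(X, retention[win], rcond=None)
rdd_effect = beta[2]  # size of the discontinuity at the threshold

print(f"estimated jump at cutoff: {rdd_effect:+.4f}")
```

A real analysis would also run density and covariate-smoothness checks at the cutoff (brands should not be able to manipulate their follower count across it) and report estimates across several bandwidths.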

Why are standard metrics like click-through rate (CTR) and view-through rate (VTR) insufficient for assessing the long-term effect of ephemeral content, and what metrics should be used instead?

Candidates focus on immediate engagement, missing the attribution of delayed purchases. Stories disappear after 24 hours but leave a "tag" in the user's memory (mental availability). Correct evaluation requires constructing a Surrogate Index: intermediate metrics (app-opening frequency over 7 days, Wishlist additions without purchase) serve as proxies for long-term LTV. The long-term causal effects method applies a two-step estimation: first, the relationship between the surrogates and final LTV is modeled on historical data; then this relationship is applied to the experimental data. This captures the "delayed conversion" effect, when a user sees Stories but purchases a week after the content disappears.
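The two-step surrogate-index estimation can be sketched as follows; the surrogate choices, coefficients, and effect sizes are illustrative assumptions, and a linear surrogate model stands in for whatever flexible model would be used in practice:

```python
import numpy as np

rng = np.random.default_rng(3)

# Step 1 (historical data): learn the surrogate -> LTV relationship.
# Surrogates: 7-day app opens and wishlist additions (illustrative).
n_hist = 10_000
opens = rng.poisson(5, n_hist).astype(float)
wishlist = rng.poisson(2, n_hist).astype(float)
ltv = 10 + 3 * opens + 5 * wishlist + rng.normal(0, 5, n_hist)

S = np.column_stack([np.ones(n_hist), opens, wishlist])
w, *_ = np.linalg.lstsq(S, ltv, rcond=None)  # fitted surrogate index weights

# Step 2 (experiment): Stories shift the surrogates; project the shift
# onto LTV using the historical mapping instead of waiting for real LTV.
n_exp = 10_000
treat = rng.integers(0, 2, n_exp)
opens_e = rng.poisson(5 + 1.0 * treat).astype(float)      # assumed +1 open
wishlist_e = rng.poisson(2 + 0.5 * treat).astype(float)   # assumed +0.5 adds

S_exp = np.column_stack([np.ones(n_exp), opens_e, wishlist_e])
ltv_hat = S_exp @ w
surrogate_effect = ltv_hat[treat == 1].mean() - ltv_hat[treat == 0].mean()

print(f"estimated long-term LTV lift: {surrogate_effect:+.2f}")
```

The validity of the estimate rests on the surrogacy assumption: the treatment affects long-term LTV only through the measured surrogates, and the historical surrogate-to-LTV mapping still holds during the experiment.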