Historical context: the concept of social proof dates back to Robert Cialdini's work in the 1980s, but mass adoption of real-time notifications in digital products took off in the mid-2010s, driven by the spread of WebSocket connections and Kafka-like streaming platforms. Classic A/B testing methods often yield biased estimates here because of network effects (SUTVA violations): one user's outcome depends on which other users are online. Early evaluation attempts boiled down to a simple comparison of sessions with and without a visible widget, which introduced substantial selection bias into the sample.
Problem: when estimating the effect, the true influence of the intervention must be separated from the endogenous variable of audience density. Simply comparing sessions with and without notifications produces selection bias: at peak hours conversion is already higher, and it is precisely then that the system generates more notifications. In addition, user migration between mobile apps and desktop causes contamination, blurring the boundary between treatment and control.
Solution: the preferred approach is difference-in-differences (DiD) estimation with two-way fixed effects for time zones and product categories, supplemented by an instrumental variable (IV) for audience density. The instrument is an exogenous shock, such as weather conditions or regional internet outages, that affects online activity but has no direct link to conversion. Alternatively, the synthetic control method can be applied, constructing the control group from similar products or regions without the feature, weighted by conversion history and seasonality.
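The DiD logic reduces, in its minimal 2x2 form, to a double difference of group means. The sketch below uses simulated data with made-up numbers (group and period effects, a +0.02 treatment effect on conversion) to show how level shifts between groups and between periods cancel out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal 2x2 DiD sketch on simulated data (all numbers hypothetical):
# group and period effects cancel in the double difference, leaving only
# the assumed treatment effect of +0.02 on conversion.
n = 5000
group_effect = {"treated": 0.05, "control": 0.03}   # e.g. category-level shifts
period_effect = {"pre": 0.00, "post": 0.04}         # e.g. evening demand surge
true_effect = 0.02

means = {}
for g in ("treated", "control"):
    for t in ("pre", "post"):
        base = 0.10 + group_effect[g] + period_effect[t]
        if g == "treated" and t == "post":
            base += true_effect                     # widget active only here
        means[(g, t)] = base + rng.normal(0, 0.01, n).mean()

did = (means["treated", "post"] - means["treated", "pre"]) \
    - (means["control", "post"] - means["control", "pre"])
print(f"DiD estimate: {did:.3f}")  # close to the true effect of 0.02
```

In the full specification described above, the two group/period dummies are replaced by two-way fixed effects for time zone and product category; the cancellation logic is the same.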
In an electronics marketplace, the team planned to roll out a widget reading "15 people are currently viewing this product", fed by real-time data streamed from ClickHouse. The problem was that the product team recorded an 18% increase in conversion during peak hours but could not separate the effect of the notifications from naturally high evening demand. In addition, an "empty room" effect appeared: during night hours the widget displayed zeros or stale data, which could erode trust.
The first option considered was a classic A/B test with geographic segmentation. Pros: simple to implement and easy to interpret. Cons: network effects are blurred because users in different cities see different assortments and baseline conversion rates; moreover, at low audience density in small towns the widget displayed "0 people currently viewing", creating negative social proof and reducing trust.
The second option was regression discontinuity design (RDD) based on the feature's launch timing in a specific region. Pros: clean causal identification at the cutoff and the option of a visual check on a plot. Cons: the novelty effect cannot be separated from the steady-state effect; in addition, a gradual rollout across time zones blurred the treatment boundary, violating RDD's key assumption of a sharp jump in treatment probability.
The third option was a quasi-experiment using products without the real-time functionality as a control group (DiD). Pros: seasonal trends are absorbed by fixed effects, and effect heterogeneity can be assessed by baseline traffic level. Cons: it requires the parallel-trends assumption, which was verified with an event-study specification including leads and lags.
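The event-study check mentioned above can be sketched as a regression on event-time dummies around the launch, with the period just before launch as the omitted baseline. The data and effect size below are simulated assumptions; the point is that lead coefficients (before launch) near zero support parallel trends, while lag coefficients carry the effect:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical event-study specification: dummies for each event time around
# the launch (t = 0), with t = -1 as the omitted baseline. Lead coefficients
# (t < 0) near zero support the parallel-trends assumption.
event_times = [-4, -3, -2, -1, 0, 1, 2, 3]
n_per = 2000
true_effect = 0.03                     # assumed post-launch effect

y_rows, x_rows = [], []
for t in event_times:
    outcome = 0.10 + (true_effect if t >= 0 else 0.0) + rng.normal(0, 0.01, n_per)
    dummies = [1.0 if t == s else 0.0 for s in event_times if s != -1]
    for v in outcome:
        y_rows.append(v)
        x_rows.append([1.0] + dummies)

beta, *_ = np.linalg.lstsq(np.array(x_rows), np.array(y_rows), rcond=None)
coefs = dict(zip(["const"] + [f"t={s}" for s in event_times if s != -1], beta))
# Expect: coefs["t=-4"] .. coefs["t=-2"] ~ 0 (leads); coefs["t=0"] onward ~ 0.03
```

A real check would also plot the coefficients with confidence intervals and formally test that the leads are jointly zero.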
The chosen solution was DiD with an instrumental variable based on weather data: rainy days in a region unexpectedly increased online activity (satisfying instrument relevance) but did not directly affect the desire to buy a phone (the exclusion restriction). The analysis showed that the widget's true effect was +9% conversion only at densities above 30 online users per SKU; at lower densities the effect was negative (−4%) because "empty" or stale data was shown.
Based on these results, an adaptive algorithm was implemented that disables social proof during low traffic. The display rules were thus optimized: the system switched from constant to conditional display, raising average conversion by 7% across the platform and cutting churn in the "night user" segment by 12%. Infrastructure savings reached 15% thanks to switching off stream processing for inactive products.
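The conditional display logic amounts to a simple guard. The specific threshold and staleness limit below are illustrative assumptions for the sketch, not the production configuration:

```python
# Illustrative display rule; thresholds are assumptions, not production values.
MIN_VIEWERS = 30          # below this density the analysis found a negative effect
MAX_STALENESS_SEC = 60    # stale counters erode trust during night hours

def should_show_widget(viewers_online: int, data_age_sec: float) -> bool:
    """Show the widget only when the signal is dense and fresh enough."""
    if data_age_sec > MAX_STALENESS_SEC:
        return False      # suppress outdated data
    return viewers_online > MIN_VIEWERS  # suppress the "empty room" effect

print(should_show_widget(45, 5.0))    # True: dense, fresh signal
print(should_show_widget(3, 5.0))     # False: near-empty room
print(should_show_widget(45, 300.0))  # False: stale counter
```

The guard also yields the infrastructure saving mentioned above: stream processing can be skipped entirely for products whose widget would be suppressed.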
How to separate the effect of the mechanism (intensive margin) from the overall effect of the feature's presence (extensive margin)?
Candidates often confuse reduced-form estimation (the mere presence of the system) with mechanism assessment (how changes in density within treatment affect the outcome). The correct approach is two-stage least squares (2SLS): the first stage predicts the actual frequency of notification displays from the instrument (weather), and the second stage regresses conversion on the fitted frequency. This isolates the pure effect of the notification from herding behavior, which has reverse causality: high conversion attracts more views, which in turn generates more notifications.
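A minimal hand-rolled 2SLS sketch on simulated data (all coefficients and the data-generating process are assumptions) makes the bias concrete: an unobserved demand shock drives both notification frequency and conversion, so naive OLS overstates the effect, while instrumenting frequency with rain recovers it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Hypothetical data-generating process with reverse causality: latent demand
# raises BOTH conversion and notification frequency, biasing naive OLS upward.
demand = rng.normal(0, 1, n)            # unobserved confounder
rain = rng.binomial(1, 0.3, n)          # instrument: shifts activity only
freq = 2.0 + 1.5 * rain + 1.0 * demand + rng.normal(0, 1, n)
true_effect = 0.5
conv = 1.0 + true_effect * freq + 2.0 * demand + rng.normal(0, 1, n)

# Naive OLS slope of conversion on frequency: confounded by demand.
ols = np.cov(freq, conv)[0, 1] / np.var(freq)

# 2SLS by hand: first stage predicts frequency from the instrument,
# second stage regresses conversion on the fitted frequency.
X1 = np.column_stack([np.ones(n), rain])
b1, *_ = np.linalg.lstsq(X1, freq, rcond=None)
freq_hat = X1 @ b1
X2 = np.column_stack([np.ones(n), freq_hat])
b2, *_ = np.linalg.lstsq(X2, conv, rcond=None)

print(f"OLS:  {ols:.2f}")   # overstated by the confounder
print(f"2SLS: {b2[1]:.2f}") # close to the true effect of 0.5
```

In practice one would use a library implementation (e.g. `IV2SLS`) to get correct second-stage standard errors, which the manual version above does not produce.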
Why is correction for multiple testing important when analyzing heterogeneity by density segments and time of day?
Analysts often search for the optimal threshold for enabling the feature by testing the effect at 10, 20, and 50 users and picking the threshold with maximum uplift. This is data mining and inflates the Type I error rate. One should apply the Bonferroni correction (controlling the family-wise error rate) or the Benjamini-Hochberg procedure (controlling the false discovery rate), or fix the hypotheses in a pre-analysis plan before looking at the data. Otherwise the "optimal" threshold may simply be a random outlier in the data.
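Both corrections are a few lines of code. The p-values below are hypothetical stand-ins for tests of the widget effect at different density thresholds and dayparts (a family of six tests); the example shows that Benjamini-Hochberg is less strict than Bonferroni while still controlling a multiplicity-aware error rate:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected while controlling the FDR at `alpha`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears the step-up threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values for six density-threshold / daypart tests:
pvals = [0.001, 0.012, 0.039, 0.041, 0.27, 0.60]

print(benjamini_hochberg(pvals))   # [0, 1] — FDR control
bonferroni = [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)]
print(bonferroni)                  # [0] — stricter FWER control
```

Note that without correction, four of the six tests (p < 0.05) would look "significant".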
How to account for negative spillover to the control group through shared inventory and users' budget constraints?
Social proof in a marketplace creates a demand-pulling effect: if the widget accelerates purchases in the treated group of products, conversion in the control group can fall through budget exhaustion or diverted attention. Candidates often ignore such general equilibrium effects. Adjusting for them requires estimation on data aggregated at the user-session level (aggregate treatment effects) or market equilibrium models that account for limits on user attention.
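A toy arithmetic illustration (all numbers made up) of why product-level uplift can overstate the platform-level effect when the widget mostly shifts, rather than creates, demand:

```python
# Made-up numbers: without the widget, 10% of sessions buy something, split
# evenly between (future) treatment and control SKUs. With the widget, total
# purchasing rises only to 11%, but 60% of purchases shift to treated SKUs.
no_widget_buy_rate = 0.10
widget_buy_rate = 0.11
treated_share_before = 0.50
treated_share_after = 0.60

naive_uplift = (widget_buy_rate * treated_share_after) \
             / (no_widget_buy_rate * treated_share_before) - 1
net_uplift = widget_buy_rate / no_widget_buy_rate - 1

print(f"product-level uplift on treated SKUs: {naive_uplift:.0%}")  # 32%
print(f"session-level (net) uplift:           {net_uplift:.0%}")    # 10%
```

The product-level comparison attributes the cannibalized control-group purchases to the widget; aggregating to the session level removes that spillover from the estimate.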