Historical context. The co-browsing concept migrated from the B2B sector (customer support) to social commerce (e.g., 'Shop Together' features in mobile apps). Traditional analytics relied on the SUTVA (Stable Unit Treatment Value Assumption), which assumes user independence. However, social features violate this assumption, as the actions of one user can influence the behavior of their connections, rendering classical A/B testing methodologically incorrect.
Problem statement. Standard average comparisons (difference-in-means) provide biased estimates due to interference: control group users invited by friends from the test group alter their behavior, creating spillover effects. Self-selection based on social activity distorts the distribution of covariates, and staggered rollout introduces temporal confounders, such as seasonality and novelty effects that correlate with the timing of cohort participation.
Detailed solution. Apply cluster randomization (a cluster randomized trial) at the level of the social network graph, using community detection algorithms (Louvain or Leiden) to form clusters with minimal connectivity between them. If full randomization is infeasible, use difference-in-differences with staggered implementation (staggered DiD), correcting for heterogeneous effects with the Callaway-Sant’Anna or Sun-Abraham estimators, which avoid the negative weights that arise when already-treated cohorts serve as controls. To isolate the direct effect from the network effect, apply exposure modeling: measure the degree of 'infection' of each control user as the share of their friends in the test group and include it as a covariate in the regression, or use 2SLS (two-stage least squares) with an instrumental variable (feature availability by geographic cluster as an IV for actual usage). For analyzing time to conversion, a Cox model with frailty effects (shared frailty model) that accounts for risk clustering within social groups is suitable.
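The cluster-randomization step can be sketched as follows. The communities here are hard-coded toy data standing in for the output of a Louvain/Leiden pass (which in practice would come from a library such as networkx or igraph), and the share of friendship edges crossing treatment arms serves as a simple contamination diagnostic:

```python
# Sketch: cluster-level randomization given precomputed communities.
# The communities below are toy data; in practice they would be produced
# by a community-detection step (e.g. Louvain or Leiden).
import random

random.seed(0)

# Toy data: communities (lists of user ids) and friendship edges.
communities = [[0, 1, 2], [3, 4, 5, 6], [7, 8], [9, 10, 11]]
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (5, 6), (7, 8),
         (9, 10), (10, 11), (2, 3), (6, 7)]  # last two cross communities

# Randomize treatment at the community level, not per user.
assignment = {}
for comm in communities:
    arm = random.randint(0, 1)  # 1 = test, 0 = control, whole cluster together
    for user in comm:
        assignment[user] = arm

# Diagnostic: share of friendship edges crossing arms is a contamination proxy;
# good clustering keeps it low.
cross = sum(assignment[u] != assignment[v] for u, v in edges)
print(f"cross-arm edge share: {cross / len(edges):.0%}")
```

Because assignment happens per community, every user in a cluster lands in the same arm, so spillover can only travel along the (rare) cross-cluster edges.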
Problem description. A marketplace launched a 'Buy Together' feature that allows two users to simultaneously browse a catalog and edit a shared cart in real time. A pilot on 10% of the audience showed an 8% increase in conversion, but the team suspected overestimation: control group users received invitations from friends in the test group, creating inter-group contamination. Additionally, the feature was primarily used by those who already had established social ties (self-selection based on engagement).
Option 1: Simple 'before/after' comparison on the adopter group. This approach compares the metrics of users who started using co-browsing with their historical data or with similar users without the feature. The advantages are obvious: the calculation takes minutes, it is easily interpreted by the business, and it requires no complex experimental infrastructure. However, the downsides are critical: the method completely ignores seasonality and maturation effects, and it suffers from self-selection bias, since socially active users have a higher baseline conversion rate to begin with.
Option 2: Intent-to-Treat (ITT) analysis with randomization of button availability. Here, we randomly grant different cohorts the ability to invite friends, regardless of whether they use it, and compare the final metrics. Advantages include preserving the statistical randomness of assignment and the ability to assess the overall effect of the launch policy, including network externalities. Disadvantages stem from dilution of the effect due to noncompliance: many users will gain access but never use the feature, requiring a 3-4x larger sample; moreover, ITT does not answer the question of effectiveness for actual users (TOT).
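The dilution argument can be made concrete with back-of-envelope arithmetic (a sketch, not a full power calculation): if only a fraction c of the offered cohort actually uses the feature, the ITT effect shrinks to roughly c × TOT, and since the required sample size scales inversely with the squared effect, n inflates by about 1/c²:

```python
# Sketch: ITT dilution under partial uptake. With compliance rate c,
# ITT effect ~= c * TOT, and since required n scales with 1 / effect^2,
# the sample must grow by roughly 1 / c^2.
for c in (1.0, 0.6, 0.5):
    inflation = 1 / c**2
    print(f"compliance {c:.0%} -> ITT is {c:.0%} of TOT, sample x{inflation:.1f}")
```

At 50-60% uptake this gives roughly a 2.8-4x sample inflation, which is where the "3-4 times" figure above comes from.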
Option 3: Regression Discontinuity Design (RDD) based on a threshold number of friends. The method uses a sharp threshold (e.g., 5 friends) for activating the feature, creating a quasi-experiment around the cutoff point. Advantages include local randomness of assignment near the threshold and no need to randomize the entire audience. However, there are significant downsides: the effect is local to 'marginal' users near the cutoff, manipulation is possible (users gaming the threshold by adding fake friends), and the method does not resolve contamination between users on different sides of the threshold if they are connected.
Chosen solution and justification. Option 2 with cluster randomization was chosen: analysts constructed a social connections graph, applied the Louvain algorithm to highlight dense communities, and randomized access at the community level rather than the user level. This minimized contamination between test and control. For estimation, they used an exposure model: for each user, they calculated the share of friends in test clusters (spillover intensity) and included it as a regressor. This allowed the separation of the direct effect of the feature and the indirect influence through social proof.
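A minimal sketch of that exposure regression on synthetic data: the true effect sizes are planted to mirror the case's numbers, exposure (share of friends in test clusters) is, for simplicity, varied only in the control arm, and plain least squares stands in for what would in practice be a statsmodels or fixed-effects model with cluster-robust standard errors:

```python
# Sketch: exposure model separating the direct effect from spillover.
# conversion ~ b0 + b1*treated + b2*exposure, where exposure is the share
# of a user's friends sitting in treated clusters. Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
treated = rng.integers(0, 2, n)                           # cluster assignment, flattened
exposure = np.where(treated, 0.0, rng.uniform(0, 1, n))   # spillover intensity (control only)

# Planted truth: 10% baseline, +3.2pp direct effect, +1.8pp at full exposure.
p = 0.10 + 0.032 * treated + 0.018 * exposure
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), treated, exposure])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"direct effect: {beta[1]:+.3f}, spillover at full exposure: {beta[2]:+.3f}")
```

The coefficient on `treated` recovers the direct effect while the coefficient on `exposure` captures the social-proof channel, which is exactly the decomposition the analysts needed.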
Final result. The true direct effect (TOT) was +3.2% in conversion (versus 8% in the raw estimate). However, a significant positive spillover onto the control group (+1.8%) was identified, attributable to social influence from the invitations. The overall policy effect (ITT) came to +2.1%. Without accounting for the network effects, the team would have underestimated the feature's value and dismissed the project as 'not effective enough'; accounting for the spillover showed the feature paid off within 4 months.
1. Why does the standard A/B test provide biased estimates for social features? The standard test assumes SUTVA: the impact on one user does not influence others. In co-browsing, this is violated: a control user receiving an invitation from a test user changes behavior (spillover), creating interference bias. ATE (Average Treatment Effect) estimates become a weighted mix of direct and indirect effects, often tending towards zero. Solution: use cluster randomization (randomization at network-cluster level) or inverse probability weighting methods to adjust for network structure.
2. How to statistically separate direct effect, spillover effect, and total effect? Candidates confuse ITT (Intent-to-Treat) and TOT (Treatment-on-Treated): ITT estimates the effect of offering the feature to the entire cohort, including those who did not use it, while TOT isolates the effect for actual users. To separate effects, apply Principal Stratification: classify users by compliance types (compliers, always-takers) and estimate CACE (Complier Average Causal Effect). Spillover is estimated through exposure mapping, where spillover intensity is proxied by the share of connections in the test. The total effect is a weighted sum of direct and indirect effects based on the exposure distribution.
3. Why is standard DiD (Difference-in-Differences) incorrect for staggered rollout? In a staggered implementation, already-treated early cohorts end up serving as controls for later adopters, and with heterogeneous, dynamic treatment effects the two-way fixed-effects regression assigns some of these comparisons negative weights. A classic two-way fixed-effects DiD in such a design therefore yields biased estimates, as it mixes effects from different periods and cohorts with incorrect (possibly negative) weights. Instead, one should use the Callaway-Sant’Anna or Sun-Abraham estimators, which use only never-treated or not-yet-treated observations as controls. An alternative is the Synthetic Control Method for each cohort separately, built on a donor pool of never-treated groups.
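The Callaway-Sant'Anna idea can be sketched on a toy balanced panel: ATT(g, t) compares the outcome change of cohort g (first treated at period g) between its last pre-treatment period and period t against the same change among not-yet-treated cohorts (including never-treated). All panel means below are fabricated illustration data:

```python
# Sketch: Callaway-Sant'Anna-style ATT(g, t) with not-yet-treated controls.
# Y maps (cohort, period) -> mean outcome; cohort = first treated period,
# None = never treated. Toy numbers for illustration only.
Y = {
    (2, 1): 1.0, (2, 2): 2.0, (2, 3): 3.0,
    (3, 1): 1.1, (3, 2): 1.6, (3, 3): 3.1,
    (None, 1): 0.9, (None, 2): 1.4, (None, 3): 1.9,
}

def att(g, t, cohorts=(2, 3, None)):
    """ATT(g, t): change for cohort g from period g-1 to t, minus the same
    change averaged over cohorts not yet treated at t (adoption > t or never)."""
    controls = [c for c in cohorts if c is None or c > t]
    ctrl_change = sum(Y[(c, t)] - Y[(c, g - 1)] for c in controls) / len(controls)
    return (Y[(g, t)] - Y[(g, g - 1)]) - ctrl_change

print(f"ATT(2,2) = {att(2, 2):+.2f}")  # cohort 2 at adoption
print(f"ATT(3,3) = {att(3, 3):+.2f}")  # cohort 3 at adoption
```

The key property is visible in the `controls` filter: cohort 2, once treated, is never used as a control for cohort 3, which is precisely the comparison that poisons the two-way fixed-effects regression.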