Answer to the question

Traditional pricing analysis in e-commerce has long relied on simple correlation analyses or short A/B tests to assess changes in delivery thresholds. As causal inference theory has matured, it has become clear that an abrupt change in delivery policy for the entire user base creates problems of endogeneity and temporal dynamics. Modern product analytics therefore calls for quasi-experimental methods such as the Synthetic Control Method (SCM) and Bayesian Structural Time Series (BSTS). SCM was originally developed to evaluate regional and macroeconomic policy interventions, and both methods have been successfully applied to digital products with high metric volatility.

Raising the free shipping threshold makes it difficult to identify the local average treatment effect (LATE). Users with high purchase readiness change their behavior by topping up their carts to reach the threshold, while marginal users postpone purchases or switch to competitors. A classical before-after analysis gives a biased estimate because of seasonality, inflation, and competitors' campaigns. Intertemporal substitution adds to the problem: users consolidate purchases over time, creating an artificial spike in average check size that does not reflect a true increase in demand, so the temporal structure of the response has to be modeled.

The optimal approach combines the Synthetic Control Method at the level of aggregated user cohorts with Regression Discontinuity Design (RDD) for a local estimate of the effect on marginal consumers. For SCM, a weighted combination of geographic regions or segments with similar historical dynamics is constructed to reproduce the trend of the target group before the intervention, with weights chosen by the Abadie-Diamond-Hainmueller optimization procedure. For RDD, transactions in a narrow band around the threshold are analyzed (with the bandwidth selected by the Imbens-Kalyanaraman algorithm), which isolates the pure effect of the incentive. In addition, CausalImpact, which is built on BSTS, is used to track the deviation from the synthetic trend over time, and statistical significance is assessed through permutation (placebo) tests on historical data.
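
The weight search behind a synthetic control can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the full Abadie-Diamond-Hainmueller procedure: it matches only on pre-period outcomes (no covariates, no predictor-importance matrix), it uses a Frank-Wolfe search as a stand-in optimizer, and the function names and the weekly revenue indices are hypothetical.

```python
def scm_weights(treated_pre, donors_pre, n_iter=2000):
    """Frank-Wolfe search for simplex weights that minimise the squared
    pre-period gap between the treated series and the weighted donors."""
    n_donors, n_periods = len(donors_pre), len(treated_pre)

    def combine(coefs):
        # Weighted sum of donor series, period by period.
        return [sum(coefs[j] * donors_pre[j][t] for j in range(n_donors))
                for t in range(n_periods)]

    w = [1.0 / n_donors] * n_donors  # start from equal weights
    for _ in range(n_iter):
        resid = [s - y for s, y in zip(combine(w), treated_pre)]
        # Gradient of the squared error with respect to each weight.
        grad = [2.0 * sum(r * x for r, x in zip(resid, donors_pre[j]))
                for j in range(n_donors)]
        best = min(range(n_donors), key=lambda j: grad[j])
        # Move towards the vertex that puts all weight on the best donor.
        direction = [(1.0 if j == best else 0.0) - w[j] for j in range(n_donors)]
        shift = combine(direction)
        denom = sum(v * v for v in shift)
        if denom == 0.0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        step = min(1.0, max(0.0, -sum(r * v for r, v in zip(resid, shift)) / denom))
        if step == 0.0:
            break
        w = [wj + step * dj for wj, dj in zip(w, direction)]
    return w


def sse(weights, treated_pre, donors_pre):
    """Sum of squared pre-period gaps for a given weight vector."""
    fitted = [sum(wj * d[t] for wj, d in zip(weights, donors_pre))
              for t in range(len(treated_pre))]
    return sum((f - y) ** 2 for f, y in zip(fitted, treated_pre))


# Hypothetical weekly revenue indices for 8 pre-change weeks.
donors = [
    [100, 104, 103, 108, 112, 110, 115, 118],  # donor cohort A
    [90, 91, 95, 94, 99, 103, 102, 106],       # donor cohort B
    [120, 118, 117, 119, 116, 115, 117, 114],  # donor cohort C
]
treated = [96, 99, 100, 102, 107, 107, 110, 113]

weights = scm_weights(treated, donors)
print([round(x, 3) for x in weights])
print(sse(weights, treated, donors), "vs equal weights:",
      sse([1 / 3] * 3, treated, donors))
```

The weights stay non-negative and sum to one by construction, which is what makes the synthetic cohort a convex combination of donors. In practice a dedicated SCM package would handle covariates and diagnostics.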

Real-life Situation

A large fashion marketplace decided to raise the threshold for free shipping from 1500₽ to 2500₽ for its entire audience in Russia simultaneously. The product team noted a 22% increase in average check size in the first two weeks, but the CFO doubted the sustainability of this effect, fearing a loss of valuable users and cannibalization of future sales through delayed purchase mechanisms. The analyst faced the challenge of separating the true causal effect from the noise of seasonal sales and changes in competitor behavior, which launched parallel promotions on delivery.

The first option considered was a simple comparison of metrics for 30 days before and 30 days after the change, using a t-test and reporting uplift as a percentage. Pros: it can be implemented within a day and is easy to explain to top management without going into statistics. Cons: it completely ignores the upward seasonal trend (the start of the spring collection), has no control for external shocks (a competitor's advertising campaign), and cannot capture the dynamic effect of cart accumulation, which leads to overestimating the effect by 40-60%.

The second option was Geographic Difference-in-Differences, using regions with no change to the threshold (e.g., remote areas with logistical constraints) as a control group. Pros: natural variation and the ability to capture regional differences in price sensitivity through fixed effects. Cons: user migration between cities violates SUTVA, and the parallel trends assumption is critically violated by significant differences in the competitive environment between the capitals and the regions, which makes the control group systematically incomparable.
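
The gap between the first two options can be shown with a toy calculation. This is a sketch with hypothetical average check values per week; it assumes the control regions share the treated regions' trend, which is exactly the assumption questioned above.

```python
from statistics import mean


def before_after(treated_pre, treated_post):
    """Naive uplift: change in the treated group's mean."""
    return mean(treated_post) - mean(treated_pre)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Treated change minus control change (valid only under parallel trends)."""
    return before_after(treated_pre, treated_post) - before_after(control_pre, control_post)


# Hypothetical weekly average check, in roubles.
treated_pre = [2000, 2040, 2020, 2060]
treated_post = [2450, 2480, 2500, 2490]
control_pre = [1800, 1820, 1810, 1850]
control_post = [2000, 2030, 2040, 2010]

naive = before_after(treated_pre, treated_post)
did = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(naive, did)  # 450 250
```

In this toy data the control regions also grew by 200₽, so the naive comparison attributes seasonal growth to the threshold change.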

The third option was the Synthetic Control Method at the level of user cohorts formed by historical purchase frequency and average check size, built on data from the 12 months before the change. Pros: an optimally weighted set of "donor" segments that accounts for seasonality, day of the week, and trends through convex combinations, and the ability to validate fit quality visually over the pre-treatment period. Cons: it requires a long data history (at least 10-15 periods), is sensitive to structural breaks such as pandemic-era behavioral changes, and produces weights that are hard to explain to the business.

A combined solution was chosen: SCM for assessing the overall effect on revenue and RDD with a second-degree local polynomial for assessing the effect on marginal users in the band of 2300-2700₽. This allowed separating the "top-up" effect (basket augmentation) from the "churn" effect and correctly accounting for seasonality through a Bayesian structural time series model integrated into CausalImpact.
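
The RDD part of the combined solution can be sketched as follows. The example fits a separate second-degree polynomial on each side of the 2500₽ threshold within the 2300-2700₽ band and reads the jump off the two intercepts. The outcome (for example, the share of carts that end in a completed order), the toy data, and the function names are assumptions for illustration; there is no kernel weighting or data-driven bandwidth selection here.

```python
def fit_quadratic(xs, ys):
    """OLS fit of y = b0 + b1*x + b2*x^2 through the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def rdd_jump(amounts, outcomes, cutoff=2500.0, half_width=200.0):
    """Difference of the two fitted values at the cutoff."""
    left, right = [], []
    for x, y in zip(amounts, outcomes):
        z = (x - cutoff) / half_width  # centred and scaled running variable
        if -1.0 <= z < 0.0:
            left.append((z, y))
        elif 0.0 <= z <= 1.0:
            right.append((z, y))
    b_left = fit_quadratic(*zip(*left))
    b_right = fit_quadratic(*zip(*right))
    # With a centred running variable the intercept is the value at the cutoff.
    return b_right[0] - b_left[0]


def toy_outcome(x):
    # Noise-free toy outcome with a built-in jump of 0.08 at the threshold.
    z = (x - 2500) / 200
    return 0.30 + 0.10 * z + 0.05 * z * z + (0.08 if x >= 2500 else 0.0)


amounts = list(range(2300, 2701, 10))  # cart values in roubles
outcomes = [toy_outcome(x) for x in amounts]
jump = rdd_jump(amounts, outcomes)
print(round(jump, 4))
```

Scaling the running variable to [-1, 1] keeps the normal equations well conditioned. On real data a dedicated RDD package with robust standard errors would be the safer choice.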

The final result showed that the observed 22% increase in check size overstated the effect roughly twofold: the true incremental effect was 11%, of which 6 percentage points were attributable to temporary demand shifting (intertemporal substitution) and 5 percentage points to a true increase in basket size. The analysis also revealed a segment of "delivery-sensitive" users (15% of the base) with an increased churn of 8% and a 12% decrease in order frequency. This made it possible to adjust the policy by introducing a hybrid threshold of 1990₽ for the low check segment with high historical return rates, mitigating the negative effect on retention.

What Candidates Often Miss

How do you correctly account for cart pooling and intertemporal substitution of purchases when assessing the dynamic effect of a delivery threshold, if users strategically delay conversion?

Answer: The temporal structure of decision-making has to be modeled through survival analysis (a Cox proportional hazards model) or inter-purchase time analysis. The key metric is then not point conversion but the change in the hazard rate of purchasing as a function of the current cart amount and the distance to the threshold. In addition, user cohorts that reached the threshold through top-ups should be checked for an increased rate of product returns within 14 days (return cannibalization), which distorts the GMV metric and requires a return-rate adjustment in the model.
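
As a rough illustration of the hazard-based view, the sketch below computes an empirical per-day purchase hazard and a Kaplan-Meier survival curve for inter-purchase times with right censoring. It is a simpler non-parametric stand-in, not the Cox model named above, and the data and function name are hypothetical.

```python
def km_curve(durations, observed):
    """Return [(t, hazard_t, survival_t)] for each distinct event time.

    durations: days until the next order (or until observation ended)
    observed:  1 if the next order was seen, 0 if the user was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        hazard = events / at_risk      # share of at-risk users who buy at time t
        survival *= 1.0 - hazard       # probability of not having bought yet
        curve.append((t, hazard, survival))
    return curve


# Hypothetical cohort: days to next order, with three censored users.
durations = [1, 2, 2, 3, 4, 4, 5, 5]
observed = [1, 1, 0, 1, 1, 1, 0, 0]

for t, hazard, survival in km_curve(durations, observed):
    print(t, round(hazard, 3), round(survival, 3))
```

Running the same function on cohorts before and after the threshold change, or on users grouped by distance to the threshold, shows whether purchases are being delayed rather than lost.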

Why are standard confidence intervals incorrect for the Synthetic Control Method, and how should statistical significance of causal effects be assessed in this methodology?

Answer: In SCM there is typically a single treated unit and a small donor pool, and the estimate carries additional uncertainty from the choice of donor weights, so the large-sample and independence assumptions behind classical frequentist confidence intervals do not hold. The correct approach is a permutation (placebo) test: the same SCM algorithm is applied to each donor unit in the pool as if it had received the treatment, which yields an empirical distribution of placebo effects. The effect is deemed statistically significant at the 5% level if the post/pre-RMSPE ratio for the treated unit exceeds the 95th percentile of the placebo distribution, as formalized by Abadie, Diamond, and Hainmueller (2010, 2015).
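
The placebo procedure reduces to ranking post/pre-RMSPE ratios. The sketch below assumes the gaps (actual minus synthetic) have already been computed for the treated unit and for every placebo run; the unit names and numbers are hypothetical.

```python
from math import sqrt


def rmspe(gaps):
    """Root mean squared prediction error of a list of gaps."""
    return sqrt(sum(g * g for g in gaps) / len(gaps))


def placebo_p_value(gaps_by_unit, treated, n_pre):
    """gaps_by_unit maps unit -> gaps (actual - synthetic), pre-period first."""
    ratios = {unit: rmspe(gaps[n_pre:]) / rmspe(gaps[:n_pre])
              for unit, gaps in gaps_by_unit.items()}
    # Share of units whose ratio is at least as extreme as the treated one.
    rank = sum(1 for r in ratios.values() if r >= ratios[treated])
    return rank / len(ratios), ratios


# Hypothetical gaps: 4 pre-periods followed by 3 post-periods.
gaps_by_unit = {
    "treated": [1, -1, 1, -1, 10, 12, 11],
    "donor_a": [2, -2, 1, -1, 1, -2, 2],
    "donor_b": [1, 1, -1, -1, 2, -1, 0],
    "donor_c": [3, -3, 3, -3, -4, 3, -2],
}

p_value, ratios = placebo_p_value(gaps_by_unit, "treated", n_pre=4)
print(p_value)  # 0.25
```

With J donors the smallest attainable p-value is 1/(J+1), so the three donors here cannot give anything below 0.25. Reaching the 5% level requires at least 19 donor units.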

How do you distinguish the effect of changing the delivery threshold from simultaneous changes in traffic quality or competitive activity when using CausalImpact or Synthetic Control?

Answer: It is critically important to include covariates (predictors) that correlate with the target metric but are not themselves affected by the intervention, for example traffic to competitors' websites (from SimilarWeb or panel data), total e-commerce market size in the region, or organic traffic CTR. In BSTS, the Bayesian model underlying CausalImpact, these variables enter the state-space model as regressors and absorb common shocks. It is also necessary to test Granger causality between predictors and the outcome in the pre-intervention period and to run placebo-in-time tests, shifting the "treatment" date to historical periods to check for false positives.
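
A placebo-in-time check can be sketched with a much simpler counterfactual model than BSTS. The example below assumes a single covariate (a hypothetical market index) and a one-regressor OLS fit in place of the state-space model; the data are noise-free toy values and the function names are illustrative.

```python
from statistics import mean


def fit_line(x, y):
    """Simple OLS of y on one covariate; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def mean_gap(y, x, start, horizon):
    """Fit on periods before `start`, then average (actual - predicted)
    over the next `horizon` periods."""
    intercept, slope = fit_line(x[:start], y[:start])
    return mean(y[t] - (intercept + slope * x[t])
                for t in range(start, start + horizon))


# Hypothetical market index (unaffected by the intervention) and revenue.
market = [100, 102, 101, 105, 107, 106, 110, 112, 111, 115, 117, 116]
true_start = 9
revenue = [2 * m + 10 + (30 if t >= true_start else 0)
           for t, m in enumerate(market)]

real_effect = mean_gap(revenue, market, true_start, horizon=3)
# Pretend the change happened earlier; these gaps should be close to zero.
placebo_effects = [mean_gap(revenue, market, s, horizon=3) for s in (4, 6)]
print(round(real_effect, 2), [round(g, 2) for g in placebo_effects])
```

If the fake dates produce gaps comparable to the real one, the model is picking up something other than the threshold change, such as a trend the covariates do not explain.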