Historical Context. Since the early 2020s, e-commerce has shifted its focus from same-day delivery toward sustainable logistics, where order consolidation reduces both the carbon footprint and last-mile costs. Early experiments such as Amazon Day and similar services showed that voluntary delivery consolidation attracts users with low purchase urgency, and this self-selection creates endogeneity when evaluating the effect on product metrics. User-level A/B testing is not applicable to mandatory consolidation, because logistics infrastructure requires route optimization across whole zones rather than for individual users.
Problem Statement. When a consolidation system is introduced (e.g., delivery only on Tuesdays and Fridays), treatment is not randomly assigned: users in the rollout zones differ systematically in distance from warehouses and in time preference. There is also a risk of spatial spillover, where users switch their delivery address to a workplace or to relatives in neighboring regions without consolidation, which violates SUTVA (the Stable Unit Treatment Value Assumption). Demand seasonality and the fact that launches coincide with logistics optimization in high-income regions further distort the estimate of the true causal effect.
Detailed Solution. To isolate the effect, staggered Difference-in-Differences (DiD) is applied to the gradual rollout across logistics zones: zones not yet treated serve as controls for zones already treated, and each zone's pre-implementation periods provide its baseline. It is crucial to test the parallel trends assumption with an event-study analysis of the metrics before implementation, confirming that future treatment and control groups show no differential trends. In addition, a Synthetic Control is constructed for each zone from donor regions that have similar historical order dynamics and no planned implementation; this simulates the counterfactual and makes the estimates more robust.
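The event-study check can be sketched in a few lines of Python. This is a simplified illustration, not a full staggered-DiD estimator: the function name `event_study`, the weekly zone-level data layout, and the sample numbers are all assumptions made for the example. For each treated zone it compares the change since the week before rollout with the same change in zones that are still untreated at the end of the window.

```python
from collections import defaultdict
from statistics import mean


def event_study(outcomes, rollout_week, window=2):
    """Average DiD estimate by event time (weeks relative to rollout).

    outcomes:     {zone: [metric value per calendar week]}
    rollout_week: {zone: index of first treated week, or None if never treated}
    """
    by_event_time = defaultdict(list)
    for zone, start in rollout_week.items():
        if start is None or start < 1:
            continue  # never treated, or no pre-rollout week to use as baseline
        last = start + window
        # Controls: zones still untreated at the end of this zone's window.
        controls = [z for z, s in rollout_week.items() if s is None or s > last]
        if not controls:
            continue
        base = start - 1  # reference period: the week before rollout
        for k in range(-window, window + 1):
            t = start + k
            if t < 0 or t >= len(outcomes[zone]):
                continue
            treated_change = outcomes[zone][t] - outcomes[zone][base]
            control_change = mean(
                outcomes[c][t] - outcomes[c][base] for c in controls
            )
            by_event_time[k].append(treated_change - control_change)
    return {k: mean(v) for k, v in sorted(by_event_time.items())}


# Hypothetical data: a shared upward trend, different zone levels,
# and a built-in effect of -2 from the rollout week onward.
outcomes = {
    "A": [10, 11, 12, 11, 12, 13, 14, 15],  # rollout in week 3
    "B": [20, 21, 22, 23, 24, 23, 24, 25],  # rollout in week 5
    "C": [5, 6, 7, 8, 9, 10, 11, 12],       # never treated
}
rollout_week = {"A": 3, "B": 5, "C": None}

effects = event_study(outcomes, rollout_week)
for k, estimate in effects.items():
    print(f"event time {k:+d}: {estimate}")
```

Estimates close to zero at negative event times are consistent with parallel trends; in the made-up data they are exactly zero, and the post-rollout estimates recover the built-in effect of -2. In practice a regression-based estimator with standard errors would replace this averaging.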
To adjust for partial compliance, IV (Instrumental Variables) regression is used. The instrument (Z) is whether the user belongs to an implementation zone (assignment), which predicts actual use of consolidation (D); the outcome (Y) is retention or purchase frequency. This yields the LATE (Local Average Treatment Effect), the effect for compliers who changed behavior because of the implementation, as opposed to the ITT (Intent-to-Treat), which measures the effect of offering the service. Heterogeneity analysis by product category (impulse vs. stock-up goods) helps separate a true decline in demand from intertemporal substitution.
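With a single binary instrument and no covariates, the IV estimate reduces to the Wald ratio: the ITT divided by the first-stage difference in uptake. The sketch below illustrates this under that simplification; the function name `wald_late`, the `(z, d, y)` record layout, and the sample numbers are assumptions for the example. It also takes for granted that zone assignment affects the outcome only through use of consolidation, which would need to be defended in a real analysis.

```python
from statistics import mean


def wald_late(records):
    """Return (ITT, first stage, LATE) from (z, d, y) tuples.

    z: 1 if the user is in an implementation zone, else 0 (instrument)
    d: 1 if the user actually used consolidation, else 0 (treatment)
    y: outcome, e.g. orders per month
    """
    y_assigned = mean(y for z, _, y in records if z == 1)
    y_control = mean(y for z, _, y in records if z == 0)
    d_assigned = mean(d for z, d, _ in records if z == 1)
    d_control = mean(d for z, d, _ in records if z == 0)

    itt = y_assigned - y_control          # effect of offering the service
    first_stage = d_assigned - d_control  # share of compliers
    if first_stage == 0:
        raise ValueError("instrument does not move uptake; LATE is undefined")
    return itt, first_stage, itt / first_stage


# Hypothetical sample: 30% uptake inside the zone, none outside it.
records = (
    [(1, 1, 2)] * 3     # in zone, used consolidation, 2 orders per month
    + [(1, 0, 4)] * 7   # in zone, ignored it, 4 orders per month
    + [(0, 0, 4)] * 10  # outside the zone, 4 orders per month
)

itt, first_stage, late = wald_late(records)
print(f"ITT={itt:.2f}  first stage={first_stage:.2f}  LATE={late:.2f}")
```

Adding covariates or zone fixed effects would call for two-stage least squares rather than this ratio.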
A home appliance marketplace launched a delivery consolidation pilot in three major cities, aiming to reduce logistics costs by 30%. Analysts ran into bias when comparing users who agreed to consolidation (treatment) with non-adopters (control): adopters historically had lower purchase frequency and higher average order values, which points to self-selection of buyers who plan ahead. A naive comparison would show a spurious decline in retention even if actual behavior were stable, because the gap reflects selection bias rather than the effect of consolidation.
First Option: a direct comparison of metrics before and after implementation (pre-post analysis) within the zone. The pros are simplicity and speed, with no need to collect data from other regions. The cons are obvious: the effect of consolidation cannot be separated from seasonal demand fluctuations and general growth of the user base, so the estimate is systematically biased whenever the launch coincides with holiday periods or promotional campaigns.
Second Option: a cross-sectional comparison of zones with and without implementation on a fixed date. The advantages are that observing both groups on the same date holds common time shocks constant, and that no long history is needed for the control regions. The disadvantage is that regions were chosen for implementation based on high order density and audience loyalty, which creates significant selection bias and makes the groups incomparable in their initial characteristics.
Third Option: staggered DiD combined with propensity score matching and Synthetic Control. The pros are that regions without implementation act as the control group, region and time fixed effects absorb stable regional differences and common shocks, and matching improves the comparability of pre-treatment characteristics. The cons are the difficulty of validating the parallel trends assumption when effects vary over time, and the risk of spatial correlation between neighboring zones, where users might change delivery addresses.
Chosen Solution and Outcome: The third approach was selected, supplemented by IV regression at the boundaries of logistics zones (an RDD-style boundary analysis) for local validity. This isolated the effect from regional differences in purchasing behavior and service levels. The analysis showed that the true effect of consolidation is an 8% decrease in transaction frequency (not 15%, as the naive analysis suggested) together with a 22% increase in average order value from merging small orders. Retention stayed at the control group level, which justified scaling the feature to other regions with a predictable economic impact.
As a result of the implementation, the company reduced logistics costs by 35% through route optimization, and the increase in average order value more than compensated for the decline in order frequency (0.92 × 1.22 ≈ 1.12, or roughly 12% more revenue per user, assuming revenue scales as frequency times order value). A forecasting model based on the estimated coefficients made it possible to calculate the breakeven point for launching in new regions with varying population density. The methodology was adopted as the standard for assessing logistics innovations where traditional A/B testing is infeasible.
How to distinguish true reductions in purchase frequency from intertemporal substitution, when users simply postpone purchases until the next delivery window?
Candidates often ignore the dynamic nature of demand and treat a drop in frequency within one month as equivalent to losing the customer. User cohorts should be analyzed over a long horizon (180+ days), and product categories should be distinguished: for perishable or impulse items (snacks, accessories), postponement amounts to a lost sale, while for planned purchases (household appliances) it is merely a shift in time. Methodologically, distributed lag models can be used, or stockpiling behavior can be measured through days of inventory at home, calculated from the purchase history of regularly consumed categories. If the total quantity of items over 90 days decreases, this is demand loss; if it remains steady while the interval between orders grows, this is substitution.
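The 90-day rule of thumb can be expressed as a small per-user classifier. This is a minimal sketch under stated assumptions: the function name `classify_user`, the `(day, quantity)` order layout, and the 5% tolerance are hypothetical, and the average interval is approximated as window length divided by number of orders.

```python
def classify_user(pre_orders, post_orders, window_days=90, tolerance=0.05):
    """Compare equal-length windows before and after consolidation.

    Each window is a list of (day, quantity) tuples for one user in a
    regularly consumed category. The tolerance is an assumed noise band.
    """
    pre_qty = sum(qty for _, qty in pre_orders)
    post_qty = sum(qty for _, qty in post_orders)
    if pre_qty == 0:
        return "no baseline"

    # Total quantity fell beyond the noise band: demand was lost.
    if post_qty < pre_qty * (1 - tolerance):
        return "demand loss"

    # Quantity held steady; check whether orders became less frequent.
    pre_interval = window_days / len(pre_orders)
    post_interval = window_days / len(post_orders)
    if post_interval > pre_interval:
        return "substitution"
    return "no change"


# Hypothetical user: six single-item orders before consolidation.
pre = [(0, 1), (15, 1), (30, 1), (45, 1), (60, 1), (75, 1)]
post_shifted = [(5, 3), (47, 3)]  # same quantity, fewer and larger orders
post_lost = [(5, 1), (47, 1)]     # fewer orders and less quantity

print(classify_user(pre, post_shifted))
print(classify_user(pre, post_lost))
```

Aggregating these labels by product category shows where postponement turns into lost demand and where it is only a timing shift.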
How to account for spatial contamination (spillover effects), when users switch their delivery address to a workplace or to friends in a neighboring zone without consolidation in order to receive goods faster?
Standard DiD assumes that treatment does not influence the control group, but in practice users from treatment zones may use addresses in control zones for urgent orders, which inflates control metrics. One solution is a geographic filter: analyze only users with a stable home address (more than 6 months without changes) and exclude hybrid orders (delivery to another zone). Alternatively, use spatial DiD in which spillover exposure is modeled with weights inversely proportional to the distance from the zone boundary, or analyze only areas more than 50 km from the boundaries (a donut design, as in donut RDD), where spillover is minimal.
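The geographic filter and the optional donut exclusion can be sketched as a single pass over order records. The `Order` fields, the function name `filter_orders`, and the sample rows are assumptions for illustration; the 180-day and 50 km thresholds mirror the figures mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Order:
    user_id: str
    home_zone: str            # zone of the user's registered home address
    delivery_zone: str        # zone the order was actually delivered to
    address_tenure_days: int  # days since the home address last changed
    km_to_boundary: float     # distance from home address to nearest zone boundary


def filter_orders(orders, min_tenure_days=180, donut_km=None):
    """Keep orders from stable-address users delivered within their home zone.

    If donut_km is given, also drop users living within that distance of a
    zone boundary, where spillover is most likely.
    """
    kept = []
    for order in orders:
        if order.address_tenure_days <= min_tenure_days:
            continue  # address changed too recently to count as stable
        if order.delivery_zone != order.home_zone:
            continue  # hybrid order: delivered to another zone
        if donut_km is not None and order.km_to_boundary <= donut_km:
            continue  # inside the excluded band around the boundary
        kept.append(order)
    return kept


# Hypothetical records.
orders = [
    Order("u1", "T1", "T1", 400, 80.0),
    Order("u2", "T1", "T1", 90, 60.0),   # recent address change
    Order("u3", "T1", "C1", 500, 70.0),  # delivered across the boundary
    Order("u4", "C1", "C1", 365, 12.0),  # stable, but close to the boundary
]

stable_ids = [o.user_id for o in filter_orders(orders)]
donut_ids = [o.user_id for o in filter_orders(orders, donut_km=50.0)]
print(stable_ids, donut_ids)
```

Filtering on a pre-treatment address history matters here: conditioning on addresses changed after rollout would select on a consequence of the treatment itself.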
How to correctly interpret the difference between ITT (Intent-to-Treat) and LATE (Local Average Treatment Effect) in the context of partial compliance, when not all users in the implementation zone utilize consolidation?
Candidates often confuse the effect of offering the service with the effect of actually using it. ITT estimates the effect on all users in the implementation zone, including those who ignored the feature, and is useful for the business case on scaling. LATE (obtained through IV regression with the instrument "availability of service in the area") estimates the effect only for compliers, those who changed behavior because of the implementation. If compliance is low (e.g., 30% use consolidation), ITT is diluted rather than wrong: it equals roughly 0.3 times LATE, so it is about 3.3 times smaller than the effect on actual users of the feature. It is important to report both: ITT for predicting the overall business effect upon scaling, and LATE for understanding the value for the segment that chooses to use the feature.
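The relationship between the two estimands can be written out explicitly. The identity below holds under the usual IV assumptions (the instrument affects the outcome only through usage, and availability never makes anyone less likely to use consolidation); the symbol \pi_c for the complier share is notation introduced here for illustration.

```latex
\mathrm{ITT} = \pi_c \cdot \mathrm{LATE},
\qquad
\pi_c = P(D = 1 \mid Z = 1) - P(D = 1 \mid Z = 0)

\pi_c = 0.3
\;\Longrightarrow\;
\mathrm{LATE} = \frac{\mathrm{ITT}}{0.3} \approx 3.33 \cdot \mathrm{ITT}
```

When consolidation is unavailable outside the implementation zone, P(D = 1 | Z = 0) is zero and \pi_c is simply the usage rate inside the zone.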