Product Analytics (IT): Product Analyst

What method should be used to assess the causal effect of implementing the 'Precision Time-Slot Delivery' system on conversion and order frequency in the food delivery service, considering that the implementation occurs in waves across partner restaurants, there is self-selection based on operational efficiency (high-performing restaurants are onboarded first), and metrics are subject to seasonality and geographical demand heterogeneity?


Answer to the question

Historically, food delivery services have evolved from the '60-minute delivery' model to hyper-local logistics with precise time slots. This transition creates a methodological problem: restaurants with initially high operational efficiency (short cooking times, proximity to high-order density areas) self-select into the first waves of implementation, while problematic points are onboarded later or never. Directly comparing conversion before and after implementation leads to an overestimation of the effect, as it ignores systematic differences between early adopters and laggards.

The problem is exacerbated by geographical clustering: restaurants in the city center, where demand is high and stable, tend to gain access to the feature before peripheral locations with volatile demand. Seasonal fluctuations (e.g., holiday spikes or summer lulls) further distort observed trends, making it impossible to use a simple between-group difference in means.

To isolate the true effect, Difference-in-Differences (DiD) with restaurant and time fixed effects should be combined with Propensity Score Matching (PSM) to remove self-selection bias. In the first stage, a model of the probability of connecting to the precise-slot system is fitted on covariates (historical delivery time, rating, courier density within a radius); each treated restaurant is then matched with a control 'twin' from among those not yet connected. The double difference in conversion dynamics between these pairs is then estimated, with the fixed effects absorbing unobserved constant characteristics (e.g., kitchen quality). To account for spatial correlation, standard errors should be clustered at the level of geographical cells, or the Synthetic Control Method can be used, constructing a weighted combination of unconnected restaurants that mimics the counterfactual scenario for the treated units.
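The matching-then-differencing logic can be sketched in a few lines. This is a toy illustration, not production code: the propensity scores are assumed to come from a separately fitted logistic model on the covariates above, and all field names and values are invented.

```python
# Toy PSM-DiD sketch. "ps" is a propensity score assumed to be estimated
# elsewhere (e.g., logistic regression on cook time, rating, courier density);
# "pre"/"post" are mean conversion before and after the rollout date.
restaurants = [
    {"id": 1, "ps": 0.81, "treated": True,  "pre": 0.120, "post": 0.150},
    {"id": 2, "ps": 0.78, "treated": False, "pre": 0.118, "post": 0.128},
    {"id": 3, "ps": 0.55, "treated": True,  "pre": 0.090, "post": 0.115},
    {"id": 4, "ps": 0.52, "treated": False, "pre": 0.092, "post": 0.101},
]

def match_did(units):
    treated = [u for u in units if u["treated"]]
    controls = [u for u in units if not u["treated"]]
    effects = []
    for t in treated:
        # 1-nearest-neighbour match on the propensity score
        c = min(controls, key=lambda u: abs(u["ps"] - t["ps"]))
        # double difference: (treated post - pre) minus (control post - pre)
        effects.append((t["post"] - t["pre"]) - (c["post"] - c["pre"]))
    return sum(effects) / len(effects)

att = match_did(restaurants)
print(round(att, 4))  # average effect on the treated in the matched sample
```

In a real pipeline, matching would also enforce a caliper on the score distance and standard errors would be clustered by geographical cell, as described above.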

Real-Life Situation

In the largest federal delivery aggregator, the implementation of the 'Delivery in a Selected 15-Minute Interval' feature for premium restaurants was planned. The pilot launched in three cities, where the first 15% of partners with historically low cooking times and high ratings connected. After a month, analysts noted a 22% increase in conversion among connected restaurants, but the business was skeptical about whether this was the effect of the feature or simply a reflection of the initially high quality of these points.

Three approaches to evaluation were considered. The first option – a simple comparison of average checks and conversion before and after connection – was immediately rejected: it ignored the market's upward trend and seasonal demand surges during holidays, and it attributed the entire +22% to the feature while these restaurants had been outgrowing the market by 8-10% even without it.

The second option – a cohort analysis comparing users who saw precise delivery times with those who saw the standard '40-50 minutes' – also proved problematic: users in areas with premium restaurants had higher average checks and loyalty initially, creating selection bias. Attempting to trim the sample by geography would lead to a loss of 40% of data and reduced test power.

The third option, which was ultimately chosen, involved building a Synthetic Control for each connected restaurant based on 50 unconnected 'donors' with similar sales history, geography, and seasonality. The DiD methodology was applied to these weighted synthetic groups with additional controls for weather conditions (which affected delivery demand) and day of the week. This isolated a net effect of +9.3% on conversion and +14% on repeat order frequency and revealed heterogeneity: the effect was significant only for restaurants with cooking times under 12 minutes, whereas for slower kitchens the precise time slot did not provide a statistically significant increase, as the bottleneck remained production, not logistics.
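The core of the synthetic-control step can be illustrated on a deliberately tiny example: one treated restaurant and two donors, with the convex donor weight fitted on the pre-period only (it has a closed form in the two-donor case). All series and values are illustrative assumptions; real applications use many donors and a constrained optimiser.

```python
# Toy synthetic control: fit a convex weight w on pre-rollout conversion,
# then compare post-rollout conversion with the weighted donor combination.
treated_pre  = [0.10, 0.11, 0.12]   # treated restaurant, pre-rollout
treated_post = [0.16, 0.17]         # treated restaurant, post-rollout
donor_a_pre, donor_a_post = [0.12, 0.13, 0.14], [0.14, 0.15]
donor_b_pre, donor_b_post = [0.08, 0.09, 0.10], [0.10, 0.11]

def fit_weight(y, a, b):
    """Closed-form minimiser of sum((w*a + (1-w)*b - y)^2), clipped to [0, 1]."""
    num = sum((ai - bi) * (yi - bi) for ai, bi, yi in zip(a, b, y))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(1.0, max(0.0, num / den))

w = fit_weight(treated_pre, donor_a_pre, donor_b_pre)
synthetic_post = [w * ai + (1 - w) * bi for ai, bi in zip(donor_a_post, donor_b_post)]
gaps = [t - s for t, s in zip(treated_post, synthetic_post)]
effect = sum(gaps) / len(gaps)  # mean post-period gap vs the synthetic twin
```

The post-period gap between the treated unit and its synthetic twin is the per-restaurant effect; averaging these gaps across treated restaurants gives the aggregate estimate.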

What Candidates Often Overlook

How to verify the assumption of parallel trends in DiD when early adopters systematically differ from the control group?

Candidates often claim to apply DiD without checking its key assumption: before implementation, metric trends in the treatment and control groups must be parallel. Under self-selection this assumption is typically violated. An event study (dynamic DiD) with lead indicators for several weeks before implementation should therefore be run. If the coefficients on these leads are statistically different from zero, the trends are not parallel, and Augmented DiD or group-specific trend interactions are needed to control for differential trends. A Changes-in-Changes model can also be used; it is less sensitive to violations of parallelism but requires monotonicity of the outcome distribution.
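A stripped-down version of this pre-trend check might look as follows: instead of a full fixed-effects event-study regression, it takes the difference in group means by event time and normalises it to the last pre-period, which makes the lead coefficients directly readable. All numbers are invented for illustration.

```python
# Minimal event-study diagnostic: treated-minus-control gap by event time k
# (weeks relative to rollout), normalised to k = -1, the last pre-period.
treated = {-3: 0.100, -2: 0.104, -1: 0.108, 0: 0.130, 1: 0.134}
control = {-3: 0.095, -2: 0.099, -1: 0.103, 0: 0.105, 1: 0.107}

baseline = treated[-1] - control[-1]
coefs = {k: (treated[k] - control[k]) - baseline for k in treated}

# Lead coefficients (k < -1) near zero support parallel pre-trends;
# the k >= 0 coefficients trace the dynamic treatment effect.
print({k: round(v, 4) for k, v in sorted(coefs.items())})
```

In practice these coefficients come with confidence intervals from a regression with unit and time fixed effects; the toy version only shows what "leads indistinguishable from zero" means.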

How to account for spatial spillover effects when the implementation of precise delivery in one area affects user behavior in neighboring areas without the feature?

Analysts often ignore that users can migrate between neighborhoods or change their preferences after learning about the feature from friends. This contaminates the control group (a SUTVA violation): control outcomes absorb part of the treatment effect. To diagnose it, a Spatial DiD should be built that includes spatial lags – the concentration of connected restaurants within a 1-2 km radius of each point. If the coefficient on the spatial lag is significant, network effects are present. In that case, classical DiD yields an underestimated effect (attenuation bias), and Two-Stage Least Squares (2SLS) should be used with instruments based on administrative constraints (e.g., the technical readiness of a specific warehouse to sort by time slots) that affect a restaurant's connection but do not directly correlate with demand in neighboring areas.
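Constructing the spatial-lag covariate itself is straightforward to sketch. The coordinates and the 1.5 km radius below are illustrative assumptions; in the Spatial DiD this lag would enter the regression alongside the treatment dummy.

```python
import math

# Toy restaurant locations (km) with current feature status.
points = [
    {"id": 1, "x": 0.0, "y": 0.0, "treated": True},
    {"id": 2, "x": 1.0, "y": 0.0, "treated": False},
    {"id": 3, "x": 1.2, "y": 0.5, "treated": True},
    {"id": 4, "x": 5.0, "y": 5.0, "treated": False},
]

def spatial_lag(points, radius_km=1.5):
    """For each restaurant, the share of *other* restaurants within the
    radius that already have the feature (0.0 if it has no neighbours)."""
    lags = {}
    for p in points:
        neighbours = [q for q in points
                      if q["id"] != p["id"]
                      and math.dist((p["x"], p["y"]), (q["x"], q["y"])) <= radius_km]
        lags[p["id"]] = (sum(q["treated"] for q in neighbours) / len(neighbours)
                         if neighbours else 0.0)
    return lags

print(spatial_lag(points))
```

A restaurant surrounded by treated neighbours (lag near 1.0) is a poor control candidate; a significant coefficient on this lag in the outcome model is the spillover red flag described above.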

Why can’t simple Propensity Score Matching be used without subsequent DiD, and what errors occur when evaluating long-term effects (dynamic treatment effects)?

Junior specialists often apply PSM as a standalone method, obtaining comparable groups at time t0 but then comparing them with simple means at t1. This ignores the temporal structure of the data and potential time shocks. The correct approach is PSM-DiD: matching is used only to select the control group, while the effect is estimated through the difference in differences. Candidates also overlook dynamic effects: the impact of precise delivery may build up over time (users get used to the feature) or, conversely, fade (a novelty effect). This calls for a staggered DiD with multiple implementation periods and modern corrections for bias arising from effects that are heterogeneous over time (e.g., the Callaway & Sant'Anna or Sun & Abraham estimators for correctly aggregating cohort effects), since a standard two-way fixed-effects DiD in such cases yields a biased estimate of the average treatment effect on the treated (ATT).
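The cohort-aggregation idea can be sketched in miniature. This toy example follows the spirit of Callaway & Sant'Anna: each adoption cohort gets a clean 2x2 DiD against never-treated (or not-yet-treated) units, and the cohort ATTs are then aggregated with size weights. It omits covariates, inference, and event-time aggregation, and all numbers are illustrative.

```python
# Period-indexed mean conversion per adoption cohort; "never" = clean controls.
outcomes = {
    "cohort_t1": {0: 0.100, 1: 0.125, 2: 0.130},  # adopts at period 1
    "cohort_t2": {0: 0.095, 1: 0.098, 2: 0.118},  # adopts at period 2
    "never":     {0: 0.090, 1: 0.092, 2: 0.094},  # never treated
}
sizes = {"cohort_t1": 120, "cohort_t2": 80}  # restaurants per cohort

def cohort_att(cohort, adopt_at, outcomes):
    """2x2 DiD: cohort change from its last pre-period vs the same change
    for the never-treated controls (no already-treated units as controls)."""
    pre = adopt_at - 1
    return ((outcomes[cohort][adopt_at] - outcomes[cohort][pre])
            - (outcomes["never"][adopt_at] - outcomes["never"][pre]))

atts = {"cohort_t1": cohort_att("cohort_t1", 1, outcomes),
        "cohort_t2": cohort_att("cohort_t2", 2, outcomes)}
overall = sum(sizes[c] * atts[c] for c in atts) / sum(sizes.values())
print(round(overall, 5))  # size-weighted aggregate ATT across cohorts
```

The key design choice mirrored here is that already-treated cohorts are never used as controls for later adopters; that "forbidden comparison" is exactly what biases naive two-way fixed-effects DiD under staggered adoption.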