Product Analytics (IT): Product Analyst

What method will quantitatively assess the causal effect of replacing the 5-star rating system with a binary 'Like/Dislike' system on the volume of ratings received and the accuracy of recommendation algorithms, considering the gradual implementation across content categories, the herd behavior effect in rating, and the collaborative filtering algorithm needing a reconstruction of preference matrix, creating a period of temporal degradation of recommendation quality?


Answer

The historical context of such changes dates back to 2017, when Netflix abandoned its five-star scale in favor of binary 'thumbs up/down', followed by YouTube hiding public dislike counts. These changes were motivated by the fact that five-star ratings exhibited strong positivity inflation (scores clustering around 4-5 stars) and correlated poorly with actual content consumption. The analytical problem lies in isolating the pure effect of changing the feedback collection mechanism from confounding factors: category seasonality, self-selection of active users, and temporal degradation of collaborative filtering models due to the sparsity of new signals.

To address this, Staggered Difference-in-Differences (DiD) across content categories is used: treated categories are compared with those not yet transitioned (controls), accounting for the differing adoption times. For categories without close analogs, the Synthetic Control Method builds a weighted combination of control categories to mimic the counterfactual. The endogeneity of self-selection among users who rate is corrected through Heckman Correction or Propensity Score Matching on viewing history and tenure. Recommendation quality is evaluated via Counterfactual Evaluation with NDCG and MAP on hold-out samples, excluding a burn-in period of 2-4 weeks needed to stabilize the factor matrix.
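The staggered DiD logic can be sketched in a few lines. The sketch below uses hypothetical category names, adoption periods, and rating volumes (none of these numbers come from the case); each treated cohort is compared against never-treated categories in a simple 2x2 before/after contrast, and cohort effects are averaged.

```python
# Minimal staggered difference-in-differences sketch on synthetic data.
# Categories adopt the binary UI at different periods; categories that have
# not yet transitioned serve as controls for those that have.

adoption = {"documentaries": 2, "comedies": 4}   # hypothetical treatment start period
control_pool = ["dramas", "thrillers"]           # untreated in this window

# Hypothetical per-period rating volumes (list index = period 0..5).
volumes = {
    "documentaries": [100, 102, 150, 155, 158, 160],
    "comedies":      [ 90,  91,  92,  93, 140, 142],
    "dramas":        [ 80,  81,  82,  83,  84,  85],
    "thrillers":     [ 70,  70,  71,  72,  73,  74],
}

def did_for_cohort(cat, start, controls):
    """2x2 DiD: (treated post - treated pre) - (avg control post - pre)."""
    pre, post = start - 1, start
    t_delta = volumes[cat][post] - volumes[cat][pre]
    c_deltas = [volumes[c][post] - volumes[c][pre] for c in controls]
    return t_delta - sum(c_deltas) / len(c_deltas)

# Equal-weight average of cohort effects; production analyses would use
# a Callaway-Sant'Anna style aggregation weighted by cohort size.
effects = [did_for_cohort(c, s, control_pool) for c, s in adoption.items()]
att = sum(effects) / len(effects)
print(round(att, 2))
```

With more periods and covariates this would be replaced by a proper event-study regression, but the identifying comparison is exactly the one shown: treated deltas net of not-yet-treated deltas.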

Real-life Situation

The streaming service 'CinemaFlow' planned to replace its outdated five-star system with a binary one to increase engagement. The key concern was twofold: the team suspected the recommendations would lose predictive power because of the reduced granularity of signals, and feared a sharp drop in activity among users accustomed to the detailed scale. They needed an evaluation method that accounted for the gradual rollout by genre (documentaries first, then comedies) and for network effects, where the visibility of existing ratings affected new users' willingness to vote.

A classic A/B test with user_id-level randomization was considered. The pros were the cleanliness of the experiment and the simplicity of interpreting the causal effect. The cons were critical: the collaborative filtering algorithm lost integrity because two types of signals were mixed in one matrix, creating artifacts in recommendations for both groups; there was a risk of cross-contamination through social features (users could see friends' ratings from the other group); and the business feared negative reactions to a fragmented UX within a single product.

An alternative was a pre/post analysis comparing metrics before and after the transition for each category separately. The pros were technical simplicity and no need to keep the old system running for part of the users. The cons: it cannot separate the intervention effect from seasonal fluctuations in viewing (e.g., Christmas movies are rated differently in December), and it ignores both the herd behavior effect and the self-selection of early adopters of the new system, leading to biased estimates.

A hybrid approach of Staggered DiD with Synthetic Controls and Instrumental Variables was chosen. This method allowed categories that had not yet transitioned to the binary system to serve as controls for those that had, correcting for temporal trends. Synthetic Control compensated for heterogeneity between genres, while the IV approach, using the time of day of content publication as an instrument (off-peak hours have fewer users online and therefore weaker herding), helped isolate the pure effect of the rating interface. The choice was driven by the need to keep the recommendation system functional during the transition and to obtain unbiased estimates under partial data availability.
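For a binary instrument like off-peak vs. peak posting hours, the just-identified IV estimate reduces to the Wald ratio: the jump in the outcome across instrument values divided by the jump in the endogenous regressor. A minimal sketch on toy data (all values invented for illustration; z, x, and y are assumed names):

```python
# Wald IV estimator sketch: binary instrument z (1 = posted off-peak,
# assumed to shift herd exposure but not preferences directly),
# endogenous x (visible rating count at decision time),
# outcome y (whether the user voted). Toy data, purely illustrative.

z = [1, 1, 1, 0, 0, 0, 1, 0]
x = [2, 3, 2, 8, 9, 7, 3, 8]
y = [0, 0, 1, 1, 1, 1, 0, 1]

def cond_mean(vals, mask):
    sel = [v for v, m in zip(vals, mask) if m]
    return sum(sel) / len(sel)

on  = [zi == 1 for zi in z]
off = [zi == 0 for zi in z]

# Wald ratio = reduced-form effect / first-stage effect:
num = cond_mean(y, on) - cond_mean(y, off)   # effect of z on voting
den = cond_mean(x, on) - cond_mean(x, off)   # effect of z on visible ratings
beta_iv = num / den                          # causal dy/dx under IV assumptions
print(round(beta_iv, 3))
```

The same estimate would come from 2SLS with one instrument and one endogenous variable; the ratio form just makes the exclusion-restriction logic explicit.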

The final result: the volume of ratings increased by 220% thanks to reduced cognitive load, while recommendation accuracy (measured as NDCG@10) fell by 12% in the first three weeks. This period corresponded to the retraining of the Matrix Factorization model, after which the metrics recovered to baseline as matrix density increased. Based on these data, the product team decided on a full rollout with an additional cold-start budget for new users.

What candidates often overlook


How to properly account for the period of degradation of recommendation quality during model retraining and separate it from the true effect of the new system?

Answer: It is necessary to formalize the concept of a 'burn-in period', usually 2-4 weeks, during which recommendation quality metrics are excluded from the main causal analysis. Use Counterfactual Evaluation on historical hold-out sets, comparing offline metrics (NDCG, MAP, Precision@K) before and after the transition, stratified by user activity level. It is important to track coverage and diversity metrics separately from accuracy, since binary signals can amplify popularity bias when regularization is insufficient.
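The ranking metric mentioned above is standard; a minimal NDCG@K implementation for binary (like/dislike) relevance labels looks like this (the example ranking is invented):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the observed ranking / DCG of the ideal reordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Binary relevance from a Like/Dislike system, in the order the
# recommender ranked the items for one user:
ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(round(ndcg_at_k(ranking, 10), 3))
```

Averaging this per-user score over a hold-out set, separately for pre- and post-transition snapshots and per activity stratum, gives the offline comparison described in the answer.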


How to handle endogeneity of self-selection of users willing to leave ratings under the new system, and distinguish their behavior from the effect of the interface itself?

Answer: Users who rate content under the binary system systematically differ from 'star' raters (they skew toward expressing extreme preferences). Apply Heckman Correction (a two-step model with a selection equation) or Inverse Probability Weighting based on propensity scores estimated from observed characteristics (viewing history, tenure, session time). As an Instrumental Variable, use random variations in the interface (e.g., the order of like/dislike button placement) or A/B-tested visibility of aggregated ratings to isolate the pure effect of the data collection mechanism.
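The IPW step mentioned above can be sketched directly. In this toy example the propensity scores are assumed to be already estimated (in practice they would come from a logistic model on viewing history, tenure, etc.); all numbers are invented:

```python
# Inverse Probability Weighting sketch for self-selection of raters.
# t: 1 = user left ratings under the new UI, 0 = did not
# y: outcome of interest, e.g. weekly sessions (hypothetical values)
# p: assumed pre-computed propensity P(rates | covariates)

t = [1, 1, 1, 0, 0, 0]
y = [5.0, 6.0, 5.5, 3.0, 3.5, 2.5]
p = [0.8, 0.7, 0.75, 0.3, 0.25, 0.2]

n = len(t)
# Horvitz-Thompson style IPW estimate of the average treatment effect:
treated_term = sum(ti * yi / pi for ti, yi, pi in zip(t, y, p)) / n
control_term = sum((1 - ti) * yi / (1 - pi) for ti, yi, pi in zip(t, y, p)) / n
ate_ipw = treated_term - control_term
print(round(ate_ipw, 3))
```

Weights of the form 1/p (treated) and 1/(1-p) (control) up-weight units that were unlikely to select into their observed group, which is exactly how the self-selection bias is removed; extreme propensities should be clipped in real use to avoid exploding weights.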


How to quantitatively assess the herd behavior effect and separate it from true user preference when analyzing the volume of ratings?

Answer: Split users into 'first movers', who see an empty rating counter, and 'followers', who see a non-zero number of votes. Apply Regression Discontinuity Design (RDD) around rating-visibility thresholds (e.g., when content enters a category top-10). Compare the likelihood of rating among users who see the aggregated result with those who see a 'be the first' prompt. For dynamic adjustment, use Thompson Sampling or other Bayesian methods to estimate true content quality, filtering network effects via the time lag between publication and rating.
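A sharp RDD at such a visibility threshold fits a line on each side of the cutoff and reads the herding effect off the jump between the two fits. A minimal sketch on invented data (threshold value, vote counts, and rating shares are all hypothetical):

```python
# Sharp RDD sketch around a rating-visibility threshold c: items with
# prior vote count x >= c display the aggregated score ("treatment").
# The jump in rating propensity y at the cutoff is the herding estimate.

c = 10.0  # hypothetical votes needed before the counter becomes visible

def fit_line(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# (x, y): x = prior vote count, y = share of viewers who then rate.
left  = [(6, 0.10), (7, 0.11), (8, 0.12), (9, 0.13)]      # "be the first"
right = [(10, 0.20), (11, 0.21), (12, 0.22), (13, 0.23)]  # counter visible

a_l, b_l = fit_line([x for x, _ in left],  [y for _, y in left])
a_r, b_r = fit_line([x for x, _ in right], [y for _, y in right])

# Predicted outcome at the cutoff from each side; the gap is the RDD estimate.
effect = (a_r + b_r * c) - (a_l + b_l * c)
print(round(effect, 3))
```

In practice one would restrict to a narrow bandwidth around the cutoff and check covariate smoothness, but the estimand is the same: the discontinuity in rating propensity that only the visibility switch can explain.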