Product Analytics (IT): Product Analyst

How would you assess the causal effect of implementing an ML-based fraud detection system on the conversion of legitimate users and on net revenue, given that score thresholds create a discontinuity in the probability of transaction approval, and the filters cannot be completely disabled for a control group due to reputational risks and regulatory requirements?


Answer to the question

Historical Context

Traditionally, fraud prevention in digital products was based on strict rule-based systems or manual moderation, which led to high operational burden and system rigidity. With the development of machine learning, companies began to implement Real-Time Fraud Detection SDKs, which score every transaction for the probability of fraud. The key challenge is that any classifier makes two types of errors: False Positive (blocking a legitimate user) directly reduces revenue, while False Negative (missing fraud) increases chargebacks. It is critical for the business to measure the trade-off between these errors to optimize scoring thresholds.
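The trade-off described above can be made concrete with a back-of-the-envelope expected-cost calculation for choosing a scoring threshold. All numbers below (chargeback cost, margin, per-threshold error rates) are illustrative assumptions, not figures from the case:

```python
# Illustrative FP/FN trade-off: choose the score threshold that minimizes
# expected cost per transaction. All constants are assumptions, not case data.
FRAUD_RATE = 0.02         # base rate of fraudulent transactions
CHARGEBACK_COST = 120.0   # average loss per missed fraud (false negative)
LEGIT_MARGIN = 8.0        # average margin lost per blocked legitimate user (false positive)

# (threshold, share of fraud missed, share of legitimate users blocked)
operating_points = [
    (0.5, 0.02, 0.060),
    (0.6, 0.05, 0.030),
    (0.7, 0.10, 0.012),
    (0.8, 0.25, 0.004),
]

def expected_cost(miss_rate, block_rate):
    """Expected cost per transaction at the given error rates."""
    fn_cost = FRAUD_RATE * miss_rate * CHARGEBACK_COST
    fp_cost = (1 - FRAUD_RATE) * block_rate * LEGIT_MARGIN
    return fn_cost + fp_cost

for thr, miss, block in operating_points:
    print(f"threshold {thr}: expected cost per txn = {expected_cost(miss, block):.4f}")

best = min(operating_points, key=lambda t: expected_cost(t[1], t[2]))
print(f"cost-minimizing threshold: {best[0]}")  # 0.7 under these assumptions
```

The point of the exercise: the optimal threshold is not a modeling constant but the minimizer of a business cost function, which is exactly why measuring both error costs matters.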

Problem Statement

Standard A/B testing is impossible, as intentionally allowing fraudulent transactions in the control group is unacceptable from a reputation and FinCEN/PCI-DSS compliance perspective. A simple comparison of metrics before and after implementation is distorted due to the seasonality of fraud attacks and user self-selection (more loyal users are the ones updating the app). High fraud-risk users have initially different conversion rates than low-risk users, so a naive comparison between approved and rejected transactions gives a biased estimate due to confounding by indication.
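A quick simulation illustrates the confounding-by-indication point: if conversion propensity falls with fraud risk independently of any filter, then rejected users would have converted less than approved users anyway, so valuing blocked users at the approved group's conversion rate overstates the loss. The risk distribution and conversion model below are invented for illustration:

```python
import random

random.seed(0)

# Toy model of confounding by indication: conversion propensity falls with
# latent fraud risk, with or without a filter. Numbers are illustrative.
N = 200_000
pot_approved, pot_rejected = [], []
for _ in range(N):
    risk = random.random()                               # latent fraud-risk score
    would_convert = random.random() < 0.06 * (1 - risk)  # conversion if approved
    (pot_approved if risk < 0.7 else pot_rejected).append(would_convert)

cr_a = sum(pot_approved) / len(pot_approved)
cr_r = sum(pot_rejected) / len(pot_rejected)
print(f"potential conversion, below threshold (approved): {cr_a:.4f}")
print(f"potential conversion, above threshold (rejected): {cr_r:.4f}")
# The rejected group would have converted far less than the approved group
# even with no filter, so valuing blocked users at cr_a overstates the loss.
```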

Detailed Solution

The best-suited method is a sharp Regression Discontinuity Design (RDD) around the fraud-score threshold (e.g., 0.7), where the probability of approval jumps from 1 to 0. We compare transactions just below the cutoff (score ≈ 0.69, approved) with transactions just above it (score ≈ 0.71, rejected), assuming local randomness within the bandwidth window (±0.05). A local linear regression with a triangular kernel estimates the LATE (Local Average Treatment Effect) at the cutoff. To improve precision, we apply covariate-adjusted RDD, adding predictors (device history, geo) as control variables. To assess net revenue, we compute incremental revenue: prevented fraud losses (expected chargebacks) minus the revenue lost to the false positives identified through RDD.
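A minimal sketch of the estimator on simulated data with a known jump at the cutoff. All parameters are illustrative; in practice one would use a dedicated package such as `rdrobust` with data-driven bandwidth selection and bias-corrected inference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated transactions: smooth conversion trend in the fraud score plus a
# true jump of -0.03 at the cutoff. Cutoff and bandwidth follow the text.
CUTOFF, H = 0.7, 0.05
n = 200_000
score = rng.uniform(0, 1, n)
rejected = score >= CUTOFF                       # sharp rule: reject at/above cutoff
p = 0.05 - 0.02 * score - 0.03 * rejected        # conversion probability
converted = (rng.random(n) < np.clip(p, 0, 1)).astype(float)

def rdd_late(x, y, c, h):
    """Local linear regression with a triangular kernel: jump in E[y|x] at c."""
    d = x - c
    w = np.clip(1 - np.abs(d) / h, 0, None)      # triangular kernel weights
    m = w > 0
    above = (d[m] >= 0).astype(float)            # indicator: rejected side
    X = np.column_stack([np.ones(m.sum()), d[m], above, d[m] * above])
    sw = np.sqrt(w[m])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[m] * sw, rcond=None)
    return beta[2]                               # coefficient on the indicator

tau = rdd_late(score, converted, CUTOFF, H)
print(f"estimated jump in conversion at the cutoff: {tau:+.4f}")
# should land close to the true jump of -0.03
```

The separate slope terms on each side of the cutoff (`d` and `d * above`) are what make this a local linear rather than a local constant fit; that distinction matters at a boundary, where constant fits are badly biased.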

Real-life Situation

In a mobile marketplace app, after integrating a third-party Fraud Detection SDK, the overall conversion to purchase decreased from 4.2% to 3.5%, while the fraud rate dropped from 2.8% to 0.4%. The product team suspected that the system was too aggressive and was cutting off legitimate paying users, but could not quantify the scale of the problem because there was no control group.

Option A: A simple pre-post conversion comparison. Pros: minimal labor costs, does not require special infrastructure. Cons: completely ignores seasonality (the period after implementation coincided with the onset of a low season), self-selection when updating the app, and changes in the marketing mix (a new channel with low conversion was launched).

Option B: Geographic segmentation (Group A cities with the system enabled, Group B without). Pros: creates a clean control group. Cons: technically infeasible due to a unified codebase and CDN caching; users migrate between cities; fraud profiles differ significantly across regions (strong regional heterogeneity).

Option C: Regression Discontinuity Design on the continuous fraud score around the cutoff threshold of 0.65. Pros: utilizes a natural experiment, ensures local randomness, and isolates the causal effect specifically for "borderline" transactions. Cons: requires a large volume of data within the threshold window; estimates LATE, which may differ from ATE for the entire population; sensitive to score manipulation (fraudsters may learn to bypass the threshold).

Option D: Synthetic Control Method, creating a weighted combination of historical cohorts to simulate a control group. Pros: works without a physical control group, accounts for temporal trends. Cons: assumes that influencing factors are stable over time; sensitive to outliers in preprocessing; hard to validate except via placebo tests.

Option C (RDD) was chosen, with a bandwidth of 0.08 and a first-degree polynomial. The analysis showed that for transactions over 15,000 ₽ the false positive rate was twice as high as for small purchases. Based on this, dynamic thresholds by order value and product category were introduced.

Result: the team quantified that 0.6 percentage points of the 0.7-point conversion drop were attributable to false positives. After recalibrating the thresholds, 45% of the lost revenue (≈18 million ₽ per month) was recovered while retaining 90% of the fraud-prevention effectiveness.

Common Oversights by Candidates

How do you distinguish the causal effect from selection bias when users with a high fraud score would have a lower propensity to purchase even if no fraud system existed?

Answer: This is a classic case of confounding by indication, where the indication for treatment (high risk) correlates with the outcome. In RDD, it is critical to check covariate balance within the bandwidth window: compare the distributions of device age, purchase history, and geo between the groups just below and just above the threshold. If an imbalance is observed, apply bias-corrected RDD by including the covariates in the regression, or use Local Randomization approaches that formally test the hypothesis of as-good-as-random assignment near the cutoff. Without this check, the effect estimate will be confounded with pre-existing differences between high- and low-risk users.
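Such a balance check can be as simple as computing a standardized mean difference (SMD) for each covariate between the just-below and just-above windows. The data below are simulated with balanced covariates; the cutoff, bandwidth, and covariate names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated covariates around the cutoff; balanced by construction here.
CUTOFF, H = 0.7, 0.05
n = 100_000
score = rng.uniform(0, 1, n)
device_age = rng.exponential(200, n)             # days since first seen (illustrative)
past_orders = rng.poisson(2.0, n).astype(float)  # purchase history (illustrative)

below = (score >= CUTOFF - H) & (score < CUTOFF)
above = (score >= CUTOFF) & (score < CUTOFF + H)

def smd(x, g1, g2):
    """Standardized mean difference of covariate x between groups g1 and g2."""
    pooled_sd = np.sqrt((x[g1].var(ddof=1) + x[g2].var(ddof=1)) / 2)
    return (x[g1].mean() - x[g2].mean()) / pooled_sd

for name, cov in [("device_age", device_age), ("past_orders", past_orders)]:
    print(f"{name}: SMD near cutoff = {smd(cov, below, above):+.3f}")
# |SMD| well under 0.1 is consistent with local randomness; large values
# signal sorting/manipulation at the cutoff and call for bias-corrected RDD.
```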

Why does a simple comparison of approval rates between users who passed through different model versions (v1 and v2) not allow for a correct assessment of the algorithm's improvement effect?

Answer: This comparison suffers from selection bias due to unobservable factors and compositional drift. The new model v2 may be applied selectively (e.g., only to new users or in pilot regions), creating incomparable groups. Moreover, an improvement in scoring quality changes the composition of approved users: v2 may approve a "gray zone" that v1 rejected, but these users have a different conversion rate. For a correct assessment, it is necessary to use Offline Policy Evaluation with Inverse Propensity Weighting (IPW) or Doubly Robust Estimation on historical logs, estimating the counterfactual revenue that v1 would have generated on the same transactions as v2.
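A sketch of the IPW idea, under the additional assumption (not stated in the case) that the logging policy was stochastic with known, strictly positive approval propensities. Without logged propensities, plain IPW is not identified and one must fall back on model-based or Doubly Robust corrections:

```python
import random

random.seed(2)

# Simulated logs collected under a STOCHASTIC policy with known approval
# propensities. Scores, propensities, and revenue are illustrative.
logs = []
for _ in range(50_000):
    score = random.random()                   # target model's score for this txn
    p_log = 0.9 if score < 0.7 else 0.2       # logging policy's approve propensity
    approved = random.random() < p_log
    revenue = 100.0 if approved and random.random() < 0.04 else 0.0
    logs.append((score, p_log, approved, revenue))

def ipw_value(logs, target_approves):
    """IPW estimate of mean revenue per transaction under a target policy."""
    total = 0.0
    for score, p_log, approved, revenue in logs:
        if target_approves(score) and approved:
            total += revenue / p_log          # reweight matched approvals
        # if the target policy rejects, the counterfactual revenue is 0
    return total / len(logs)

v1_value = ipw_value(logs, lambda s: s < 0.7)
print(f"estimated revenue/txn under the target threshold: {v1_value:.3f}")
```

The reweighting corrects for the fact that the logging policy over- or under-sampled certain approvals relative to the target policy, which is exactly the compositional-drift problem the naive v1-vs-v2 comparison ignores.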

How to account for the delayed feedback problem, when fraud is confirmed after 30 days (chargeback), while analysts need an effect estimate within 7 days for operational decisions?

Answer: This creates a problem of censored data and asymmetric measurement. For transactions from the last 30 days, we do not know the true label (fraud / not fraud). One solution is Survival Analysis (e.g., a Cox proportional hazards model) to estimate time-to-fraud, which handles censored observations directly. Alternatively, surrogate metrics that correlate with future fraud (e.g., velocity features, a change of device fingerprint during a session) can be used as proxies. It is important to understand that false positives are visible immediately (instant decline), while false negatives surface with a delay, which inflates short-horizon estimates of the system's quality. For RDD, it is recommended to use data "frozen" with a lag of 30+ days, accepting the loss of freshness for the sake of accurate causal inference.
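The Cox model is one option; the mechanics of the censoring correction can also be shown with an even simpler maturation-curve adjustment, which reweights each observed chargeback by the share of chargebacks expected to have arrived by that transaction's age. Everything below is a toy: the true fraud rate, the exponential delay distribution, and the assumption that this distribution is known from mature (30+ day old) cohorts:

```python
import math
import random

random.seed(3)

# Maturation-curve correction for delayed chargebacks. Assumptions: true fraud
# rate 2%, chargeback delay ~ exponential with mean 10 days, delay distribution
# known from mature historical cohorts.
TRUE_FRAUD = 0.02

def delay_cdf(age_days):
    """Share of eventual chargebacks already reported by a given transaction age."""
    return 1 - math.exp(-age_days / 10)

txns = []
for _ in range(200_000):
    age = random.uniform(1, 7)              # transactions observed 1-7 days after purchase
    is_fraud = random.random() < TRUE_FRAUD
    delay = random.expovariate(0.1)         # days until chargeback, mean 10
    observed = is_fraud and delay <= age    # label visible only once the delay elapses
    txns.append((age, observed))

naive = sum(o for _, o in txns) / len(txns)
adjusted = sum(1 / delay_cdf(a) for a, o in txns if o) / len(txns)
print(f"naive fraud rate (censored):  {naive:.4f}")
print(f"maturation-adjusted estimate: {adjusted:.4f}")  # recovers ~0.02
```

The naive rate understates fraud several-fold on young transactions, which is exactly the short-horizon optimism the answer warns about.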