
When establishing robust outlier thresholds for sensor telemetry, how would you calculate the **median absolute deviation (MAD)** per category using strictly **ANSI SQL** window functions without correlated subqueries?


Answer to the question

History of the question

The median absolute deviation (MAD) was introduced by Gauss in 1816 as a robust measure of statistical dispersion, later formalized by Hampel in the 1970s for outlier-resistant analytics. Unlike standard deviation, which squares deviations and is therefore hypersensitive to extreme values, MAD tolerates up to 50% contaminated data without distortion. In ANSI SQL, calculating MAD became practical with the SQL:2003 standard, which introduced ordered-set aggregate functions such as PERCENTILE_CONT, enabling declarative median calculations without procedural loops.

The problem

Computing MAD requires a nested median operation: first determine the median of the dataset, then find the median of the absolute differences between each observation and that median. In ANSI SQL, this is challenging because referencing an aggregate result within the same SELECT clause to calculate individual deviations necessitates a self-join or correlated subquery, both of which degrade performance on large time-series datasets. Furthermore, standard STDDEV functions produce inflated thresholds when sensor data contains transmission spikes or calibration errors, making the robust MAD essential for accurate anomaly detection.
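Before looking at the SQL, the two-stage logic is easy to see in plain Python. This is a minimal sketch of the nested-median computation per category; the function name and sample data are illustrative, not from the article:

```python
from statistics import median

def mad_by_category(rows):
    """Two-stage MAD: per-category median, then median of absolute deviations.

    `rows` is a list of (category, value) pairs.
    """
    by_cat = {}
    for cat, val in rows:
        by_cat.setdefault(cat, []).append(val)
    result = {}
    for cat, vals in by_cat.items():
        med = median(vals)                                # stage 1: group median
        result[cat] = median(abs(v - med) for v in vals)  # stage 2: median of |x - median|
    return result
```

The SQL challenge is exactly the dependency visible here: stage 2 cannot start until stage 1's aggregate is available for every row.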

The solution

Employ a Common Table Expression (CTE) pipeline to separate the computation into logical stages. First, use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY category) to calculate the median per group. Second, compute the absolute deviation for each row relative to its group median. Finally, apply PERCENTILE_CONT again to these deviations to derive the MAD. This method is purely set-based, leverages the database engine's optimizer for window functions, and avoids row-by-row processing.

```sql
WITH group_medians AS (
    SELECT
        sensor_id,
        reading,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY reading)
            OVER (PARTITION BY sensor_id) AS median_val
    FROM telemetry
),
deviations AS (
    SELECT
        sensor_id,
        ABS(reading - median_val) AS abs_dev
    FROM group_medians
)
SELECT DISTINCT
    sensor_id,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY abs_dev)
        OVER (PARTITION BY sensor_id) AS mad
FROM deviations;
```

Situation from life

A manufacturing plant deployed thousands of vibration sensors on conveyor belts to predict bearing failures. Static alert thresholds failed because winter temperatures produced naturally lower baselines than summer, causing false positives in cold months and missed alerts in hot months. The engineering team needed a statistical method that adapted to each sensor's unique historical distribution without being skewed by occasional transmission glitches.

The team considered three architectural approaches.

Client-side statistical processing involved exporting daily CSV dumps and analyzing them in Python with the Pandas and SciPy libraries. This offered rich statistical functions and rapid prototyping, but it introduced a 24-hour data latency and created security risks by moving sensitive operational data outside the SQL database firewall.

Procedural SQL solutions utilized cursors and temporary tables to iterate through each sensor's history, sorting values to identify the middle row. This approach worked on legacy systems lacking modern window functions, yet it suffered from severe performance degradation due to O(n²) complexity and excessive locking contention, taking over 45 minutes to process one million rows.

ANSI SQL window functions, implemented via CTEs, computed the medians in a single set-based pass using PERCENTILE_CONT. This solution executed entirely within the database engine in under 800 milliseconds against 50 million records, minimized network overhead, and leveraged optimizer parallelism, though it required SQL:2003 or newer compliance.

The team selected the ANSI SQL window function approach because it balanced real-time performance with strict data governance requirements that prohibited data export. The resulting MAD values established dynamic thresholds where any reading exceeding median ± 3 * MAD triggered immediate maintenance alerts. This reduced false positives by 94% and detected three impending bearing failures two days earlier than the previous static system.
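The alert rule described above can be sketched in a few lines of Python. The function name, the default multiplier, and the sample readings are hypothetical; the logic mirrors the median ± 3 * MAD threshold:

```python
from statistics import median

def robust_outliers(readings, k=3):
    """Flag readings outside median ± k * MAD (k=3 matches the team's rule)."""
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    lo, hi = med - k * mad, med + k * mad
    return [r for r in readings if r < lo or r > hi]
```

Because both the center and the spread come from medians, a transmission glitch in the history does not widen the band and hide genuine anomalies.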

What candidates often miss

Why is MAD preferred over standard deviation for anomaly detection in SQL-based telemetry systems?

Standard deviation calculates the square root of the average squared deviation from the mean, a metric that explodes when outliers exist because squaring amplifies large distances. In contrast, MAD is built on the median, an estimator with a 50% breakdown point: up to half the observations can be arbitrarily corrupted before the estimate is distorted. For ANSI SQL implementations, this means that a single sensor malfunction sending a value of 9999 will grossly inflate STDDEV but leave MAD nearly unchanged, preventing false threshold inflation that masks future subtle anomalies.
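The 9999-spike effect is easy to verify numerically. A quick Python illustration with made-up readings, using the standard library's `pstdev` (population standard deviation) for comparison:

```python
from statistics import median, pstdev

readings = [20, 21, 19, 20, 22, 9999]   # one transmission spike

med = median(readings)                    # 20.5, barely moved by the spike
mad = median(abs(r - med) for r in readings)
sd = pstdev(readings)                     # dragged far upward by the spike
```

Here MAD stays at 1.0 while the standard deviation lands in the thousands, so a ±3σ threshold would admit almost any future reading as "normal".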

How do PERCENTILE_CONT and PERCENTILE_DISC differ when calculating medians for discrete sensor readings, and which should you use for MAD?

PERCENTILE_CONT(0.5) performs linear interpolation between the two central values when the row count is even, returning a hypothetical value that might not exist in your table (e.g., averaging 20 and 30 to return 25). PERCENTILE_DISC(0.5) returns the smallest actual value from the dataset whose cumulative distribution is greater than or equal to 0.5. For MAD calculation on discrete integer sensor readings, PERCENTILE_DISC is often safer because it guarantees the threshold corresponds to a real observed measurement, avoiding fractional deviations that complicate interpretation.
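The two semantics can be mimicked in Python. These helper functions are illustrative approximations of the SQL behavior (linear interpolation at position p * (n - 1) for CONT; smallest value whose cumulative distribution reaches p for DISC), not the standard's formal definition:

```python
def percentile_cont(values, p=0.5):
    """Linear-interpolation percentile, mirroring PERCENTILE_CONT semantics."""
    s = sorted(values)
    idx = p * (len(s) - 1)
    lo = int(idx)
    frac = idx - lo
    if frac == 0:
        return s[lo]
    return s[lo] + frac * (s[lo + 1] - s[lo])

def percentile_disc(values, p=0.5):
    """Smallest actual value whose cumulative distribution >= p (PERCENTILE_DISC)."""
    s = sorted(values)
    for i, v in enumerate(s, start=1):
        if i / len(s) >= p:
            return v
```

With the article's example of two central values 20 and 30, the CONT variant interpolates to 25, a value that never occurred, while the DISC variant returns the observed reading 20.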

Can MAD be calculated without CTEs using a single self-join, and what are the performance trade-offs?

Yes, but it is inefficient. You can self-join the table on sensor_id to compare every row against every other row to find the median, but this results in O(n²) complexity. Alternatively, using a derived subquery to calculate the median first, then joining back to calculate deviations, forces the database to materialize intermediate results or re-scan the table multiple times. CTEs allow the optimizer to treat the median calculation as a spool or worktable that is computed once and reused, typically resulting in a single sort operation and O(n log n) complexity dominated by that sort. Candidates often forget that database optimizers can transform CTEs into internal worktables, making them more efficient than correlated subqueries in the SELECT list.