SQLProgrammingSenior SQL Developer

Which subtle interaction between SQL's three-valued logic and PostgreSQL's EXCLUDED pseudo-table prevents ON CONFLICT updates from detecting changes involving NULL values in multi-column unique constraints?

Pass interviews with Hintsage AI assistant

Answer to the question.

The EXCLUDED pseudo-table in PostgreSQL represents the row proposed for insertion during an ON CONFLICT operation. Historically, developers migrating from MySQL or Oracle environments often assume that direct equality comparisons (=) suffice for detecting value changes within upsert patterns. However, SQL standard three-valued logic dictates that NULL represents an unknown state, meaning NULL = NULL evaluates to NULL (unknown), not TRUE.

This creates a critical problem when the conflict resolution clause attempts to optimize updates by adding a WHERE clause such as WHERE EXCLUDED.phone != users.phone. If both the existing row and the proposed row contain NULL for the phone column, the comparison returns NULL, which fails the WHERE predicate. Consequently, the database skips the update even though the values might be genuinely different in the context of business logic, or it cannot distinguish between a NULL in the new data versus a NULL in the old data.

The solution involves utilizing the IS DISTINCT FROM operator, which treats NULL as a comparable value. By structuring the update clause with WHERE EXCLUDED.column IS DISTINCT FROM table.column, the comparison returns FALSE when both values are NULL (indicating no change) and TRUE when one is NULL and the other is not. This ensures deterministic behavior while preventing unnecessary writes.

INSERT INTO patient_records (clinic_id, external_id, phone, updated_at) VALUES (101, 'P-2024-001', NULL, NOW()) ON CONFLICT (clinic_id, external_id) DO UPDATE SET phone = COALESCE(EXCLUDED.phone, patient_records.phone), updated_at = EXCLUDED.updated_at WHERE EXCLUDED.phone IS DISTINCT FROM patient_records.phone;

Situation from life

A hospital network needed to synchronize daily patient intake data from 50 external clinics into a central PostgreSQL data warehouse. Each clinic exported CSV files where missing patient phone numbers appeared as empty strings, which the COPY command converted to NULL during ingestion. The existing Python ETL script used SQLAlchemy to execute bulk upserts with ON CONFLICT (clinic_id, external_id) DO UPDATE SET phone = EXCLUDED.phone.

The problem emerged when clinic staff reported that valid phone numbers entered directly into the central system were mysteriously disappearing after the nightly sync. Investigation revealed that when the external feed sent NULL (indicating an unknown phone), it overwrote existing valid numbers because the SET clause executed unconditionally. Adding a naive filter WHERE EXCLUDED.phone != patient_records.phone failed because when both were NULL, the comparison returned NULL (unknown), causing the update to be skipped incorrectly, and when the new value was NULL and old was not, the logic behaved inconsistently across different PostgreSQL minor versions.

Three solutions were evaluated.

The first approach used COALESCE exclusively in the SET clause: SET phone = COALESCE(EXCLUDED.phone, patient_records.phone). This prevented overwriting with NULL but forced an update on every conflict, triggering expensive B-Tree index rebuilds on the phone column and firing audit triggers that logged "no-op" changes as legitimate modifications. This increased WAL (Write-Ahead Log) traffic by 300%, threatening replication lag and saturating the disk I/O.

The second solution attempted explicit boolean logic to handle NULL: WHERE (EXCLUDED.phone != patient_records.phone) OR (EXCLUDED.phone IS NULL AND patient_records.phone IS NOT NULL). While logically correct, this verbose pattern required careful maintenance across 15 nullable columns and confused the query optimizer. The planner abandoned index scans in favor of sequential scans on the 20-million-row table, causing the ETL job to exceed its six-hour maintenance window.

The third solution implemented IS DISTINCT FROM for all nullable columns in the WHERE clause. This provided a concise, sargable predicate that correctly identified genuine data changes including NULL transitions. It allowed updates only when necessary, eliminating superfluous trigger executions and WAL generation while maintaining consistent query plans.

The team chose the third solution for critical contact fields and the first solution for non-critical metadata where overwrite protection mattered more than performance. The result was dramatic: the sync job duration dropped from 45 minutes to 12 minutes, replication lag stabilized below five seconds, and the "disappearing phone number" incidents ceased completely within the first week of deployment.

What candidates often miss

Why does WHERE EXCLUDED.column != table.column skip rows when both values are NULL, and how does this interact with PostgreSQL's update mechanism?

Many candidates assume that if two NULLs are not equal, the comparison should return TRUE and allow the update. However, SQL uses three-valued logic: NULL represents an unknown value. Any comparison with NULL (including NULL = NULL or NULL != NULL) results in NULL (unknown), not a boolean TRUE or FALSE. In PostgreSQL's WHERE clause, only rows evaluating to TRUE proceed; NULL is treated as FALSE. Thus, when comparing two NULL phone numbers, the result is NULL, the update is skipped, and the system incorrectly assumes no change is needed. IS DISTINCT FROM returns FALSE for NULL vs NULL, correctly indicating they are identical and skipping the update only when appropriate, while returning TRUE when one value is NULL and the other is not.

How does the order of columns in a multi-column unique constraint affect the performance of the ON CONFLICT resolution, and what happens if the conflict target doesn't match the index definition exactly?

Candidates frequently overlook that PostgreSQL requires the conflict target (the columns listed in ON CONFLICT (...)) to match a unique index definition precisely, including column order and any functional expressions. If a unique index exists on (clinic_id, external_id) but the query specifies ON CONFLICT (external_id, clinic_id), the planner may fail to infer the index, throwing a "there is no unique or exclusion constraint matching the ON CONFLICT specification" error. Even if it succeeds, mismatched column order can prevent the optimizer from using an index-only scan to locate the conflicting tuple, forcing a heap fetch and increasing I/O cost significantly.

What is the difference between using COALESCE(EXCLUDED.column, table.column) in the SET clause versus using WHERE EXCLUDED.column IS DISTINCT FROM table.column, particularly regarding trigger execution and row versioning?

Using COALESCE in the SET clause unconditionally writes a value to the row (either the new data or the preserved old data). This operation generates a new row version (CTID), writes to the WAL, and fires all BEFORE and AFTER triggers associated with the table, even if the final value remains identical to the previous state. This creates "noise" in audit tables and increases replication load. Conversely, the WHERE clause with IS DISTINCT FROM prevents the row modification entirely if no actual change occurred. No new tuple version is created, triggers do not fire, and WAL generation is avoided. This distinction is critical for high-throughput systems with audit logging or foreign key cascades, where "no-op" updates create significant overhead.