Answer to the question

History of the question

The Nested Set Model was popularized by Joe Celko in the 1990s as a method to represent tree structures in SQL without recursive joins. By assigning each node a left (lft) and right (rgt) integer boundary, the model allows entire subtrees to be selected via simple range queries. However, because the standard does not enforce interval integrity constraints, concurrent bulk inserts or legacy migration errors frequently introduce corruptions where intervals overlap partially, destroying the hierarchical semantics. This question arises in data warehousing scenarios where hierarchies must be validated before powering OLAP cubes or recommendation engines.

The problem

In a valid nested set, any two nodes must be either disjoint (a.rgt < b.lft) or in a strict containment relationship (a.lft < b.lft AND a.rgt > b.rgt). A partial overlap—where a.lft < b.lft but a.rgt falls between b.lft and b.rgt—creates a logical impossibility where a node appears to be both a sibling and a descendant, breaking subtree queries. Detecting these violations requires comparing every pair of intervals to find intersections that lack proper envelopment, which is computationally expensive if done procedurally.

The solution

We employ a self-join using interval algebra to identify crossings without containment. The predicate detects when node a starts before node b but ends inside b’s range, indicating a partial overlap.

SELECT a.id AS violating_node_a, 
       b.id AS violating_node_b,
       a.lft AS a_left, a.rgt AS a_right,
       b.lft AS b_left, b.rgt AS b_right
FROM nested_set a
JOIN nested_set b 
  ON a.lft < b.lft           -- a starts before b
  AND a.rgt > b.lft          -- a ends after b starts (intersection)
  AND a.rgt < b.rgt          -- but a ends before b ends (no containment)
WHERE a.id < b.id;           -- Avoid symmetric duplicates

This query returns all pairs of nodes that intersect illegally. To make it runnable in read-heavy production environments, composite indexes on (lft, rgt) and (rgt, lft) enable index-only scans, reducing complexity from O(n²) full table scans to O(n log n) range lookups.

Situation from life

During a zero-downtime migration of a retail product taxonomy from a legacy IBM DB2 system into a PostgreSQL data warehouse, the engineering team chose the Nested Set Model to support fast category roll-up queries for the analytics dashboard. The ETL pipeline calculated lft and rgt values using batched window functions, but race conditions in the legacy system’s export API produced off-by-one errors, resulting in 147 overlapping category intervals. These overlaps caused products to be double-counted in revenue reports, inflating projected sales by 12%.

Three remediation strategies were evaluated.

Strategy 1: Procedural validation using a cursor. A PL/pgSQL function iterated through every node, recursively checking for overlaps by comparing each node against all others with higher lft values. While conceptually simple, this approach acquired row-level locks and took 38 minutes to complete on 2.3 million categories, violating the strict five-minute freshness SLA for inventory updates. It was rejected due to unacceptable concurrency degradation.

Strategy 2: Reconstruction via recursive CTE. We considered abandoning the nested set model entirely and rebuilding the tree from an adjacency list using a recursive CTE to generate new, correct intervals. This would fix the corruption but required a full table rewrite and temporary suspension of the catalog API, effectively taking the product search offline. It also treated the symptom rather than identifying the specific corrupt records for audit purposes.

Strategy 3: Set-based interval algebra (ANSI SQL). We implemented the self-join query using strictly standard SQL predicates. By leveraging B-tree indexes on the interval columns, the validation completed in 4.2 seconds, identifying exactly which 147 node pairs violated the containment rules. This allowed us to quarantine only the affected subcategories for manual correction while keeping the rest of the catalog live.

We chose Strategy 3 because it provided surgical precision without blocking writers or requiring downtime. The result was a clean taxonomy where interval boundaries formed strict supersets, restoring referential integrity and ensuring that subsequent ACID-compliant updates could not introduce new overlaps due to added database constraints.

What candidates often miss

How do you distinguish between a valid parent-child containment relationship and an invalid partial overlap when writing the join predicate?

Candidates often conflate intersection with containment. They write a.lft < b.lft AND a.rgt > b.lft (which only checks for intersection) and mistakenly believe this detects violations. The critical detail is the exclusivity of the endpoint: for strict containment, the parent’s right boundary must exceed the child’s right boundary (a.rgt > b.rgt). A partial overlap occurs specifically when a.lft < b.lft AND a.rgt > b.lft AND a.rgt < b.rgt. Beginners frequently miss the third condition, causing the query to return false positives for valid parent-child pairs. Additionally, they forget to exclude self-pairs (a.id != b.id) or handle symmetric duplicates by enforcing a.id < b.id, resulting in redundant violation reports.

Why might a node appear to have no overlaps yet still be orphaned from the root, and how do you detect this using only set operations?

An orphaned node exists when no single row contains its entire interval (lft, rgt), meaning it has no path to the root. Candidates often try to solve this with a LEFT JOIN looking for NULL parents, but this fails to distinguish the legitimate root node (which should have no parent) from true orphans. The correct approach uses NOT EXISTS with an exclusion for the global root:

SELECT c.id
FROM nested_set c
WHERE NOT EXISTS (
  SELECT 1 FROM nested_set p
  WHERE p.lft < c.lft AND p.rgt > c.rgt
)
AND c.lft != (SELECT MIN(lft) FROM nested_set);

The edge case beginners miss is the multi-root scenario: if the table accidentally contains two separate trees, the node with the second-smallest lft might appear to have no parent if we only check for lft minimum. The query must explicitly identify the single root (minimum lft) and exclude it; otherwise, it falsely flags the secondary root as an orphan.

How would you detect a multi-parent violation—where a node is contained by two separate ancestors that are not hierarchically related—using strictly ANSI SQL without window functions?

This is distinct from overlap detection because the two ancestors are disjoint (valid siblings), yet both improperly contain the same child node. This violates the tree property (single parent) but does not create an interval overlap between the ancestors. Candidates often try to GROUP BY child_id HAVING COUNT(*) > 1 on all ancestors, but this fails because a valid deep node naturally has many ancestors (grandparents, etc.). The solution requires identifying the immediate parent: the ancestor with the maximum lft value that is still less than the child’s lft.

SELECT c.id AS child_id
FROM nested_set c
JOIN nested_set p ON p.lft < c.lft AND p.rgt > c.rgt
WHERE NOT EXISTS (
  SELECT 1 FROM nested_set p2
  WHERE p2.lft > p.lft AND p2.rgt < p.rgt
    AND p2.lft < c.lft AND p2.rgt > c.rgt
)
GROUP BY c.id
HAVING COUNT(*) > 1;

The subquery filters for immediate parents by ensuring no intermediate node exists between the candidate and the child. Beginners miss this "closest ancestor" logic, instead counting all containers and incorrectly flagging every deep node as a violation.

When auditing a Nested Set Model hierarchy for data corruption, how do you detect invalid interval overlaps—where two nodes partially intersect without hierarchical containment—using strictly ANSI SQL set predicates without procedural iteration?

Answer to the question

Situation from life

What candidates often miss