ProgrammingData Engineer

How to implement efficient processing and storage of hierarchical (tree-like) data in SQL? What methods exist for this, and how to choose the appropriate one for the task?

Pass interviews with Hintsage AI assistant

Answer.

Working with hierarchical structures is a classic case for relational databases. Examples include: directories, tree menus, organizational structures.

Background: Databases use a tabular model. There are several approaches for tree-like data, each with its advantages and disadvantages.

Problem: The standard parent_id model leads to slow recursive SELECTs. Real tasks (finding all descendants, paths, subtree counting) require optimization.

Solutions:

  • Adjacency List — a simple reference to parent_id.
  • Materialized Path — a string representing the full path.
  • Nested Sets — storing left/right boundaries (left/right) which allows for easy retrieval of subtrees.
  • Closure Table — a separate table of relationships (from->to) reflecting all the relationships in the tree.

Example of code for Materialized Path (PostgreSQL):

CREATE TABLE categories ( id SERIAL PRIMARY KEY, name TEXT, path TEXT ); -- Example of storing path: '1/2/5/' (root/subcategory/current) SELECT * FROM categories WHERE path LIKE '1/2/%'; -- all descendants of 2

Example of code for Nested Sets:

CREATE TABLE nested_categories ( id SERIAL PRIMARY KEY, name TEXT, lft INT NOT NULL, rgt INT NOT NULL ); -- Child nodes SELECT * FROM nested_categories WHERE lft > 2 AND rgt < 15;

Key features:

  • Adjacency List is convenient for simple tree structures (low nesting).
  • Materialized Path — fast subtree retrieval, easy to implement.
  • Nested Sets — instant access to all descendants, fast reads, but complex modification.

Trick questions.

Can we simply use a nested SELECT with parent_id to find all descendants to any depth?

This works only for shallow depths. For recursive searching, either a recursive CTE (WITH RECURSIVE) or more complex schemes are required, as simple JOINs lead to an enormous number of requests and poor performance.

Example:

WITH RECURSIVE cte AS ( SELECT id, parent_id, name FROM categories WHERE id = 1 UNION ALL SELECT c.id, c.parent_id, c.name FROM categories c INNER JOIN cte ON c.parent_id = cte.id ) SELECT * FROM cte;

Why does the molecular tree (Nested Sets) quickly extract a subtree, but slowly add/remove nodes?

When modifying the tree, the lft/rgt values must be changed for many rows (the change step — all values greater than the insert/delete). The approach is ideal for reading, but for frequent changes, consider other methods.

When to use a Closure Table, instead of parent_id or materialized path?

Closure Table works perfectly for multiple requests for descendants at any level and regular relationship counting. The downside is that it requires more space.

Example:

CREATE TABLE closure ( ancestor INT, descendant INT, depth INT );

Typical mistakes and anti-patterns

  • Storing hierarchy only through parent_id, when fast searching for nested structures is needed.
  • Manually recalculating paths or lft/rgt without auxiliary functions.
  • Lack of indexing for key columns (parent_id/path/lft/rgt).

Real-life example

Negative case

In an online store, categories are implemented through parent_id. All nesting is done manually, searching for subcategories involves nested SELECTs.

Pros:

  • Simplicity, no advanced support required.

Cons:

  • Performance drops even at 3-4 levels of nesting.
  • Node moving operations are non-intuitive, leading to logical errors.

Positive case

Uses materialized path or closure table. All queries for nested categories are instantaneous, recalculation is done through batch scripts.

Pros:

  • Scalability.
  • Fast nested selections.

Cons:

  • Requires additional planning.
  • Increases load during structural changes.