ProgrammingBI Analyst

How to optimally implement conditional selections with filtering by a list of values (IN/NOT IN) and missing data (NULL) for analytics in large tables?

Pass interviews with Hintsage AI assistant

Answer.

Background of the question

It is often necessary to create queries where the filtering criteria are defined dynamically — the list of values is specified by the user of the interface, and missing values (NULL) should also be able to be included or excluded from the result. Such a query must work quickly on large volumes and be semantically correct from the SQL perspective.

Problem

Incorrectly combining the conditions IN, NOT IN, and checking for NULL leads to unexpected results because in SQL the expression "NULL IN (...)" always returns UNKNOWN, and "NOT IN" may lead to an empty result if at least one of the elements in the list is NULL. Performance on large data with poorly written conditions can drop sharply.

Solution

The correct combination looks like this:

  • If selecting values from the list, complement with a condition for NULL if they need to be returned:
SELECT * FROM sales WHERE region IN ('Moscow','Samara') OR region IS NULL;
  • For negation, use explicit logic:
SELECT * FROM sales WHERE (region NOT IN ('Moscow','Samara') OR region IS NULL);
  • For strict exclusion of missing values:
SELECT * FROM sales WHERE region NOT IN ('Moscow','Samara') AND region IS NOT NULL;
  • For large tables, indexing coverage is often relevant, especially if filtering by list is a common operation.

Key features:

  • Manage filtering by NULL explicitly, otherwise rows are lost.
  • Proper use of IN and NOT IN is critical for correct selections.
  • Indexing on fields involved in IN/NOT IN greatly speeds up filtering.

Trick questions.

What happens if you check "WHERE field NOT IN ('a', NULL)"?

The result is always empty, as any row compared to NULL in the list returns UNKNOWN and filters everything out.

Code example:

SELECT * FROM test WHERE name NOT IN ('Ivan', NULL);

Will "IN (NULL,...)" return a value for rows with NULL?

No. The expression NULL IN (...) always returns UNKNOWN, not TRUE.

How to speed up filtering by a large list of values?

Use indexes and preferably apply JOIN with a temporary table if the list is large (hundreds/thousands of values) — this speeds up filtering compared to a long IN.

Typical mistakes and anti-patterns

  • Writing only IN/NOT IN without an EXPLICIT condition for NULL
  • Not considering the peculiarities of executing NOT IN with NULL
  • Not optimizing indexing for the filtering field

Real-life example

Negative case

In an analytical report, "WHERE city NOT IN ('London', field_2, NULL)" was applied. The result — completely empty, no one could figure out the cause of lost rows for a long time.

Pros:

  • Short and "readable" query

Cons:

  • Data loss, non-obvious logic for developers

Positive case

In a CRM, a list of customer regions was selected and an additional button "Include empty values" was added. Filtering logic:

... WHERE city IN ('SPB','NNov') OR city IS NULL

Pros:

  • Flexible selection, correct accounting of data incompleteness
  • Works quickly due to indexing

Cons:

  • Logic needs to be tested with every model change