
How do you implement separate handling of errors and failed operations during a bulk data import into SQL, so that information about problematic rows is not lost?

Answer.

The need to log erroneous rows during bulk imports arose almost as soon as automated loading of large data volumes into databases became common. Traditionally, any violation (a data type mismatch, a constraint violation) aborted the entire import even when only a single row was bad, so the correct records were rejected along with the incorrect ones.

The challenge is not only to load as much correct data as possible, but also to keep an audit trail with detailed error descriptions for later manual processing, rather than simply "rolling everything back".

Solution: some DBMSs provide built-in mechanisms for loading with bad-row separation (e.g., SQL Server via BULK INSERT error options or SSIS error outputs, Oracle via SQL*Loader bad files and external tables; PostgreSQL typically relies on external loaders or staging). As a universal approach, data is loaded into a staging table and validated after the load (using OUTER JOIN and NOT EXISTS checks), with problematic rows written to dedicated "log tables".
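
Where the DBMS supports it, the built-in route can be enough on its own. A minimal sketch, assuming SQL Server: BULK INSERT accepts MAXERRORS and ERRORFILE options, so the load continues past malformed rows and writes them to a reject file (the path, terminators, and table name below are hypothetical). Note that this catches rows that fail parsing or type conversion; business-rule validation still needs the staging approach shown next.

BULK INSERT dbo.payments
FROM 'C:\imports\payments.csv'
WITH (
    FIELDTERMINATOR = ',',        -- column separator in the source file
    ROWTERMINATOR = '\n',         -- row separator
    FIRSTROW = 2,                 -- skip the header line
    MAXERRORS = 1000,             -- keep loading after bad rows, up to this limit
    ERRORFILE = 'C:\imports\payments_errors.log'  -- rejected rows land here
);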

Example code:

-- 1. Load everything into a temporary staging table
BULK INSERT staging_payments FROM 'payments.csv' WITH (...);

-- 2. Insert the correct entries
INSERT INTO payments (id, amount, ...)
SELECT id, amount, ...
FROM staging_payments
WHERE amount > 0 AND status IN ('approved', 'pending');

-- 3. Log erroneous rows into a separate table
INSERT INTO import_errors (row_data, error_desc)
SELECT CONCAT(id, ';', amount, ';', status), 'Incorrect status or negative amount'
FROM staging_payments
WHERE amount <= 0 OR status NOT IN ('approved', 'pending');
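
A practical design note (an assumption, not part of the original example): the staging table is usually declared with permissive column types, so even malformed values load successfully and can be validated in SQL afterwards.

-- Hypothetical staging table: permissive types so bad values still load
CREATE TABLE staging_payments (
    id     VARCHAR(50),
    amount VARCHAR(50),   -- validated later via a safe cast
    status VARCHAR(50)
);

With such types, the comparisons above become safe casts, e.g. TRY_CAST(amount AS DECIMAL(12,2)) > 0 instead of amount > 0.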

Key features:

  • Retains information about problematic rows (they are logged in a separate table).
  • Enables fast bulk loading of the "clean" data, with errors handled separately.
  • Error logging can be extended with error categories and additional context (see the sketch below).
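
A minimal sketch of such an extension, assuming hypothetical error_type and logged_at columns on the log table (everything else reuses the names from the example):

-- Categorize each failure so corrections can be automated per error type
INSERT INTO import_errors (row_data, error_type, error_desc, logged_at)
SELECT CONCAT(id, ';', amount, ';', status),
       CASE
           WHEN TRY_CAST(amount AS DECIMAL(12,2)) IS NULL THEN 'PARSE'
           WHEN TRY_CAST(amount AS DECIMAL(12,2)) <= 0    THEN 'RANGE'
           ELSE 'STATUS'
       END,
       'Failed validation during payments import',
       SYSDATETIME()
FROM staging_payments
WHERE TRY_CAST(amount AS DECIMAL(12,2)) IS NULL
   OR TRY_CAST(amount AS DECIMAL(12,2)) <= 0
   OR status NOT IN ('approved', 'pending');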

Trick Questions.

Is it possible to handle everything with ON ERROR RESUME NEXT inside SQL?

The SQL standard has no such construct for bulk import statements. In most cases, error reporting is produced by external tools or written to log tables, as described above.

Can a transaction be used to roll back only erroneous rows?

Rolling back a transaction undoes all of its rows at once (atomicity). Partial saving is only possible by first loading into a staging table, after which the successful rows are committed and the unsuccessful rows are logged (see the sketch below).
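
A minimal sketch of that pattern, reusing the names from the example above: one explicit transaction around the split, so the clean rows and the error log are committed together or not at all.

BEGIN TRANSACTION;

-- Commit the valid rows...
INSERT INTO payments (id, amount, status)
SELECT id, amount, status
FROM staging_payments
WHERE amount > 0 AND status IN ('approved', 'pending');

-- ...and the error log in the same unit of work
INSERT INTO import_errors (row_data, error_desc)
SELECT CONCAT(id, ';', amount, ';', status), 'Incorrect status or negative amount'
FROM staging_payments
WHERE amount <= 0 OR status NOT IN ('approved', 'pending');

COMMIT TRANSACTION;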

Why is it inefficient to catch every INSERT error through TRY/CATCH in a loop for each row?

Sequential processing of millions of rows with individual INSERT statements wrapped in TRY/CATCH causes catastrophic performance degradation compared to batch or BULK INSERT loading followed by post-load validation (a sketch of this anti-pattern follows).
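
For contrast, a sketch of the slow variant in T-SQL (names reuse the example above; this is the anti-pattern, shown only to illustrate the per-row overhead):

DECLARE @id VARCHAR(50), @amount VARCHAR(50), @status VARCHAR(50);
DECLARE payment_cursor CURSOR FOR
    SELECT id, amount, status FROM staging_payments;

OPEN payment_cursor;
FETCH NEXT FROM payment_cursor INTO @id, @amount, @status;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        -- One statement per row: constant logging and lock churn
        INSERT INTO payments (id, amount, status)
        VALUES (@id, @amount, @status);
    END TRY
    BEGIN CATCH
        INSERT INTO import_errors (row_data, error_desc)
        VALUES (CONCAT(@id, ';', @amount, ';', @status), ERROR_MESSAGE());
    END CATCH;
    FETCH NEXT FROM payment_cursor INTO @id, @amount, @status;
END;

CLOSE payment_cursor;
DEALLOCATE payment_cursor;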

Common Errors and Anti-Patterns

  • Bulk loading without prior validation or a staging table rolls back the entire load on the first error encountered.
  • Running many individual INSERTs in a loop with per-row error handling severely slows the load and worsens locking.
  • Logging errors only to an external file rather than to a table complicates auditing and automated correction.

Real-Life Example

Negative Case

A company was loading tens of thousands of rows via an INSERT script; on hitting a unique-key violation, the entire batch was rolled back, causing downtime and data loss.

Pros:

  • Safety (no incorrect data is loaded).

Cons:

  • Lost time, no partial saving, manual correction of the entire file.

Positive Case

Using a staging table, error logging, and categorization of incoming data into "valid/invalid" made imports faster, while failures were retained and analyzed automatically.

Pros:

  • High performance, support for pre- and post-processing, transparent error reporting.

Cons:

  • Requires log tables and additional queries to support the process.