The need to log erroneous rows during bulk imports arose almost as soon as large volumes of data began to be loaded into databases automatically. Traditionally, any constraint violation (e.g., a data type mismatch or a check/unique constraint) aborted the entire import, even for a single bad row — so both correct and incorrect records were lost.
The challenge is not only to load as much correct data as possible, but also to keep an audit trail with detailed error descriptions for later manual processing, rather than simply rolling everything back.
Solution: some DBMSs provide built-in mechanisms for loading with separation of bad rows (e.g., SQL Server with SSIS, Oracle with SQL*Loader and external tables; PostgreSQL typically requires workarounds). As a universal approach, staging tables and post-load validation (using OUTER JOIN and NOT EXISTS) are used, together with dedicated "log tables" into which problematic rows are written explicitly.
Example code:
-- 1. Load everything into a temporary staging table
BULK INSERT staging_payments
FROM 'payments.csv'
WITH (...);

-- 2. Move correct rows into the target table
INSERT INTO payments (id, amount, ...)
SELECT id, amount, ...
FROM staging_payments
WHERE amount > 0
  AND status IN ('approved', 'pending');

-- 3. Log erroneous rows into a separate table
INSERT INTO import_errors (row_data, error_desc)
SELECT *, 'Incorrect status or negative amount'
FROM staging_payments
WHERE amount <= 0
   OR status NOT IN ('approved', 'pending');
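The same three-step pattern can be sketched end to end in Python with the standard sqlite3 module. Table names follow the example above; the schema and sample rows are made up for illustration:

```python
import sqlite3

# Illustrative minimal schema matching the SQL example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_payments (id INTEGER, amount REAL, status TEXT);
    CREATE TABLE payments         (id INTEGER, amount REAL, status TEXT);
    CREATE TABLE import_errors    (row_data TEXT, error_desc TEXT);
""")

# Step 1: load everything into staging, no validation yet.
rows = [(1, 100.0, "approved"), (2, -5.0, "approved"), (3, 20.0, "unknown")]
conn.executemany("INSERT INTO staging_payments VALUES (?, ?, ?)", rows)

# Step 2: move only valid rows into the target table.
conn.execute("""
    INSERT INTO payments
    SELECT id, amount, status FROM staging_payments
    WHERE amount > 0 AND status IN ('approved', 'pending')
""")

# Step 3: log the rest with a human-readable reason.
conn.execute("""
    INSERT INTO import_errors (row_data, error_desc)
    SELECT id || ',' || amount || ',' || status,
           'Incorrect status or negative amount'
    FROM staging_payments
    WHERE amount <= 0 OR status NOT IN ('approved', 'pending')
""")
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0])       # 1
print(conn.execute("SELECT COUNT(*) FROM import_errors").fetchone()[0])  # 2
```

Nothing is lost: one valid row lands in payments, the two bad rows land in import_errors with a reason attached.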
Frequently asked questions:
Is it possible to handle everything with ON ERROR RESUME NEXT inside SQL?
The SQL standard has no such construct for bulk INSERT operations. In practice, error reports are produced either by external ETL tools or via log tables, as described above.
Can a transaction be used to roll back only erroneous rows?
Rolling back a transaction undoes all of its rows at once (atomicity). Partial saving is possible only by loading into a staging table first, after which the valid rows are explicitly committed and the invalid ones are logged.
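A minimal sketch of that atomicity, using Python's sqlite3 (schema and rows are illustrative): one duplicate key in the batch discards the valid rows as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 10.0), (1, 20.0), (2, 30.0)]  # duplicate key in the middle

# A single transaction is all-or-nothing: the duplicate key aborts
# the batch, and the rollback discards the valid rows too.
try:
    with conn:  # opens a transaction; rolls back on exception
        conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # 0 — nothing survived, including the correct rows
```

This is exactly why the staging-table approach exists: validation happens before the final, clean INSERT, so the transaction on the target table never has to fail.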
Why is it inefficient to catch every INSERT error through TRY/CATCH in a loop for each row?
Processing millions of rows one by one — each as an individual INSERT wrapped in TRY/CATCH — causes catastrophic performance degradation compared with batch loading (BULK INSERT) followed by post-load validation.
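The contrast can be sketched with Python's sqlite3; the CHECK constraint stands in for any server-side validation, and the data set is synthetic:

```python
import sqlite3
import time

# Synthetic batch: ids 0..9999; amounts cycle -1.0 .. 98.0,
# so 2% of rows fail the amount > 0 rule.
N = 10_000
rows = [(i, float(i % 100 - 1)) for i in range(N)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_row (id INTEGER, amount REAL CHECK (amount > 0))")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE t_set (id INTEGER, amount REAL)")

# Anti-pattern: one INSERT plus one exception handler per row.
t0 = time.perf_counter()
for r in rows:
    try:
        conn.execute("INSERT INTO t_row VALUES (?, ?)", r)
    except sqlite3.IntegrityError:
        pass  # the per-row equivalent of TRY/CATCH
row_time = time.perf_counter() - t0

# Set-based pattern: bulk-load staging, then filter in one statement.
t0 = time.perf_counter()
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
conn.execute("INSERT INTO t_set SELECT id, amount FROM staging WHERE amount > 0")
set_time = time.perf_counter() - t0

n_row = conn.execute("SELECT COUNT(*) FROM t_row").fetchone()[0]
n_set = conn.execute("SELECT COUNT(*) FROM t_set").fetchone()[0]
print(n_row, n_set)  # same number of valid rows either way
print(f"per-row: {row_time:.3f}s, set-based: {set_time:.3f}s")
```

Both approaches keep the same valid rows; the per-row loop simply pays a round-trip and exception-handling cost for every single record, which grows unbearable at millions of rows.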
A company was loading tens of thousands of rows with an INSERT script; when a unique-key violation occurred, the entire batch was rolled back. This caused downtime and data loss.
Result:
Switching to a staging table, error logging, and splitting incoming data into "valid/invalid" sped up imports while retaining failed rows for automatic analysis.