Programming / Data Engineer

How do you organize a bulk insert of data from a file into an SQL table to achieve maximum performance while guaranteeing data correctness? Which tools should be used in different DBMSs, and what error-control nuances are there?


Answer.

To import large volumes of data in bulk, DBMSs provide specialized commands and utilities: BULK INSERT (SQL Server), COPY (PostgreSQL), LOAD DATA INFILE (MySQL), external tools such as bcp (SQL Server) and psql's \copy (PostgreSQL), as well as ETL utilities.
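For MySQL, a minimal LOAD DATA INFILE sketch might look like this (the table, file path, and column names are hypothetical; IGNORE 1 LINES skips a header row):

```sql
-- Load a CSV file into a staging table; the server must have
-- FILE privilege and secure_file_priv must permit this path.
LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE staging_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, name, age);
```

Note that by default LOAD DATA converts some errors to warnings and keeps going, so check SHOW WARNINGS after the load.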

Key points:

  • Use plain CSV/TXT formats without unnecessary transformations.
  • Disable triggers and drop secondary indexes for the duration of the import when row-by-row verification is not required; this significantly increases speed.
  • Validate the data and referential integrity BEFORE and/or AFTER the import.
  • Import into a staging (temporary) table, validate, then bulk-insert into the main table.
  • Split the data into batches if the tool supports it.
  • Check return codes and logs: bulk loaders may silently skip failed rows.
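In SQL Server, batching and error control from the list above can be combined in one BULK INSERT statement; this is a sketch with hypothetical table and file paths (BATCHSIZE commits every N rows, MAXERRORS = 0 aborts on the first bad row, and ERRORFILE captures the rejected rows for inspection):

```sql
BULK INSERT staging_table
FROM 'C:\data\data.csv'
WITH (
    FIELDTERMINATOR = ',',   -- column separator in the file
    ROWTERMINATOR   = '\n',  -- row separator
    FIRSTROW        = 2,     -- skip the header row
    BATCHSIZE       = 50000, -- commit in 50k-row batches
    MAXERRORS       = 0,     -- fail fast instead of skipping rows
    ERRORFILE       = 'C:\data\data_errors.log'
);
```

Raising MAXERRORS trades strictness for throughput: bad rows go to the error file while the rest of the load continues, so the error file must then be reviewed rather than ignored.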

Example for PostgreSQL:

COPY staging_table (id, name, age)
FROM '/path/to/data.csv'
DELIMITER ',' CSV HEADER;

-- After validation, transfer to the production table
INSERT INTO prod_table (id, name, age)
SELECT id, name, age
FROM staging_table
WHERE age >= 0 AND name IS NOT NULL;

Trick question.

Question: Why might there be a sharp drop in performance of subsequent operations after a bulk insert into a large indexed table?

Answer: Because every inserted row must also update each secondary index, a massive load inflates and fragments the index structures, and planner statistics become stale, so the optimizer may choose poor plans for subsequent queries until statistics are refreshed. It is recommended to drop secondary indexes before the import and recreate them afterwards, split the load into batches, and update statistics (ANALYZE / UPDATE STATISTICS) when the load completes.
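The drop-load-recreate pattern can be sketched as follows in PostgreSQL syntax (the orders table and index name are hypothetical):

```sql
-- 1. Remove the secondary index before the heavy load
DROP INDEX IF EXISTS idx_orders_customer;

-- 2. Run the bulk load here (COPY / BULK INSERT / LOAD DATA ...)

-- 3. Rebuild the index once, instead of maintaining it per row
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- 4. Refresh planner statistics so subsequent queries get good plans
ANALYZE orders;
```

The primary key index normally stays in place, since it also enforces uniqueness; only non-constraint secondary indexes are safe to drop and rebuild.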


Story

In a logistics project, millions of rows were loaded via BULK INSERT without a staging table. Invalid rows made it into the indexed production table, and because of FK and CHECK constraints it was impossible to simply roll back just the bad rows; the data had to be cleaned up manually.


Story

In a corporate service, a bulk insert took 10 times longer than expected because the secondary indexes were not disabled during the upload, and the index structures were updated on every insert.


Story

In a fintech product, when loading large files, bulk insert silently dropped some rows because error codes were not handled correctly; part of the important data was lost and was discovered only several days later, after reconciliation with external sources.