Ever since systems began exchanging data, the question of programmatic synchronization has come up regularly: only the rows that have actually changed should be updated, which saves resources and minimizes locking. The traditional approaches were full replacement or manual difference detection, both of which wasted operations and invited errors.
Problem: during synchronization, only the changed data should be detected and updated, rather than the entire dataset. This matters for performance, network traffic, and integrity, especially with large volumes or distributed databases.
Solution: typically the source and target tables are compared with JOINs, differences are detected via hashes or checksums, and only the divergent rows are updated using MERGE (or UPSERT) or dedicated triggers.
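As a sketch of the checksum idea: in SQL Server, HASHBYTES can flag changed rows without spelling out every column in the comparison predicate (the table and column names below are illustrative, matching the example further down):

```sql
-- Sketch: detecting changed rows via per-row hashes (SQL Server).
-- CONCAT treats NULL as an empty string, so a separator is added to avoid
-- accidental collisions between adjacent columns. Names are illustrative.
SELECT s.id
FROM source_table AS s
JOIN target_table AS t
  ON t.id = s.id
WHERE HASHBYTES('SHA2_256', CONCAT(s.name, '|', s.value))
   <> HASHBYTES('SHA2_256', CONCAT(t.name, '|', t.value));
```

For wide tables this keeps the predicate short, at the cost of computing two hashes per row; persisting the hash in a computed column shifts that cost to write time.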
Example code (for SQL Server):
MERGE target_table AS t
USING source_table AS s
    ON t.id = s.id
WHEN MATCHED AND (t.name <> s.name OR t.value <> s.value) THEN
    UPDATE SET t.name = s.name, t.value = s.value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, name, value) VALUES (s.id, s.name, s.value);
Frequently asked questions:
Does MERGE always guarantee the absence of race conditions during concurrent updates?
No. Without additional transaction control, race conditions can occur: a concurrent session may insert or modify a row between the match check and the write. For correct concurrent behavior, run the MERGE under an appropriate transaction isolation level.
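A common mitigation in SQL Server is the HOLDLOCK table hint, which makes the MERGE take serializable key-range locks for the duration of the statement; a minimal sketch, reusing the tables from the example above:

```sql
-- Sketch: serializing a MERGE against concurrent writers (SQL Server).
-- WITH (HOLDLOCK) is equivalent to the SERIALIZABLE hint: key-range locks
-- are held until the statement completes, so another session cannot insert
-- a matching row between the match check and the INSERT branch.
MERGE target_table WITH (HOLDLOCK) AS t
USING source_table AS s
    ON t.id = s.id
WHEN MATCHED AND (t.name <> s.name OR t.value <> s.value) THEN
    UPDATE SET t.name = s.name, t.value = s.value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, name, value) VALUES (s.id, s.name, s.value);
```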
Can UPDATE ... JOIN be used instead of MERGE for complex synchronization?
Often yes, but UPDATE ... JOIN alone cannot add new rows to the target table; only MERGE/UPSERT both updates and inserts.
UPDATE t
SET t.name = s.name, t.value = s.value
FROM target_table t
JOIN source_table s ON t.id = s.id
WHERE t.name <> s.name OR t.value <> s.value;
What to do if the synchronized tables have differing sets of columns?
It is better to explicitly list the columns that correspond to each other rather than trying to update "all columns, always". Where needed, use mapping tables or transformations.
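For example, when the source uses different column names, the mapping can be made explicit in a derived table inside the USING clause; a sketch in which full_name and amount are hypothetical source columns renamed to match the target:

```sql
-- Sketch: explicit column mapping when source and target schemas differ.
-- The source exposes full_name/amount; the target stores name/value.
-- All names here are hypothetical.
MERGE target_table AS t
USING (
    SELECT src.id,
           src.full_name AS name,   -- rename to the target's column
           src.amount    AS value
    FROM   source_table AS src
) AS s
    ON t.id = s.id
WHEN MATCHED AND (t.name <> s.name OR t.value <> s.value) THEN
    UPDATE SET t.name = s.name, t.value = s.value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, name, value) VALUES (s.id, s.name, s.value);
```

Columns that exist only on one side are simply left out of the SELECT, the UPDATE SET list, and the INSERT list.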
An ETL engineer "loads" data with a blind UPDATE of all records by key: the table holds tens of millions of rows, the process runs for hours and blocks the server. New rows are never inserted, so fresh data is lost.
Pros:
- trivially simple logic: a single UPDATE statement by key.
Cons:
- every row is rewritten whether or not it changed, so the load runs for hours and blocks the server;
- new rows are never inserted, so fresh data is silently lost.
A specialist computes per-row checksums before loading, filters out the unchanged records, and uses MERGE both to update the changed rows and to insert the new ones. The process finishes in a few minutes without overloading the server.
Pros:
- only the changed rows are touched, so the load completes in minutes without overloading the server;
- MERGE also inserts the new rows, so no fresh data is lost.
Cons:
- more complex to implement and maintain: the checksums must be computed and kept consistent with the set of compared columns.