ProgrammingData Engineer

How to programmatically implement updating only changed data when synchronizing two tables in different databases?

Pass interviews with Hintsage AI assistant

Answer.

Since the emergence of data integration tasks between systems, the question of programmatic synchronization often arises: it is necessary to update only the rows that have actually changed, saving resources and minimizing locks. The standard approach in the past consisted of complete replacement or manual difference searching, which wasted extra operations and led to errors.

Problem: when synchronizing, it is necessary to detect and update only the changed data, not everything in general. This is important for performance, network traffic, and integrity, especially with large volumes or distributed databases.

Solution: typically, a comparison of the source and target tables is implemented using JOIN operators, differences are calculated via HASH or checksums, and only the "divergent" rows are updated using MERGE (or UPSERT) or special triggers.

Example code (for SQL Server):

MERGE target_table AS t USING source_table AS s ON t.id = s.id WHEN MATCHED AND (t.name <> s.name OR t.value <> s.value) THEN UPDATE SET t.name = s.name, t.value = s.value WHEN NOT MATCHED BY TARGET THEN INSERT (id, name, value) VALUES (s.id, s.name, s.value);

Key features:

  • Use of MERGE/UPSERT without re-updating unchanged rows
  • Ability to match not only by PK but also by other unique fields
  • Check only the columns that have actually changed values (via condition)

Trick Questions.

Does MERGE always guarantee the absence of race conditions during concurrent updates?

No, without additional transaction control, race conditions may occur. For maximum accuracy, use the appropriate transaction isolation level.

Can UPDATE ... JOIN be used instead of MERGE for complex synchronization?

Often yes, but this approach will not allow adding new rows to the target table. Only MERGE/UPSERT updates and adds.

UPDATE t SET t.name = s.name, t.value = s.value FROM target_table t JOIN source_table s ON t.id = s.id WHERE t.name <> s.name OR t.value <> s.value;

What to do if the synchronized tables have differing sets of columns?

It is recommended to explicitly list the matching fields and not try to update "always all" fields. If necessary, use mapping tables or transformations.

Typical Mistakes and Anti-Patterns

  • Updating all rows sequentially (even unchanged ones)
  • Lack of checks for the existence of new rows (inserts)
  • Lack of transaction control, leading to double updates

Real-Life Example

Negative Case

An ETL engineer “loads” data using a simple UPDATE of all records by key: a table with tens of millions of rows, the process takes hours and blocks the server. There is no separate insertion of new rows, fresh data is lost.

Pros:

  • Simple code

Cons:

  • Slow, does not save resources, risk of losing new data

Positive Case

A specialist calculates checksums of rows before loading, filters changed records, uses MERGE for fast synchronization and additional new row insertion. The process takes a few minutes without overload.

Pros:

  • High speed
  • Minimization of locks
  • Correctness of changing only really different data

Cons:

  • Requires some design (hashes, identifiers)