Ensuring data integrity in this scenario requires implementing a Change Data Capture (CDC) mechanism combined with continuous reconciliation processes. You must establish a baseline data snapshot using checksum validation and hash comparisons to identify the current state before migration begins. During the transition, deploy Kafka Connect or Debezium to stream real-time changes from the legacy ERP database transaction logs into the new event-driven system.
Implement a Saga pattern for distributed transaction management to handle failures gracefully without corrupting data across services. Finally, run parallel ETL jobs using Apache Spark or Databricks to perform nightly reconciliation between source and target systems, generating discrepancy reports for manual review until confidence reaches 99.99%.
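The Saga pattern mentioned above can be sketched as an orchestrator that runs each local transaction in order and, when one fails, executes the compensations of the completed steps in reverse. This is a minimal illustration, not the production implementation; the step structure is an assumption.

```python
class SagaStep:
    """One local transaction plus the compensating action that undoes it."""

    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # forward local transaction
        self.compensation = compensation  # undo, run only if a later step fails


def run_saga(steps):
    """Execute steps in order; on failure, compensate completed steps in
    reverse order so no service is left with partially applied changes."""
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False  # saga rolled back
    return True  # saga committed
```

A typical inventory saga would chain steps like "reserve stock" and "charge payment"; if the payment step throws, the stock reservation's compensation releases the hold rather than leaving inventory corrupted.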
I worked with a global retail chain migrating their inventory management from a 15-year-old Oracle ERP monolith to a microservices ecosystem using Apache Kafka and PostgreSQL. The ERP system had been modified by multiple vendors over the years, resulting in orphaned records and missing audit trails for approximately 30% of historical stock movements. The business operated 24/7 across time zones, meaning any downtime would cost $2M per hour in lost sales.
The data integrity challenge was severe because stock levels needed to remain accurate to prevent overselling, yet we couldn't pause operations to perform a clean cutover.
Solution 1: Implement Debezium CDC with real-time streaming
This approach involved configuring Debezium's Oracle connector to read the database's redo logs via LogMiner and publish every insert, update, and delete as an event to Kafka topics. The pros included near real-time synchronization with sub-second latency and minimal impact on legacy database performance. However, the cons were significant: CDC could not reconcile the existing data gaps from the missing historical audits, and schema changes in the legacy system required constant connector reconfiguration, creating maintenance overhead.
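A connector registration payload might look like the following sketch. The property names follow the Debezium Oracle connector documentation, but the hostnames, credentials, and table list are placeholders, and exact property names vary between Debezium versions, so verify against the version you deploy.

```python
def debezium_oracle_config(name, host, dbname, tables):
    """Build a Kafka Connect registration payload for Debezium's Oracle
    connector (LogMiner-based capture). All values are placeholders."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.oracle.OracleConnector",
            "database.hostname": host,
            "database.port": "1521",
            "database.user": "dbz_user",  # assumed capture user with LogMiner grants
            "database.password": "${file:/secrets/dbz.properties:password}",
            "database.dbname": dbname,
            "topic.prefix": "erp",  # Debezium 2.x topic naming
            "table.include.list": ",".join(tables),
            "log.mining.strategy": "online_catalog",
        },
    }


payload = debezium_oracle_config(
    "erp-inventory", "erp-db.internal", "ERPPROD", ["INV.STOCK_MOVEMENTS"]
)
# POST this payload as JSON to the Kafka Connect REST API at /connectors
```

The schema-change maintenance burden noted above shows up here concretely: every change to `table.include.list` or to a captured table's structure means re-validating and often re-registering this configuration.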
Solution 2: Deploy a Strangler Fig pattern with API interception
We considered building an abstraction layer using GraphQL federation that would write to both the old ERP and new microservices simultaneously, gradually migrating read traffic. The pros included the ability to validate new system accuracy against the old system in production and instant rollback capability if discrepancies appeared. The cons included doubled infrastructure costs, increased latency for write operations, and the complexity of maintaining data consistency across two different storage models (relational vs event sourcing).
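The dual-write-and-validate idea can be sketched as a thin facade in front of both backends. The backend interfaces here are hypothetical stand-ins for the ERP and microservice clients; the real layer used GraphQL federation rather than a single in-process class.

```python
class DualWriteInventory:
    """Write to both legacy and new backends; serve reads from the legacy
    system of record while recording divergence for reconciliation review."""

    def __init__(self, legacy, modern):
        self.legacy = legacy
        self.modern = modern
        self.discrepancies = []  # (sku, legacy_qty, modern_qty) tuples

    def set_stock(self, sku, qty):
        self.legacy.set_stock(sku, qty)  # legacy remains authoritative
        self.modern.set_stock(sku, qty)  # shadow write to the new system

    def get_stock(self, sku):
        old = self.legacy.get_stock(sku)
        new = self.modern.get_stock(sku)
        if old != new:
            self.discrepancies.append((sku, old, new))
        return old  # serve the authoritative value until cutover
```

This captures both listed pros: production traffic continuously validates the new system (via `discrepancies`), and rollback is trivial because reads never left the legacy system.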
Solution 3: Create a bulk ETL approach with maintenance windows
This traditional method proposed using Apache Airflow to schedule large batch transfers during low-traffic hours, performing full table comparisons with MD5 hashes. The pros included thorough validation of every record and simpler error handling for bulk operations. The chief con was that it directly violated the zero-downtime requirement: the ERP system needed read locks to produce consistent snapshots, potentially blocking inventory updates for 4-6 hours during peak reconciliation periods.
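The MD5 full-table comparison step might look like the following sketch; the row layout and primary-key position are illustrative, and in practice the hashing would run inside the batch jobs rather than in memory.

```python
import hashlib


def row_md5(row):
    """MD5 over a canonical string form of the row's values."""
    canonical = "|".join("" if v is None else str(v) for v in row)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def table_digest(rows, key_index=0):
    """Map primary key -> row hash for one consistent snapshot."""
    return {row[key_index]: row_md5(row) for row in rows}


def compare_tables(source_rows, target_rows):
    """Return (missing_keys, mismatched_keys) between two snapshots."""
    src, tgt = table_digest(source_rows), table_digest(target_rows)
    missing = sorted(k for k in src if k not in tgt)
    mismatched = sorted(k for k in src if k in tgt and src[k] != tgt[k])
    return missing, mismatched
```

The need for each snapshot to be internally consistent is exactly why this approach required read locks: hashing rows while the table changes underneath produces false discrepancies.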
Chosen solution and reasoning
We selected a hybrid approach combining Solution 1 (Debezium CDC) for ongoing synchronization with a modified Solution 2 for historical backfill. We used Kafka Streams to process real-time changes while running Spark jobs during off-peak hours to backfill and validate the 30% of records with audit gaps. This choice balanced the need for continuous operation with the requirement for complete data accuracy, accepting the higher infrastructure cost as less expensive than potential downtime.
Result
The migration completed over six weeks with zero unplanned downtime. The reconciliation process identified and corrected 12,000 inventory discrepancies before they impacted customers. Prometheus dashboards tracked lag metrics, ensuring CDC latency remained under 500ms. After three months of parallel running with automated reconciliation showing 99.97% accuracy, we decommissioned the ERP module, saving the company $4M annually in licensing fees while maintaining inventory accuracy above 99.9%.
How do you handle schema evolution in event-driven architectures when events are immutable and downstream consumers depend on specific field structures?
Candidates often suggest simply updating the event schema, but this breaks the immutability principle fundamental to event sourcing. The correct approach involves implementing the Schema Registry pattern using Confluent Schema Registry or Apicurio. You must use schema versioning with backward and forward compatibility strategies: backward compatibility allows new consumers to read old events, while forward compatibility lets old consumers read new events. When breaking changes are unavoidable, you should implement the Event Upcasting pattern, where a separate translation layer transforms old event formats into the new domain model as they are read from the event store. This maintains the immutable audit trail while allowing the domain model to evolve, though it adds complexity to consumer logic and requires careful management of schema evolution policies.
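Upcasting can be sketched as a chain of version-to-version transforms applied as events are read, leaving the stored events untouched. The event shapes and field names below are hypothetical examples of a breaking change.

```python
def upcast_v1_to_v2(event):
    """Hypothetical breaking change: v1 stored a single 'quantity' field,
    v2 splits it into 'on_hand' and 'reserved'."""
    new = dict(event, schema_version=2)   # copy; never mutate the stored event
    new["on_hand"] = new.pop("quantity")
    new["reserved"] = 0                   # no reservation data existed in v1
    return new


UPCASTERS = {1: upcast_v1_to_v2}  # from-version -> transform


def upcast(event, target_version=2):
    """Apply upcasters in sequence until the event reaches the current
    schema version, so consumers only ever see the latest shape."""
    while event.get("schema_version", 1) < target_version:
        event = UPCASTERS[event.get("schema_version", 1)](event)
    return event
```

The chain structure matters: when v3 arrives, you add a single v2-to-v3 upcaster instead of rewriting history, which is how the immutable audit trail survives repeated schema evolution.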
What are the specific implications of the CAP Theorem on data consistency decisions during zero-downtime migrations, and how do you communicate trade-offs to non-technical stakeholders?
Many candidates mention the CAP Theorem but fail to apply it practically to migration scenarios. The theorem states that when a network partition occurs, a distributed system must choose between Consistency and Availability; because a zero-downtime migration spans multiple systems over an unreliable network, partitions must be assumed, so you typically sacrifice immediate consistency for availability and implement eventual consistency instead. To communicate this to business stakeholders, avoid technical terms like "CAP" or "ACID"; instead, explain that during the transition, different systems might briefly show different inventory counts, but they will align within minutes. Use concrete examples: "A customer might see an item available on the website but get an 'out of stock' message at checkout for approximately 30 seconds while systems synchronize." This sets realistic expectations about "consistency windows" and helps stakeholders understand why you need reconciliation processes rather than real-time perfection.
How do you calculate the acceptable financial cost of temporary data inconsistency versus the cost of delaying a migration deadline, and what metrics define the break-even point?
Candidates frequently miss the quantitative risk analysis aspect of migrations. You must calculate the Cost of Inconsistency (COI) from historical error rates and business impact: multiply the average daily transaction volume by the error probability by the average cost per error (including customer service time, refunds, and reputation damage). Compare this against the Cost of Delay (COD), which includes ongoing legacy licensing fees, missed market opportunities, and technical team morale and turnover costs. The break-even point occurs where COI × tolerated inconsistency duration = COD × delay duration. For example, if data inconsistencies cost $5,000 daily and delay costs $50,000 daily, each day of delay avoided justifies tolerating up to 10 days of reconciliation issues. You should establish Service Level Objectives (SLOs) such as "reconciliation lag under 0.1% of records" and define automatic rollback triggers if error rates exceed historical baselines by more than 3 standard deviations.
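The break-even arithmetic above can be made concrete. The dollar figures are the ones from the example; the formulas assume costs accrue linearly per day, which is a simplification.

```python
def cost_of_inconsistency(daily_txns, error_rate, cost_per_error):
    """COI = daily transaction volume x error probability x cost per error."""
    return daily_txns * error_rate * cost_per_error


def breakeven_ratio(coi_daily, cod_daily):
    """Days of tolerable inconsistency per day of delay avoided:
    inconsistency stays cheaper while coi_daily * d_inc < cod_daily * d_delay."""
    return cod_daily / coi_daily


# Example from the text: 100k daily transactions (assumed), 0.1% error rate
# (assumed), $50 per error (assumed) -> $5,000/day COI vs $50,000/day COD.
coi = cost_of_inconsistency(daily_txns=100_000, error_rate=0.001, cost_per_error=50)
ratio = breakeven_ratio(coi_daily=coi, cod_daily=50_000)
```

With these inputs, each day of delay avoided justifies up to ten days of tolerated reconciliation work, which is the kind of explicit number that turns a gut-feel scheduling argument into a defensible decision.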