Business Analysis / Business Analyst

What methodology would you employ to establish a single source of truth for customer master data across five acquired subsidiaries, each operating on different **CRM** platforms with incompatible data schemas, when regulatory pressure mandates consolidated reporting within six months but the board prohibits any operational disruption to existing revenue streams?

Pass interviews with Hintsage AI assistant

Answer to the question

Establishing a single source of truth in post-merger scenarios requires a Domain-Driven Design approach to data governance rather than immediate physical consolidation. Implement a federated Master Data Management (MDM) architecture using an event-driven replication strategy, where Change Data Capture (CDC) mechanisms stream modifications from each subsidiary's CRM into a central Apache Kafka cluster. This creates a "golden record" repository through incremental convergence, allowing legacy systems to remain operational while the canonical model matures.
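The incremental-convergence idea can be made concrete with a minimal sketch of how CDC events fold into a golden record. This is an illustrative assumption, not the actual MDM implementation: the field names (`source`, `source_id`, `email`) and the survivorship rule (newest non-null value per field wins) are hypothetical.

```python
# Sketch: incremental golden-record convergence from CDC events.
# Survivorship rule assumed here: most recent non-null value per field wins.
from dataclasses import dataclass, field

@dataclass
class ChangeEvent:
    source: str        # e.g. "salesforce", "dynamics" (illustrative)
    source_id: str     # the source system's native key
    timestamp: int     # CDC commit timestamp
    payload: dict      # changed customer attributes

@dataclass
class GoldenRecord:
    attributes: dict = field(default_factory=dict)   # field -> value
    provenance: dict = field(default_factory=dict)   # field -> (source, ts)

def apply_event(record: GoldenRecord, event: ChangeEvent) -> GoldenRecord:
    """Merge one CDC event: a field is overwritten only by a newer value."""
    for key, value in event.payload.items():
        if value is None:
            continue
        _, last_ts = record.provenance.get(key, (None, -1))
        if event.timestamp >= last_ts:
            record.attributes[key] = value
            record.provenance[key] = (event.source, event.timestamp)
    return record
```

Because each event only updates fields it carries, legacy systems can keep writing independently while the canonical record converges field by field.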

Deploy a Strangler Fig Pattern via an API Gateway that intercepts customer data requests, routing reads to the emerging MDM hub while gradually migrating writes. This approach satisfies the six-month regulatory deadline by providing immediate reporting capabilities from the hub, while the board's zero-downtime constraint is met through asynchronous synchronization that never freezes source databases.
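The routing decision at the heart of the Strangler Fig gateway can be sketched in a few lines. The flag set and subsidiary names below are assumptions for illustration; a real gateway would hold this state in configuration, not code.

```python
# Sketch: Strangler Fig routing at the API gateway. Reads go to the emerging
# MDM hub; writes stay on each subsidiary's legacy CRM until that subsidiary
# is flagged as migrated. The flag set is a hypothetical example.
MIGRATED_WRITE_DOMAINS = {"west"}   # subsidiaries whose writes now hit the hub

def route(method: str, subsidiary: str) -> str:
    """Return the backend ("hub" or "legacy") for a customer-data request."""
    if method == "GET":
        return "hub"                    # all reads served from the golden record
    if subsidiary in MIGRATED_WRITE_DOMAINS:
        return "hub"                    # writes cut over one subsidiary at a time
    return "legacy"                     # un-migrated writes stay on the source CRM
```

Flipping one subsidiary into `MIGRATED_WRITE_DOMAINS` is the "gradual migration of writes" the pattern describes: no cutover window, just a routing change.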

Situation from life

Context. A private equity firm acquired five regional logistics companies, each operating a distinct CRM platform, to form a national carrier. The Western division used heavily customized Salesforce, the Midwest ran legacy Microsoft Dynamics 365 with proprietary plugins, the Southeast utilized SAP Sales Cloud, the Northeast depended on a custom Ruby on Rails application backed by MySQL, and the Southwest operated Zoho CRM with complex Zoho Creator extensions. Regulatory authorities mandated unified Customer Due Diligence (CDD) reporting for Anti-Money Laundering (AML) compliance within 180 days, while the board explicitly prohibited any operational downtime that would breach existing 99.9% uptime SLAs with Fortune 500 clients.

Problem. No common unique identifier existed across the five ecosystems; Salesforce used 18-character IDs, Dynamics employed GUIDs, and the custom Rails app relied on auto-incrementing integers. Data quality varied drastically, with some subsidiaries storing addresses as unstructured text while others maintained normalized schemas. A traditional extract-transform-load (ETL) batch migration would require freezing data during cutover, which was impossible given the 24/7 dispatch operations and contractual penalties for service interruptions.
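With no shared key across 18-character Salesforce IDs, Dynamics GUIDs, and auto-incrementing integers, the standard move is a cross-reference registry that mints a canonical ID and maps each source's native key onto it. The sketch below is a hypothetical in-memory version; names and IDs are illustrative.

```python
# Sketch: cross-reference (xref) registry. The hub assigns a canonical ID and
# maps each source system's native key onto it, since no common key exists.
import uuid

class CustomerXref:
    def __init__(self):
        self._by_source_key = {}   # (source, native_id) -> canonical id

    def canonical_id(self, source: str, native_id: str) -> str:
        """Return the canonical ID for a source key, minting one if unseen."""
        key = (source, native_id)
        if key not in self._by_source_key:
            self._by_source_key[key] = str(uuid.uuid4())
        return self._by_source_key[key]

    def link(self, source_a: str, id_a: str, source_b: str, id_b: str) -> None:
        """Record that two source keys refer to the same customer."""
        self._by_source_key[(source_b, id_b)] = self.canonical_id(source_a, id_a)
```

Matching logic (automated or manual stewardship) calls `link` once it decides two native records are the same customer; thereafter both keys resolve to one golden record.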

Solution 1: Big Bang Migration. This strategy proposed a comprehensive single weekend cutover where all five legacy systems would simultaneously export their customer datasets to a central Snowflake data warehouse. During this window, complex transformation logic would standardize schemas and deduplicate records before synchronizing the cleansed data to a new unified Salesforce instance. This approach promised immediate technical debt elimination but required complete system freezing during the migration window.

Pros: Immediate elimination of technical debt; simplified long-term maintenance; single vendor relationship for support.

Cons: Simultaneous risk exposure across all five revenue streams; catastrophic rollback complexity if synchronization failed; direct violation of the board's non-negotiable zero-downtime constraint; potential data loss if the 48-hour window proved insufficient for the 2+ million record datasets.

Verdict: Rejected due to unacceptable business continuity risks.

Solution 2: Virtual Data Federation Layer. This alternative proposed implementing middleware using Denodo or TIBCO Data Virtualization to create a real-time abstraction layer that aggregates data without physical consolidation. The virtualization layer would present unified views to reporting tools while keeping actual data in the source CRM platforms, effectively creating a logical data warehouse. While this avoids data movement, it relies entirely on network stability and source system availability for every query.

Pros: Zero operational disruption to existing user workflows; immediate compliance reporting capability; no retraining required for subsidiary staff.

Cons: Severe query performance degradation during peak morning dispatch periods due to cross-system joins; network latency between regions causing reporting timeouts; failure to resolve underlying data quality issues or duplicate customer records; creation of permanent technical debt rather than architectural resolution.

Verdict: Rejected as a permanent solution, though retained as a temporary compliance bridge for the first 90 days.

Solution 3: Incremental Domain-Based Consolidation with Event Sourcing. This hybrid approach establishes a central MDM hub using Informatica MDM, deploying CDC agents such as Debezium for MySQL and native streaming APIs for Salesforce and Dynamics. These agents stream all data modifications into an Apache Kafka cluster where Apache Spark MLlib performs probabilistic matching to identify duplicates across subsidiaries and create survivor records. The architecture uses an AWS DMS (Database Migration Service) write-behind pattern to maintain legacy system compatibility while slowly migrating business processes to consume from the golden record API.

Pros: Risk isolation by migrating one subsidiary at a time; 100% uptime maintenance through asynchronous synchronization; parallel run capability for validation; regulatory compliance achieved through the hub while operational independence persists.

Cons: Higher initial infrastructure costs; temporary complexity of maintaining dual systems; potential bidirectional synchronization conflicts requiring manual intervention.
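The probabilistic matching stage of Solution 3 ran on Spark MLlib, which is beyond a short sketch, but the core idea of weighted field agreement can be shown in plain Python. The field weights and the 0.8 threshold below are illustrative assumptions, not the production model.

```python
# Sketch: probabilistic duplicate detection via weighted string similarity.
# Weights and threshold are hypothetical; production used Spark MLlib.
from difflib import SequenceMatcher

FIELD_WEIGHTS = {"email": 0.5, "name": 0.3, "postcode": 0.2}

def match_score(a: dict, b: dict) -> float:
    """Weighted similarity of two customer records, in [0, 1]."""
    score = 0.0
    for fld, weight in FIELD_WEIGHTS.items():
        va, vb = a.get(fld, ""), b.get(fld, "")
        if va and vb:
            score += weight * SequenceMatcher(None, va.lower(), vb.lower()).ratio()
    return score

def is_duplicate(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Candidate pairs above the threshold become survivor-record merges."""
    return match_score(a, b) >= threshold
```

Pairs scoring above the threshold are merged into survivor records; borderline scores would typically fall into a steward review queue rather than auto-merging.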

Chosen Solution and Rationale. We selected Solution 3 because it uniquely balanced the aggressive regulatory deadline with the non-negotiable operational constraints. We prioritized the two largest subsidiaries for the first phase, leveraging Kafka's immutable, replayable logs, with log compaction retaining the latest state per customer key, so operations teams could replay and repair any synchronization errors without data loss. The MDM hub became the system of record for all new customer registrations, while AWS DMS propagated these changes back to legacy interfaces, ensuring users could continue with familiar workflows while data converged underneath.

Result. The consolidation completed in five months with zero unplanned downtime across any subsidiary. AML compliance reports generated exclusively from the MDM hub passed the regulatory audit without exception. Duplicate customer records decreased by 73% through the matching algorithms, and cross-selling revenue increased 18% within the first quarter post-completion, driven by newly unified customer visibility.

What candidates often miss

How do you resolve conflicting data ownership when two subsidiaries assert different credit limits for the same customer, with both values being legally valid under their respective regional contracts?

This scenario tests understanding of bi-temporal data modeling and contextualized golden records. Rather than forcing a single value through destructive consolidation, the MDM must implement Multi-Valued Attributes that preserve both credit limits with validity periods and legal entity context. The solution requires establishing a Data Governance Committee with representatives from each subsidiary to define precedence rules—such as "most restrictive limit applies for enterprise risk assessment"—while maintaining both original values for subsidiary-specific reporting. Technically, this involves adding jurisdiction and contractual-validity metadata fields to the canonical model, ensuring the system can render both the enterprise view (conservative risk exposure) and the subsidiary view (contractual obligations) without data loss.
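A multi-valued attribute with legal-entity context can be sketched as follows. The field names and the "most restrictive limit wins" precedence rule come from the scenario above; the data shapes themselves are illustrative assumptions.

```python
# Sketch: multi-valued credit limit. Both subsidiaries' values survive with
# their legal-entity context; the enterprise view applies the governance
# committee's rule that the most restrictive limit wins.
from dataclasses import dataclass

@dataclass(frozen=True)
class CreditLimit:
    subsidiary: str
    amount: float
    valid_from: str   # ISO dates; a full model would be bi-temporal
    valid_to: str

def enterprise_limit(limits: list[CreditLimit]) -> float:
    """Enterprise risk view: the most restrictive (lowest) limit applies."""
    return min(l.amount for l in limits)

def subsidiary_limit(limits: list[CreditLimit], subsidiary: str) -> float:
    """Contractual view: the limit asserted by one legal entity."""
    return next(l.amount for l in limits if l.subsidiary == subsidiary)
```

Nothing is destroyed: the same list renders as a conservative enterprise exposure or as each subsidiary's contractual obligation, depending on which view the report requests.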

What strategy ensures referential integrity when consolidating relational databases with foreign key constraints into an eventually consistent event-driven architecture using Apache Kafka?

Candidates frequently neglect transaction boundary analysis and the Saga pattern. When a business operation spans multiple subsidiaries—such as updating a customer's corporate hierarchy that exists partially in Salesforce and partially in SAP—the BA must design compensating transactions. If the Salesforce update succeeds but the SAP update fails, the system must issue a compensating rollback event to maintain consistency. This requires implementing Saga orchestrators within the MDM hub that manage distributed transactions across Kafka topics. Additionally, incorporating vector clocks or Lamport timestamps in the event schema allows detection of causality violations when subsidiaries simultaneously update the same entity, enabling conflict resolution based on business rules (such as "last timestamp wins" or "subsidiary with highest revenue volume wins").
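The compensating-transaction logic can be sketched as a minimal saga runner: each step pairs an action with its compensation, and on failure the orchestrator unwinds completed steps in reverse. The step names in the usage are hypothetical stand-ins for the Salesforce and SAP updates described above.

```python
# Sketch: minimal saga orchestrator. Each step carries a compensating action;
# a failure triggers rollback events for all previously completed steps,
# in reverse order.
def run_saga(steps) -> bool:
    """steps: list of (action, compensate) callables. True iff all succeed."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):   # emit compensating events
                comp()
            return False
    return True
```

In the scenario from the text, a successful Salesforce hierarchy update followed by a failed SAP update would leave only the Salesforce compensation to run, restoring cross-system consistency without distributed locks.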

Explain how you validate data accuracy during parallel run periods without doubling the manual verification workload for business users who must confirm records in both legacy CRM systems and the new MDM hub.

This addresses the Verification Paradox inherent in zero-downtime migrations. The solution involves synthetic transaction monitoring and statistical data fingerprinting rather than manual reconciliation. Implement automated checksum comparisons using frameworks like Great Expectations or Deequ to generate statistical profiles of data distributions in both source and target systems. For critical fields such as tax identification numbers, deploy deterministic matching with automated exception reporting. The BA should define tolerance thresholds—accepting a 99.5% match rate for non-critical fields while requiring 100% accuracy for financial identifiers—and implement data quality dashboards in Tableau or Power BI that highlight anomalies in real time, allowing users to focus only on significant discrepancies.
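The threshold logic can be sketched independently of any framework: compare field values across source and hub, and surface only fields whose match rate breaches tolerance. The thresholds mirror the example figures above (100% for tax IDs, 99.5% otherwise); the record and field names are assumptions.

```python
# Sketch: threshold-based reconciliation. Only fields breaching their
# tolerance are surfaced, so users review exceptions, not every record.
THRESHOLDS = {"tax_id": 1.0}   # critical identifiers require exact agreement
DEFAULT_THRESHOLD = 0.995      # non-critical fields tolerate 0.5% drift

def field_match_rates(source_rows: list, hub_rows: list) -> dict:
    """Fraction of row pairs (matched by position) where each field agrees."""
    rates = {}
    for fld in source_rows[0]:
        hits = sum(s[fld] == h[fld] for s, h in zip(source_rows, hub_rows))
        rates[fld] = hits / len(source_rows)
    return rates

def exceptions(source_rows: list, hub_rows: list) -> dict:
    """Fields breaching tolerance: the only ones users must verify manually."""
    rates = field_match_rates(source_rows, hub_rows)
    return {f: r for f, r in rates.items()
            if r < THRESHOLDS.get(f, DEFAULT_THRESHOLD)}
```

A dashboard built on `exceptions` shows an empty panel on a clean parallel-run day, which is exactly what halves the verification workload.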