History: Traditional e-commerce platforms relied on monolithic RDBMS instances with pessimistic locking, which collapsed under flash-sale loads exceeding 100,000 checkout requests per second. The industry shifted toward CQRS and Event Sourcing patterns to decouple read and write paths, but this introduced complexity in maintaining inventory accuracy across distributed warehouse management system (WMS) and legacy ERP silos with varying latency characteristics.
Problem: The core challenge lies in satisfying CAP theorem constraints during network partitions while preventing overselling, a strict business invariant. Distributed locking mechanisms like RedLock introduce latency and availability risks, while purely eventually consistent models risk selling phantom inventory. Additionally, heterogeneous integration points with legacy SOAP/XML-based WMS systems create impedance mismatches and timeout cascades that complicate atomic transaction boundaries.
Solution: Implement an Event Store (e.g., Apache Kafka or EventStoreDB) as the source of truth for inventory deltas, using optimistic concurrency control with vector clocks to establish causal ordering without global locks. Employ Saga orchestration (using Temporal or Camunda) to manage cross-WMS transactions, where local reservations are immediately committed to the event store, and asynchronous confirmation from WMS triggers final allocation or compensating releases. For read scalability, deploy CQRS with CDC via Debezium projecting into Redis or Elasticsearch, ensuring sub-50ms read latency while accepting temporary staleness mitigated by reservation TTLs.
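As a minimal illustration of the optimistic-concurrency append described above, the sketch below uses an in-memory stand-in for the event store. `InMemoryEventStore` and its method names are illustrative assumptions and do not reflect EventStoreDB's or Kafka's actual APIs:

```python
class ConcurrencyError(Exception):
    """Raised when the expected version does not match the stream head."""


class InMemoryEventStore:
    """Append-only store with one event stream per inventory aggregate (SKU)."""

    def __init__(self):
        self._streams = {}  # stream_id -> list of committed events

    def append(self, stream_id, event, expected_version):
        """Commit an event only if the stream is at the version the writer saw.

        A concurrent writer that read stale state fails here and must
        re-read the stream and retry, instead of taking a global lock.
        """
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}"
            )
        stream.append(event)
        return len(stream)  # the new stream version

    def read(self, stream_id):
        return list(self._streams.get(stream_id, []))
```

Two writers racing on the same SKU both read version 0; only the first append succeeds, and the loser retries against the updated stream rather than blocking.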
During Black Friday 2022 preparation, a global fashion retailer experienced catastrophic database timeouts when 50,000 concurrent users targeted limited-edition sneaker releases. Their existing MySQL primary-replica topology suffered severe write contention on hot inventory rows, resulting in 12-second checkout latencies and 300 confirmed overselling incidents caused by replication lag between the primary and its read replicas. The business required a solution that could absorb flash-sale traffic tsunamis while maintaining the strict invariant that sellable units never exceed physical warehouse stock.
The engineering team initially proposed implementing the Redis RedLock algorithm across three availability zones to enforce distributed mutual exclusion during inventory decrements. This approach offered strong consistency guarantees familiar to the team and straightforward integration with the Redis clusters already used for session management. However, critical drawbacks included unacceptable latency spikes exceeding 500ms during availability-zone failures and the theoretical risk of clock skew invalidating the lock's safety properties, which could stall inventory allocation during the most critical revenue-generating windows.
An alternative strategy involved horizontally sharding the database by SKU ranges and employing Two-Phase Commit protocols to maintain ACID guarantees across regional PostgreSQL instances. This solution provided the benefit of strong consistency and immediate inventory accuracy without complex eventual consistency patterns, fitting traditional transactional mindsets. Nevertheless, the cons proved prohibitive: the blocking nature of 2PC meant coordinator failures could hold database locks indefinitely, and the protocol's message complexity created network saturation during traffic peaks, fundamentally violating the availability requirements necessary for 24/7 global commerce.
The final candidate architecture embraced Event Sourcing with Apache Kafka and Saga orchestration, accepting BASE semantics while enforcing business invariants through compensating transactions. Pros included inherent horizontal scalability through partitionable event streams, immutable audit trails critical for fraud analysis, and natural integration with heterogeneous legacy WMS via idempotent event consumers. The primary cons involved a steep learning curve for developers unfamiliar with immutable data patterns and the operational complexity of managing event schema evolution and replay strategies for new read model projections.
The architecture committee selected the Event Sourcing approach because flash sales fundamentally prioritize availability and partition tolerance over immediate consistency, and the business logic could accommodate temporary soft reservations with five-minute TTLs rather than hard database locks. Unlike the locking alternatives, this design allowed the system to remain available during network partitions between data centers, ensuring customers could always attempt purchases even if warehouse confirmations experienced latency. Additionally, the immutable event log provided the auditability required by finance teams to reconcile discrepancies with third-party logistics providers.
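The five-minute soft reservations can be sketched as a small in-memory pool whose holds expire after a TTL, mimicking the semantics of a Redis key set with an expiry. `ReservationPool` and the injectable clock are illustrative assumptions, not part of the retailer's actual stack:

```python
import time


class ReservationPool:
    """Soft reservations that lapse after a TTL instead of holding hard locks."""

    def __init__(self, stock, ttl_seconds=300, clock=time.monotonic):
        self._stock = stock          # physical units on hand
        self._ttl = ttl_seconds      # five minutes in the case study
        self._clock = clock          # injectable for testing
        self._holds = {}             # reservation_id -> expiry timestamp

    def _evict_expired(self):
        now = self._clock()
        self._holds = {rid: exp for rid, exp in self._holds.items() if exp > now}

    def available(self):
        self._evict_expired()
        return self._stock - len(self._holds)

    def reserve(self, reservation_id):
        """Take a soft hold; it silently expires if never confirmed."""
        self._evict_expired()
        if len(self._holds) >= self._stock:
            return False  # sold out of holds; invariant preserved
        self._holds[reservation_id] = self._clock() + self._ttl
        return True

    def confirm(self, reservation_id):
        """WMS confirmed: convert the hold into a hard stock decrement."""
        if self._holds.pop(reservation_id, None) is not None:
            self._stock -= 1
            return True
        return False  # hold already expired; caller must re-reserve
```

If the warehouse confirmation never arrives, the hold simply ages out and the unit returns to the sellable pool, which is why the system can stay available during partitions without overselling.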
The implementation deployed Kafka Streams for local inventory aggregate management, Temporal for saga orchestration across SAP and custom WMS systems, and Redis with write-through caching for query optimization. During the subsequent Cyber Monday event, the platform successfully processed 120,000 concurrent checkouts with p99 latency under 80ms and zero overselling incidents, maintaining 99.99% availability despite a simulated regional outage in us-east-1 that would have crippled the previous monolithic architecture.
How do you prevent phantom inventory when processing concurrent reservations across multiple Kafka partitions without using global locks?
Phantom inventory occurs when concurrent commands read stale stock levels from different partitions and both commit reservations exceeding actual availability. To prevent this without global locks, implement optimistic concurrency control in the Event Store: each inventory aggregate maintains a monotonic version counter, every command carries the version it expects, and the store rejects any append whose expected version does not match the stream head, forcing the client to re-read and retry. Additionally, ensure partition affinity by hashing SKUs to specific partitions, which maintains single-writer semantics per aggregate and eliminates cross-partition coordination entirely.
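The partition-affinity step amounts to a deterministic SKU-to-partition mapping, sketched below with SHA-256 for portability (Kafka's Java client defaults to a murmur2-based partitioner for keyed records, so this is a stand-alone illustration rather than Kafka's own function):

```python
import hashlib


def partition_for(sku: str, num_partitions: int) -> int:
    """Map a SKU to one fixed partition so every command for that
    aggregate lands on a single consumer (single-writer semantics)."""
    digest = hashlib.sha256(sku.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping is a pure function of the key, every producer routes commands for a given SKU to the same partition, and the consumer of that partition is the sole writer for the aggregate.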
What is the compensating transaction strategy when a legacy WMS SOAP endpoint times out after the local event store has already committed the stock decrement?
This scenario represents a partial failure where local state diverges from external reality, requiring the Saga pattern with backward recovery. When the WMS adapter encounters a timeout, it publishes a timeout event to the saga orchestrator, which then appends a Stock_Released event to return inventory to the available pool while maintaining idempotency keys to prevent double-allocation on retries. Crucially, never delete the original decrement event; instead append compensating events to preserve the complete temporal history and audit trail of the transaction attempt.
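A compensating release with idempotency protection might look like the following sketch; the event shapes and the `compensate_timeout` helper are hypothetical, standing in for what the Temporal/Camunda orchestrator would do:

```python
def compensate_timeout(event_log, processed_keys, order_id, sku, qty):
    """Append a compensating Stock_Released event exactly once per order.

    event_log: append-only list of events; earlier events are never
        deleted or rewritten, preserving the full audit trail.
    processed_keys: set of idempotency keys already compensated, so
        retried timeout notifications do not double-release stock.
    """
    key = f"release:{order_id}:{sku}"
    if key in processed_keys:
        return False  # duplicate timeout notification: no-op
    processed_keys.add(key)
    event_log.append({
        "type": "Stock_Released",
        "order_id": order_id,
        "sku": sku,
        "qty": qty,
        "idempotency_key": key,
    })
    return True
```

Note that the original decrement event stays in the log; the release is a new event appended after it, so the temporal history of the failed attempt survives for reconciliation.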
How do you handle event replay and read model rebuilds without exposing inconsistent inventory counts to customers during the reconstruction process?
Replaying events to rebuild CQRS read models risks presenting transiently incorrect stock levels if projections update incrementally while queries execute. The solution employs blue-green deployment for read models: create a shadow projection consuming the event log from offset zero while the existing instance serves traffic, then atomically switch routing when the shadow catches up. Additionally, utilize Kafka log compaction and periodic S3 snapshots to reduce replay time, ensuring new projections minimize the window of potential inconsistency during reconstruction.
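The blue-green switch can be sketched as a router that rebuilds a shadow projection from the start of the log while the live copy keeps serving reads, then swaps the two atomically. `ProjectionRouter` is an illustrative name, not a Kafka Streams or Debezium API:

```python
class ProjectionRouter:
    """Routes reads to the live projection while a shadow rebuilds."""

    def __init__(self, live):
        self.live = live      # currently serving read model
        self.shadow = None    # candidate built by replay

    def start_rebuild(self, events, apply_event, initial_state):
        """Replay the full event log into a fresh shadow projection."""
        state = initial_state
        for event in events:
            state = apply_event(state, event)
        self.shadow = state
        return len(events)

    def cutover(self):
        """Promote the caught-up shadow; keep the old copy for rollback."""
        if self.shadow is None:
            raise RuntimeError("shadow projection not built yet")
        self.live, self.shadow = self.shadow, self.live
```

Queries only ever see the old projection or the fully rebuilt one, never a half-replayed intermediate state, which is the property the blue-green scheme buys.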