Answer to the question
The architecture centers on the Durable Execution pattern, separating ephemeral compute from durable state through an event-sourced control plane. At its core, workflow definitions execute as deterministic state machines: every state transition is persisted as an immutable event in Apache Kafka (serving as the write-ahead log) before it is acknowledged, enabling deterministic replay after failures. The compute layer runs on AWS Lambda or Azure Functions partitioned into tenant-specific VPCs and IAM boundaries, ensuring isolation while leveraging provisioned-concurrency warm pools to mitigate cold starts. For exactly-once semantics across regions, the system stores workflow state in CockroachDB with its default serializable isolation, relying on Raft consensus for cross-region consistency without a separate coordinator service. Event correlation achieves sub-second latency through a tiered approach: Redis clusters with RedisJSON indexing handle hot event matching in memory, Elasticsearch serves as cold storage for historical correlation queries, and Cloudflare Workers provide edge-side event buffering to absorb traffic spikes.
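The persist-then-apply discipline above can be shown in a minimal sketch. An in-memory list stands in for the Kafka write-ahead log, and the class and method names (WorkflowStateMachine, handle, replay) are illustrative, not a real API:

```python
import json
from dataclasses import dataclass, field

@dataclass
class WorkflowStateMachine:
    workflow_id: str
    state: str = "CREATED"
    log: list = field(default_factory=list)  # stand-in for the Kafka WAL

    # Deterministic transition table: (current state, event) -> next state
    TRANSITIONS = {
        ("CREATED", "start"): "RUNNING",
        ("RUNNING", "complete"): "DONE",
        ("RUNNING", "fail"): "FAILED",
    }

    def handle(self, event: str) -> str:
        next_state = self.TRANSITIONS[(self.state, event)]
        # 1. Persist the immutable event BEFORE acknowledging or mutating state.
        self.log.append(json.dumps({"wf": self.workflow_id, "event": event}))
        # 2. Only then apply the deterministic transition.
        self.state = next_state
        return self.state

    @classmethod
    def replay(cls, workflow_id: str, log: list) -> "WorkflowStateMachine":
        # Deterministic replay: re-applying the same log always yields the same state.
        sm = cls(workflow_id)
        for record in log:
            sm.state = cls.TRANSITIONS[(sm.state, json.loads(record)["event"])]
        sm.log = list(log)
        return sm
```

Because transitions are a pure function of (state, event), a crashed worker can be reconstructed anywhere by replaying the log, which is what makes the compute layer safely ephemeral.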
Situation from life
During Black Friday 2023, SwiftCart (a global e-commerce platform) faced catastrophic failures in their legacy Step Functions implementation while processing 50M concurrent delivery workflows lasting 3-7 days each. When us-east-1 experienced a regional outage, failover to us-west-2 resulted in 12,000 duplicated shipments because workflow state reconciliation relied on DynamoDB eventual consistency with 5-minute TTL windows. Simultaneously, carrier webhook events faced 30-second correlation delays, breaking real-time tracking promises to customers and incurring $2M in SLA penalties.
Solution A: Kubernetes-based orchestrator with Airflow on EKS
This approach promised full control and mature tooling through Apache Airflow running on Amazon EKS with PostgreSQL as the metadata store. Pros included an extensive plugin ecosystem and a straightforward local development experience. However, the cons proved fatal: pod scheduling latency averaged 45 seconds, violating the scale-to-zero requirement that idle workflows should incur near-zero compute costs. Additionally, maintaining PostgreSQL synchronous replication across regions added 200ms to every task state transition, and the lack of built-in exactly-once semantics required complex application-level locking that frequently deadlocked during regional failovers.
Solution B: Pure event-driven choreography using Kafka and Lambda
This serverless-native approach utilized Amazon MSK (Kafka) as the source of truth with Lambda functions reacting to events without a central orchestrator. Pros included true pay-per-use economics and natural resilience through log-based persistence. However, implementing exactly-once semantics required distributed transactions spanning DynamoDB (for idempotency) and Kafka, introducing 500ms+ latency per operation. Furthermore, reconstructing workflow state for long-running processes (day 5 of a 7-day workflow) necessitated replaying millions of events from S3 archives, causing recovery times exceeding 10 minutes and turning mid-sequence failures into undebuggable "distributed spaghetti".
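The idempotency requirement that made this approach expensive reduces to a check-then-record step in every consumer. A hedged sketch, with a plain dict standing in for the DynamoDB idempotency table (in the real system the check and the write must be a single atomic conditional put, which is exactly where the 500ms+ cross-store transaction cost came from):

```python
processed = {}  # stand-in for a DynamoDB idempotency table keyed by event ID

def handle_event(event_id: str, payload: dict, side_effects: list) -> bool:
    """Process an event at most once; return False for duplicate deliveries."""
    if event_id in processed:
        return False  # already handled; skip to preserve the exactly-once effect
    # NOTE: in production this check-and-set must be one atomic conditional
    # write, or two consumers racing on the same event both pass the check.
    processed[event_id] = payload
    side_effects.append(payload)  # the real work, e.g. emitting a downstream event
    return True
```

Kafka's at-least-once delivery means every handler must tolerate redelivery; without a coordinated transaction across the dedupe table and the output topic, a crash between the two writes still leaks duplicates.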
Solution C: Durable Execution Platform with sharded state management
The chosen architecture implemented a custom Temporal-inspired control plane separating durable state (CockroachDB with geo-partitioned tables) from ephemeral Lambda workers. Consistent hashing distributed workflow shards across regional database nodes, while Redis Streams provided sub-millisecond event correlation buffering. Pros included native exactly-once semantics through CockroachDB's serializable transactions, deterministic replay for debugging, and true scale-to-zero where inactive workflows resided only in cheap S3 snapshots. Cons involved significant operational complexity in maintaining etcd clusters for service discovery and the need for sophisticated caching to prevent thundering herds during mass wake-up scenarios.
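The shard-placement step can be illustrated with a minimal consistent-hash ring. Virtual nodes smooth the key distribution so that adding or removing a regional node remaps only its own arc of workflow IDs; the node names and vnode count here are assumptions for illustration:

```python
import bisect
import hashlib

class HashRing:
    """Maps workflow IDs to regional shard owners via consistent hashing."""

    def __init__(self, nodes, vnodes: int = 64):
        # Each physical node appears vnodes times on the ring to even out load.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, workflow_id: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash.
        idx = bisect.bisect(self.keys, self._hash(workflow_id)) % len(self.ring)
        return self.ring[idx][1]
```

Because placement is a pure function of the workflow ID and the ring membership, any router replica computes the same owner without coordination, which is what lets the control plane stay stateless.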
Result
By implementing Solution C with per-tenant SQS queues and 1-second visibility timeouts, SwiftCart achieved zero workflow duplication during the subsequent Prime Day event despite a 45-minute us-west-2 outage. Event correlation p95 latency dropped to 400ms through Redis edge caching. Infrastructure costs decreased by 70% compared to the always-on EKS approach, with 85% of workflows existing solely as compressed state snapshots in S3 during idle waiting periods, resulting in $1.4M annual savings.
What candidates often miss
How do you prevent workflow state divergence when both regions simultaneously process events during a network partition?
Most candidates incorrectly suggest last-write-wins semantics in DynamoDB or Cassandra, which fails for workflow orchestration because business operations are non-commutative (e.g., "cancel order" versus "ship order" cannot be reconciled by timestamp alone). The correct implementation utilizes Vector Clocks or Dotted Version Vectors embedded within the workflow state metadata. When the network partition heals, the system detects concurrent branches through version vector comparison and applies domain-specific merge functions. For irreconcilable conflicts (such as simultaneous cancellation and shipment), the architecture implements a saga compensation pattern where the later operation triggers a rollback of the earlier action with comprehensive audit logging. Alternatively, leveraging CockroachDB's default serializable isolation prevents divergence entirely by rejecting conflicting writes during the partition, forcing explicit retry loops with exponential backoff rather than allowing silent data corruption.
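The version-vector comparison at the heart of this answer fits in a few lines. A sketch, assuming per-region counters keyed by region name (the keys and return labels are illustrative):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two version vectors; 'concurrent' means divergent branches."""
    keys = set(vc_a) | set(vc_b)
    a_dominates = all(vc_a.get(k, 0) >= vc_b.get(k, 0) for k in keys)
    b_dominates = all(vc_b.get(k, 0) >= vc_a.get(k, 0) for k in keys)
    if a_dominates and b_dominates:
        return "equal"
    if a_dominates:
        return "a_after_b"      # b's state is a stale ancestor; a wins safely
    if b_dominates:
        return "b_after_a"
    # Neither dominates: both regions advanced independently during the
    # partition, so a domain-specific merge or saga compensation is required.
    return "concurrent"
```

This is precisely the signal last-write-wins throws away: a timestamp can order "cancel" and "ship", but only the vector comparison reveals that neither side observed the other, so the conflict must be surfaced rather than silently overwritten.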
How do you handle workflow code versioning when a 7-day-long workflow started on v1.0 must complete after you've deployed v2.0 with altered activity semantics?
Candidates frequently overlook the Deterministic Replay requirement fundamental to durable execution. Simply updating the Lambda function code breaks in-flight workflows because the replay logic (used to reconstruct state after crashes) diverges from the original execution path, causing non-deterministic exceptions. The solution implements explicit Workflow Versioning through event sourcing markers. When deploying v2.0, workers must concurrently support both v1.0 and v2.0 activity implementations within WebAssembly sandboxes or separate Docker sidecars. The workflow state records which code version executed each historical activity; during replay, the worker loads the specific historical version's sandbox to ensure deterministic re-execution of past steps, while new workflows utilize v2.0. After the maximum workflow duration (7 days plus a 24-hour safety buffer), v1.0 code can be decommissioned. This requires maintaining backward-compatible activity signatures indefinitely or employing Pact Contract Testing to verify behavioral equivalence between versions.
How do you protect against "poison pill" workflows containing infinite loops or memory leaks in user code without breaking exactly-once guarantees for healthy workflows?
Simple Dead Letter Queues (DLQ) actually violate exactly-once semantics because moving a message to a DLQ requires acknowledging the original message, risking message loss if the DLQ write fails or the consumer crashes mid-operation. The robust solution employs Progress Tracking with idempotent checkpointing. Workers heartbeat every 30 seconds, writing progress tokens to etcd or CockroachDB using compare-and-swap operations. If a worker crashes three times consecutively on the same workflow task (detected via an execution attempt counter stored in the database), the task is flagged as "poisoned" but remains in the queue with an exponentially increasing visibility delay (1 minute, 5 minutes, 30 minutes). A separate "surgical" worker pool with enhanced observability, memory limits, and detailed OpenTelemetry tracing then attempts execution. Only after 24 hours of persistent failures does the workflow transition to a "suspended" state requiring manual operator intervention, preserving the exactly-once invariant throughout because all state transitions utilize MVCC timestamps in CockroachDB for atomic compare-and-swap operations.
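The escalation policy above is a pure function of the attempt counter and failure duration, which is what makes it safe to drive from a compare-and-swap on the database row. A sketch with the thresholds from the text (the three-attempt trigger, the 1/5/30-minute ladder, the 24-hour suspension); the function shape itself is illustrative:

```python
BACKOFF_SECONDS = [60, 300, 1800]  # 1 min, 5 min, 30 min visibility delays

def next_action(attempts: int, hours_failing: float) -> tuple:
    """Decide routing for a task given its CAS-maintained attempt counter."""
    if hours_failing >= 24:
        # Past the 24-hour window: suspend for manual operator intervention.
        return ("suspend", None)
    if attempts >= 3:
        # Flagged as poisoned: keep it in the queue (never ack-and-move, which
        # risks loss) with an escalating delay, routed to the surgical pool.
        delay = BACKOFF_SECONDS[min(attempts - 3, len(BACKOFF_SECONDS) - 1)]
        return ("surgical_pool", delay)
    return ("retry", 0)  # healthy retry on the normal worker pool
```

Because the decision never acknowledges the original message, the exactly-once invariant holds at every step: the task either runs to a checkpointed completion or stays durably in the queue.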