System Architecture

How would you design a fault-tolerant saga orchestration pattern for a distributed reservation system that compensates long-running transactions across independent subdomains while ensuring idempotency during duplicate request scenarios?


Answer to the question

Implement a centralized Saga Orchestrator using Temporal or Netflix Conductor that maintains durable workflow state in PostgreSQL and communicates with domain services over gRPC. Idempotency keys live in a Redis Cluster with TTL windows matching business constraints, while Apache Kafka serves as the event backbone for audit trails and compensation triggers. Each saga step must define a compensating transaction that reverses its effects, with the Saga State Machine pattern tracking explicit states (PENDING, SUCCEEDED, COMPENSATING, COMPENSATED) in etcd or ZooKeeper for cluster coordination.

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   API Gateway   │────▶│   Temporal   │────▶│   Inventory     │
└─────────────────┘     │  Orchestrator│     │   Service       │
                        └──────────────┘     └─────────────────┘
                               │                        │
                               ▼                        ▼
                        ┌──────────────┐          ┌─────────────────┐
                        │  PostgreSQL  │          │   PostgreSQL    │
                        │  State Store │          │   (Compensation │
                        └──────────────┘          │    Logic)       │
                                                  └─────────────────┘
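The explicit step states and reverse-order compensation described above can be sketched as a minimal in-process state machine. This is plain Python, not Temporal's actual SDK, and the step/callback names are hypothetical stand-ins for real domain services:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class StepState(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    COMPENSATING = "COMPENSATING"
    COMPENSATED = "COMPENSATED"

@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # forward operation (e.g. reserve room)
    compensate: Callable[[], None]    # reverse operation (e.g. release hold)
    state: StepState = StepState.PENDING

class SagaOrchestrator:
    def __init__(self, steps: List[SagaStep]):
        self.steps = steps

    def run(self) -> bool:
        completed: List[SagaStep] = []
        for step in self.steps:
            try:
                step.action()
                step.state = StepState.SUCCEEDED
                completed.append(step)
            except Exception:
                # A step failed: compensate already-succeeded steps
                # in reverse order, tracking each transition explicitly.
                for done in reversed(completed):
                    done.state = StepState.COMPENSATING
                    done.compensate()
                    done.state = StepState.COMPENSATED
                return False
        return True
```

A production engine additionally persists every state transition (the PostgreSQL state store in the diagram) before invoking the next step, so a coordinator restart can resume or compensate from durable state rather than memory.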

Situation from life

A global hotel booking platform struggled with cascading failures when coordinating room reservations, payment processing, and loyalty point updates across three distinct Kubernetes clusters in different regions. Their legacy implementation used Two-Phase Commit (2PC) over REST APIs, causing widespread deadlocks during peak traffic when the payment gateway experienced latency spikes exceeding 10 seconds.

The team evaluated Choreography-Based Saga using Amazon EventBridge, where each service published domain events to a shared bus. This approach eliminated the single point of failure and reduced infrastructure costs by 40%. However, it introduced severe observability challenges, as determining whether a complex multi-room booking had succeeded required querying logs across seventeen microservices. The implicit dependencies made it impossible to enforce consistent timeout policies, and debugging production issues became a forensic exercise spanning multiple AWS CloudWatch dashboards.

They prototyped an Orchestrated Saga using a custom Node.js coordinator deployed on AWS ECS. This centralized the business logic and simplified monitoring through a unified Grafana dashboard. Unfortunately, the initial implementation stored workflow state only in memory, resulting in catastrophic data loss when the coordinator restarted during deployments. Thirty transactions entered unknown states, requiring manual database reconciliation that took three days and resulted in significant revenue loss from double-charged customers.

The chosen solution deployed Temporal as the workflow engine with Cassandra persistence, ensuring state durability across pod restarts and AZ failures. The architecture used Protobuf schemas for type-safe communication between the orchestrator and domain services, with Redis Sentinel managing idempotency keys. When the payment service experienced a regional outage in us-east-1, the saga automatically triggered compensation workflows that released room holds within 200ms and revoked loyalty points atomically.

The system now processes 50,000 complex bookings daily with 99.99% consistency guarantees and zero manual interventions during network partitions. The mean time to detect (MTTD) failures dropped from 45 minutes to 8 seconds, while compensation latency remains under 500ms at p99.

What candidates often miss

How do you handle partial compensation failure when a compensating transaction itself fails, potentially leaving the system in an inconsistent state?

Implement a Compensation Audit Log using Event Sourcing, where every attempted compensation is recorded as an immutable event in Apache Kafka with infinite retention. The system must distinguish between transient infrastructure failures, which warrant automatic retry with exponential backoff, and business logic violations, which require human intervention. For transient issues, use Dead Letter Queues (DLQ) in RabbitMQ or Amazon SQS that reprocess compensations after service recovery, adding jitter to prevent thundering herds. For business rule violations, such as attempting to refund an already settled transaction, the saga enters a COMPENSATION_FAILED state that triggers PagerDuty alerts, while the command side of a CQRS model freezes the affected aggregate root and rejects further writes until an operator resolves it. Always design compensations to be idempotent using database unique constraints or Redis SETNX operations, ensuring that retries create no side effects.

What is the fundamental architectural difference between choreography and orchestration regarding temporal coupling and the ability to answer 'what is the current transaction state' queries?

Choreography aligns with the Reactive Manifesto, creating temporal decoupling where services react to events without knowledge of upstream or downstream participants, but it sacrifices the ability to query saga status without building complex Distributed Tracing with Jaeger or AWS X-Ray. The state becomes emergent from event logs, requiring CQRS read-model projections to answer 'is the booking complete' questions. Orchestration introduces explicit temporal coupling between the coordinator and workers, as the orchestrator must be available to trigger next steps, but it provides a single source of truth in its state store (PostgreSQL/CockroachDB). This allows immediate status queries at the cost of a network dependency. The critical insight is that choreography requires implementing state machines in every consumer, while orchestration centralizes this complexity; for systems requiring strong auditability and compliance (PCI-DSS), orchestration is preferred despite the coupling cost.
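The 'emergent state' point is worth making concrete: under choreography, answering 'is the booking complete' means folding the event log into a read-model projection. A minimal sketch, with hypothetical event shapes (`StepSucceeded`/`StepCompensated` dicts):

```python
def project_booking_status(events, required_steps):
    """Fold domain events into a queryable status.

    Under choreography no single row stores this answer; it must be
    derived from the log every time (or maintained as a materialized
    CQRS read model). Under orchestration the coordinator's state
    store holds it directly.
    """
    succeeded, compensated = set(), set()
    for e in events:
        if e["type"] == "StepSucceeded":
            succeeded.add(e["step"])
        elif e["type"] == "StepCompensated":
            compensated.add(e["step"])
    if compensated:
        # All forward progress undone -> rolled back; otherwise mid-rollback.
        return "ROLLED_BACK" if succeeded <= compensated else "COMPENSATING"
    return "COMPLETE" if required_steps <= succeeded else "IN_PROGRESS"
```

The projection itself is the hidden cost of choreography: every consumer that needs transaction status must build and maintain an equivalent of this fold.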

How do you prevent duplicate saga execution when using at-least-once delivery semantics in message brokers during Kafka consumer rebalancing or Kubernetes pod restarts?

Implement Idempotent Consumer patterns using Redis or Memcached to store processed message IDs, with deduplication windows matching your Recovery Point Objective (RPO), typically 24-48 hours for financial systems. When a saga orchestrator receives a command, generate a deterministic idempotency key by hashing the correlation ID with a business key (customer ID + booking reference) before performing any side effects. Each domain service must validate this key against its Idempotency Store, implemented as a PostgreSQL table with unique constraints on composite keys, or as Bloom Filters in Redis for memory-efficient negative lookups. For long-running sagas, use Saga State Machines with optimistic locking via etcd's revision-based compare-and-swap to approximate exactly-once processing across distributed nodes. This prevents double-booking scenarios when consumer groups rebalance during deployments or when network partitions trigger Kubernetes livenessProbe restarts.
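The key derivation and unique-constraint check above can be sketched as follows. This is a minimal illustration: sqlite3 stands in for the PostgreSQL idempotency table, and the field names are hypothetical:

```python
import hashlib
import sqlite3

def idempotency_key(correlation_id: str, customer_id: str, booking_ref: str) -> str:
    # Deterministic: redelivery of the same logical request (same
    # correlation ID + business key) always yields the same key.
    raw = f"{correlation_id}:{customer_id}:{booking_ref}".encode()
    return hashlib.sha256(raw).hexdigest()

class IdempotencyStore:
    """Unique-constraint dedup. In production this table lives in
    PostgreSQL; an in-memory sqlite3 database stands in here."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE processed (key TEXT PRIMARY KEY)")

    def claim(self, key: str) -> bool:
        """True if this consumer is first to process the key; False
        on a duplicate delivery (insert violates the PRIMARY KEY)."""
        try:
            with self.db:
                self.db.execute("INSERT INTO processed VALUES (?)", (key,))
            return True
        except sqlite3.IntegrityError:
            return False
```

Because the database enforces uniqueness, two consumers racing on the same redelivered message cannot both win the `claim`: exactly one INSERT commits, and the loser skips its side effects.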