System Architecture

How would you design the architecture for a planetary-scale, stateful stream processing platform that provides exactly-once semantics for event-time windowed aggregations over unbounded data streams, rebalances automatically during topology changes, recovers from checkpoints in under a second, and supports multiple processing languages and cross-region active-active replication without shared storage dependencies?

History of the question

Stream processing architectures evolved from Apache Storm's at-least-once, record-level acking to the modern exactly-once guarantees introduced by Apache Flink and Spark Structured Streaming. As enterprises migrated from batch Lambda architectures to continuous Kappa streams, complexity shifted from simple transformations to managing distributed state for windowed aggregations and sessionization. The emergence of data sovereignty requirements and regional latency constraints necessitated active-active deployments without relying on shared NFS or SAN storage, creating new challenges for state consistency during geographic failovers.

The problem

Stateful stream processing requires maintaining gigabytes of operator state (keyed windows, session stores) locally on processing nodes while ingesting millions of events per second. Exactly-once semantics demand atomic commits across three components: source offset tracking, state backend updates, and sink writes. Cross-region active-active replication without shared storage introduces split-brain risks when network partitions occur, while autoscaling requires live state migration without dropping in-flight records or violating processing-time guarantees. Supporting multiple languages (Java, Python, Go) traditionally forces serialization overhead or language-specific runtime lock-in.

The solution

The architecture employs a decoupled design with Apache Kafka or Apache Pulsar as the unified log, processing nodes running on Kubernetes with language-agnostic gRPC sidecars for polyglot support. State management uses embedded RocksDB with asynchronous incremental checkpoints to S3-compatible object storage, coordinated via a lightweight distributed coordination service (etcd or ZooKeeper). Exactly-once semantics are achieved through the Chandy-Lamport snapshot algorithm for state and two-phase commit (2PC) protocols for transactional sinks (Kafka transactions or idempotent JDBC writes). Cross-region replication utilizes log-based state shipping via Kafka MirrorMaker 2 or Pulsar Geo-Replication, with conflict resolution through CRDT-based commutative counters for aggregations and versioned primary ownership for keyed state.

Answer to the question

The platform consists of four logical layers: ingestion, processing, state management, and coordination.

Ingestion Layer

Apache Kafka clusters operate in multiple regions with MirrorMaker 2 enabling bidirectional topic replication. Producer idempotency and transactional IDs ensure exactly-once ingestion even during producer failover between regions.
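The deduplication behind producer idempotency can be illustrated with a minimal sketch: the broker tracks the highest sequence number appended per producer and drops retried batches. This models the idea behind Kafka's `enable.idempotence` (producer ID plus per-partition sequence numbers), not Kafka's actual implementation; class and field names are illustrative.

```python
# Simplified broker-side deduplication for an idempotent producer:
# the broker remembers the last sequence appended per producer and
# silently drops retried duplicates, so retries never double-write.

class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append a record; return False if it is a duplicate retry."""
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # retry of an already-appended batch
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
assert log.append(producer_id=1, seq=0, value="ride-requested")
assert log.append(producer_id=1, seq=1, value="driver-assigned")
# A network timeout causes the producer to retry sequence 1:
assert not log.append(producer_id=1, seq=1, value="driver-assigned")
assert log.records == ["ride-requested", "driver-assigned"]
```

The same mechanism extends across regional failover as long as the producer's ID and sequence state survive the switch, which is what transactional IDs provide.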

Processing Layer

Apache Flink or similar stream processors run as Kubernetes StatefulSets. Each TaskManager exposes a gRPC sidecar that accepts Protobuf-serialized tasks, enabling Python and Go user-defined functions (UDFs) to execute in sidecar containers while the Java runtime manages state and checkpointing. The JobManager shards the topology across TaskManagers using consistent hashing on record keys.
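The key-to-TaskManager assignment above can be sketched with a minimal consistent-hash ring: adding or removing a node remaps only a fraction of the keyspace, which is what makes live rebalancing tractable. The virtual-node count and hash function below are illustrative assumptions, not what any particular JobManager implements.

```python
import bisect
import hashlib

# Minimal consistent-hash ring: each node is placed at many virtual
# points on the ring, and a key is owned by the first node clockwise
# from its hash. Removing a node remaps only that node's arcs.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["tm-1", "tm-2", "tm-3"])
# The same record key always routes to the same TaskManager:
assert ring.owner("geohash:dr5ru") == ring.owner("geohash:dr5ru")
```

Because keyed state lives with its owning TaskManager, this stable routing is also what lets state migration during rescaling move only the affected key ranges.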

State Management

Operator state backends use RocksDB with incremental checkpointing enabled (state.backend.incremental in Flink). Checkpoints write delta state changes to regional S3 buckets asynchronously every 15 seconds. For cross-region consistency, active-active deployments use state-based counter CRDTs (G-Counters, PN-Counters) for commutative aggregations (counts, sums) and primary-key affinity for non-commutative operations. During regional failover, standby TaskManagers hydrate state from S3 using savepoints.
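The convergence property of the counter CRDTs can be shown in a few lines. A state-based G-Counter gives each region its own slot; merge takes the element-wise maximum, so concurrent regional updates converge without coordination. This is a textbook sketch with illustrative region names, not the platform's actual state representation.

```python
# State-based G-Counter CRDT: each region increments only its own
# slot, and merge takes the element-wise max, so replicas converge
# to the same total regardless of merge order or repetition.

class GCounter:
    def __init__(self, region: str):
        self.region = region
        self.slots = {}  # region -> local count

    def increment(self, n: int = 1):
        self.slots[self.region] = self.slots.get(self.region, 0) + n

    def merge(self, other: "GCounter"):
        for region, count in other.slots.items():
            self.slots[region] = max(self.slots.get(region, 0), count)

    def value(self) -> int:
        return sum(self.slots.values())

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)   # 3 events observed in us-east-1
eu.increment(2)   # 2 events observed in eu-west-1
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5  # both regions converge
```

A PN-Counter extends this with a second G-Counter for decrements; aggregations that cannot be expressed this way fall back to versioned primary ownership per key.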

Exactly-Once Guarantees

The system implements end-to-end exactly-once through:

  • Two-Phase Commit: Sinks participate in Flink's TwoPhaseCommitSinkFunction, pre-committing to Kafka or PostgreSQL during checkpoints and committing on successful checkpoint notification.
  • Idempotent Producers: Upstream Kafka producers use idempotent delivery with sequence numbers to deduplicate retries.
  • Transaction Isolation: Checkpoints act as transactional boundaries; uncommitted data remains invisible to downstream consumers.
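The checkpoint-aligned commit protocol in the bullets above can be modeled as a small state machine: writes accumulate in a pending transaction that is flushed at pre-commit and becomes visible only on checkpoint-complete notification. This is a toy model of the TwoPhaseCommitSinkFunction lifecycle, not its API; names are illustrative.

```python
# Toy two-phase-commit sink: records written between checkpoints sit
# in a pending transaction, pre_commit flushes them at the checkpoint
# barrier, and commit makes them visible only after the checkpoint
# coordinator confirms success. A failed checkpoint aborts the txn.

class TwoPhaseCommitSink:
    def __init__(self):
        self.pending = []    # pre-committed, invisible downstream
        self.committed = []  # visible after checkpoint notification

    def write(self, record):
        self.pending.append(record)

    def pre_commit(self):
        """Checkpoint barrier arrives: flush, keep the txn open."""
        txn, self.pending = self.pending, []
        return txn

    def commit(self, txn):
        """Checkpoint-complete notification: publish the txn."""
        self.committed.extend(txn)

    def abort(self, txn):
        """Checkpoint failed: discard; upstream replays from the
        last successful checkpoint, so nothing is lost."""

sink = TwoPhaseCommitSink()
sink.write("surge=1.4")
txn = sink.pre_commit()        # barrier reaches the sink
assert sink.committed == []    # still invisible to consumers
sink.commit(txn)               # coordinator confirms the checkpoint
assert sink.committed == ["surge=1.4"]
```

The crucial property is that visibility is gated on the global checkpoint, so a crash between pre-commit and commit either replays the transaction (commit on recovery) or aborts it, never exposing partial output.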

Situation from life

A global ride-sharing platform required real-time surge pricing calculations aggregating driver availability and ride demand per geohash across AWS us-east-1 and AWS eu-west-1. The previous architecture used a single-primary Redis cluster with replication lag, causing 2-second failover windows where pricing calculations produced stale or duplicate surge multipliers during regional outages, resulting in incorrect fare calculations and customer complaints.

Solution 1: Active-Passive with Shared Storage

The team considered mounting EFS (shared NFS) across regions for state storage. Pros: Simplified failover with single writer semantics, strong consistency. Cons: EFS latency exceeded 100ms for cross-region access, violating the 50ms processing SLA; additionally, NFS write consistency issues caused checkpoint corruption during network partitions.

Solution 2: Lambda Architecture

Implementing a speed layer with Kafka Streams and a batch layer with Spark for corrections. Pros: Fault tolerance through immutable logs, simple recovery. Cons: Operational complexity maintaining dual code paths; batch corrections arrived too late for surge pricing which required sub-second accuracy to balance supply-demand.

Solution 3: Active-Active Stream Processing with CRDTs

Deploying Apache Flink in both regions with RocksDB state, incremental S3 checkpoints, and CRDT-based counters for ride counts. Pros: Local processing latency under 20ms, automatic conflict resolution for concurrent regional updates, zero-downtime failover. Cons: Required refactoring aggregations to be commutative (using G-Counters and PN-Counters), increased storage costs for dual regional checkpoints.

The team selected Solution 3 because the business requirement of 99.99% availability with sub-second failover could tolerate neither the previous architecture's 2-second failover window nor the cross-region latency of Solution 1's shared storage. They implemented G-Counters for driver counts and LWW-Registers for the latest pricing multipliers.
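The LWW-Register mentioned for pricing multipliers can be sketched as well: each write carries an event timestamp, and merge keeps the value with the highest timestamp, with ties broken by region ID so both regions resolve identically. This is a minimal illustration with assumed field names, not the team's code.

```python
# Last-writer-wins register: the (event_time, region) stamp totally
# orders writes, so concurrent regional updates deterministically
# converge to the same winner after a symmetric merge.

class LWWRegister:
    def __init__(self, region: str):
        self.region = region
        self.value = None
        self.stamp = (-1, "")  # (event_time_ms, writer_region)

    def set(self, value, event_time_ms: int):
        stamp = (event_time_ms, self.region)
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

    def merge(self, other: "LWWRegister"):
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp

us, eu = LWWRegister("us-east-1"), LWWRegister("eu-west-1")
us.set(1.8, event_time_ms=1_000)
eu.set(2.1, event_time_ms=1_005)  # later update wins
us.merge(eu)
eu.merge(us)
assert us.value == eu.value == 2.1  # deterministic winner
```

Note the timestamps must come from event time, not regional wall clocks, or clock skew between regions would silently reorder writes.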

Result

The system achieved exactly-once surge pricing calculations with 15ms p99 latency in both regions. During a simulated us-east-1 outage, eu-west-1 seamlessly continued processing using locally replicated state without duplicate fare calculations. Checkpoint recovery time averaged 800ms, well within the sub-second requirement.

What candidates often miss

How does checkpoint interval tuning interact with backpressure mechanisms in stateful stream processors?

Many candidates optimize checkpoint intervals for recovery time without considering backpressure propagation. When checkpoint barriers align slowly due to backpressure, the Chandy-Lamport algorithm pauses pipeline execution, potentially causing cascading timeouts. The correct approach involves aligning checkpoint timeouts with backpressure thresholds, using unaligned checkpoints (where barriers overtake buffers) during high load, and separating synchronous versus asynchronous checkpoint phases. RocksDB's incremental checkpoints must be throttled using RateLimiter configurations to prevent SST compaction from overwhelming the disk I/O and exacerbating backpressure.

What is the fundamental difference between at-least-once delivery combined with idempotent sinks versus true exactly-once processing semantics?

Idempotent sinks guarantee that duplicate processing produces the same output state (e.g., UPSERT operations in PostgreSQL or HBase), but they expose intermediate states during retries. If a sink writes records A, B, then crashes and retries writing A, B, C, downstream observers momentarily see A, B, A, B, C before deduplication. True exactly-once (effectively-once) uses transactional isolation where pre-committed data remains invisible until checkpoint completion. This requires the sink to support transactions (e.g., Kafka transactions with isolation.level=read_committed) or two-phase commit protocols. Candidates often miss that idempotency solves the correctness problem but not the consistency/visibility problem during recovery.
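The visibility difference is easy to demonstrate. A transaction-aware consumer buffers records until it sees a commit marker, so the retried A, B, A, B, C sequence never surfaces downstream. The marker encoding below is a simplification for illustration, not Kafka's on-disk control-record format.

```python
# Sketch of read_committed consumption: records are buffered until a
# COMMIT marker arrives; an ABORT discards the buffered attempt, so
# a crashed-and-retried transaction is never observed downstream.

COMMIT, ABORT = "COMMIT", "ABORT"

def read_committed(log):
    """Return only records from committed transactions, in order."""
    visible, buffered = [], []
    for entry in log:
        if entry == COMMIT:
            visible.extend(buffered)
            buffered = []
        elif entry == ABORT:
            buffered = []          # failed attempt never surfaces
        else:
            buffered.append(entry)
    return visible

# First attempt writes A, B then crashes; the retry writes A, B, C:
log = ["A", "B", ABORT, "A", "B", "C", COMMIT]
assert read_committed(log) == ["A", "B", "C"]  # no duplicates seen
```

An idempotent-sink consumer reading the same log without markers would momentarily observe the duplicated prefix, which is exactly the consistency gap the question probes.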

How should event-time windowing handle late-arriving data during cross-region failover scenarios?

When failover occurs from Region A to Region B, in-flight records in Region A's network buffers may be lost or delayed beyond the watermark horizon. Candidates frequently suggest extending watermarks indefinitely, which breaks window completeness guarantees. The correct architecture uses side outputs (in Flink terminology) for late data capture combined with allowed-lateness specifications. During failover, the system should hydrate windows from S3 savepoints with timestamps, then merge late-arriving records from the failed region's dead letter queue into subsequent windows or trigger specific late-data handlers. Additionally, watermark generation must be deterministic across regions; using wall-clock time for watermarks causes divergence during failover, so watermarks must derive from monotonic event-time extraction across both active regions.
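A deterministic, event-time-driven watermark generator of the kind described can be sketched in a few lines: it advances only on observed event timestamps with a bounded-out-of-orderness allowance, so both regions derive identical watermarks from the same events regardless of wall-clock skew. The 5-second bound is an illustrative assumption.

```python
# Bounded-out-of-orderness watermark generator: the watermark trails
# the maximum observed event time by a fixed lateness bound and is
# monotonic, so late records can never move it backwards. Because it
# depends only on event timestamps, two regions replaying the same
# events compute identical watermarks.

class BoundedLatenessWatermark:
    def __init__(self, max_lateness_ms: int = 5_000):
        self.max_lateness_ms = max_lateness_ms
        self.max_event_time = 0

    def on_event(self, event_time_ms: int) -> int:
        """Advance on new events; never regress (monotonic)."""
        self.max_event_time = max(self.max_event_time, event_time_ms)
        return self.watermark()

    def watermark(self) -> int:
        return self.max_event_time - self.max_lateness_ms

wm = BoundedLatenessWatermark()
wm.on_event(10_000)
assert wm.watermark() == 5_000
wm.on_event(8_000)               # late record: watermark holds
assert wm.watermark() == 5_000
wm.on_event(12_000)
assert wm.watermark() == 7_000
```

Records arriving with timestamps below the current watermark minus the allowed lateness are the ones routed to the side output for late-data handling.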