Modern streaming architectures for IoT telemetry leverage Apache Kafka as the distributed event backbone, handling millions of messages per second with durable persistence and horizontal scalability. Apache Flink serves as the stream processing engine, providing true streaming semantics with sophisticated event-time processing capabilities and coordinating with Kafka transactions to guarantee exactly-once delivery semantics across the entire pipeline. State management utilizes RocksDB embedded backends with incremental asynchronous snapshots to Amazon S3, enabling terabyte-scale stateful operations without exhausting JVM heap memory. For immediate alerting, hot aggregation results are materialized in Redis, while historical data flows to S3 Glacier via Apache Iceberg tables for cost-effective analytical queries.
A smart energy utility monitors two million smart meters generating ten thousand events per second, requiring detection of power grid anomalies within 500 milliseconds to prevent cascade failures. The core challenge involves processing events arriving up to five minutes late due to cellular network partitions, eliminating duplicates from meter retry logic, and joining high-velocity telemetry with slowly-changing reference data containing device calibration metadata. Engineers previously struggled with false positives caused by out-of-sequence events and data loss during peak loads, necessitating a robust architecture that maintains accuracy without sacrificing real-time responsiveness.
Solution 1: Lambda Architecture with Spark Streaming and Batch
The initial proposal adopted a Lambda Architecture pattern. Apache Spark Streaming powered the speed layer for approximate real-time views, while nightly Spark SQL batch jobs recomputed exact results over HDFS for the preceding 24 hours.
Pros: Mature ecosystem with extensive tooling, straightforward fault tolerance via HDFS replication, and clear separation of concerns between speed and batch layers.
Cons: Code duplication between streaming and batch logic creates significant maintenance overhead and synchronization bugs. Reprocessing terabytes daily incurs prohibitive compute costs, and because exact results arrive only after the nightly batch completes, the speed layer's approximations cannot satisfy the 500 ms anomaly-detection requirement on their own.
Solution 2: Kafka Streams with Embedded Stores
A second design considered Kafka Streams with embedded RocksDB state stores running directly on application pods, avoiding external cluster management.
Pros: Simplified operational topology without separate processing clusters, tight native integration with Kafka's consumer groups, and automatic partition assignment handling.
Cons: Scaling stateful operations triggers expensive rebalancing and state restoration from changelog topics, causing significant latency spikes. Handling out-of-order events requires custom TimestampExtractor logic and careful grace-period tuning: Kafka Streams windows on the record's embedded timestamp by default, which for meter retries may reflect the retry time rather than the original reading time. Memory and disk constraints on application servers limit total state size, preventing large windowed aggregations.
Solution 3: Apache Flink with Event-Time Semantics
The selected architecture deployed Apache Flink on Kubernetes, leveraging event-time processing semantics with watermarks and externalized incremental checkpoints to Amazon S3.
Pros: Native event-time processing via watermarks and allowedLateness configurations handles out-of-order data without custom logic. Exactly-once semantics are achieved through two-phase commits coordinating Flink checkpoints with Kafka transactions. RocksDB incremental snapshots enable independent scaling of compute and state, supporting terabyte-scale keyed windows without memory pressure.
Cons: Significant operational complexity requires deep expertise in checkpoint tuning, watermark alignment, and backpressure management. The Flink JobManager represents a potential single point of failure necessitating Kubernetes high-availability configurations.
Chosen Solution and Result
We adopted Solution 3, configuring Flink's BoundedOutOfOrdernessWatermarks with a five-minute tolerance and RocksDB incremental checkpoints every 30 seconds. Duplicate elimination was achieved by enabling Kafka's idempotent producers and transactional writes coordinated with Flink's two-phase commit protocol. Data tiering to S3 Glacier utilized Apache Iceberg compaction strategies to maintain queryable historical datasets without excessive storage costs.
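The bounded-out-of-orderness behavior configured above can be illustrated with a minimal pure-Python sketch (not the actual Flink API): the watermark trails the maximum observed event time by the five-minute tolerance, and a window may fire only once the watermark passes its end.

```python
from datetime import timedelta

TOLERANCE_S = timedelta(minutes=5).total_seconds()  # five-minute lateness bound

class BoundedOutOfOrdernessWatermark:
    """Watermark that lags the max seen event time by a fixed bound."""
    def __init__(self, tolerance_s: float):
        self.tolerance_s = tolerance_s
        self.max_event_time = float("-inf")

    def on_event(self, event_time_s: float) -> float:
        self.max_event_time = max(self.max_event_time, event_time_s)
        return self.max_event_time - self.tolerance_s  # current watermark

wm = BoundedOutOfOrdernessWatermark(TOLERANCE_S)
window_end = 600.0  # a tumbling window covering [0, 600) seconds

fired = False
for t in [100.0, 550.0, 420.0, 900.0]:  # 420.0 arrives out of order
    watermark = wm.on_event(t)
    if not fired and watermark >= window_end:
        fired = True  # window [0, 600) may now emit its result
```

Note that the out-of-order event at 420 s never advances the watermark; only the event at 900 s pushes it to 600 s and closes the window.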
This architecture achieved 300ms p99 alerting latency and 99.99% processing accuracy during production trials. The system gracefully handled a three-hour cellular network partition by replaying from Kafka offsets after checkpoint restoration, with zero data loss. Storage costs decreased by 60% compared to the previous HDFS solution, while Grafana dashboards provided real-time visibility into Flink's watermark lag and checkpoint duration metrics.
Question: How does Apache Flink maintain exactly-once semantics when sinking to Kafka, and what prevents duplicate writes during job restarts?
Flink implements exactly-once via a two-phase commit protocol that couples checkpoint barriers with Kafka transactions. During the pre-commit phase, data is flushed to Kafka under an open transaction but remains invisible to read-committed consumers until the checkpoint completes successfully. If the checkpoint fails, Flink aborts the transaction and Kafka discards the uncommitted data; upon restart, Flink recovers transactional state from the last successful checkpoint and aborts any lingering "zombie" transactions left over from incomplete writes. Candidates often miss that transactional IDs are derived deterministically (from the configured prefix plus subtask and a checkpoint-related counter) so that commits are idempotent across restarts, and that Flink's KafkaSink requires a setTransactionalIdPrefix configuration to avoid ID collisions in multi-tenant Kafka clusters.
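The pre-commit/commit/abort flow can be sketched as a toy simulation in pure Python; KafkaTxn below is a hypothetical stand-in for a transactional producer, not a real client API.

```python
class KafkaTxn:
    """Hypothetical stand-in for a transactional Kafka producer."""
    def __init__(self):
        self.pending, self.committed = [], []

    def send(self, record):   # pre-commit: flushed but invisible to consumers
        self.pending.append(record)

    def commit(self):         # second phase: make pending records visible
        self.committed.extend(self.pending)
        self.pending.clear()

    def abort(self):          # checkpoint failed: discard uncommitted data
        self.pending.clear()

def checkpoint_cycle(txn, records, checkpoint_ok):
    for r in records:
        txn.send(r)           # phase 1: write under an open transaction
    if checkpoint_ok:
        txn.commit()          # checkpoint completed -> commit transaction
    else:
        txn.abort()           # checkpoint failed -> data is discarded

txn = KafkaTxn()
checkpoint_cycle(txn, ["m1", "m2"], checkpoint_ok=True)
checkpoint_cycle(txn, ["m3"], checkpoint_ok=False)  # simulated failure
```

Only the records from the successful checkpoint cycle become visible; the failed cycle leaves nothing behind, which is exactly why downstream consumers never observe duplicates after a restart.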
Question: Why does event-time windowing cause state explosion in keyed operations, and how do you mitigate this when processing unbounded device ID streams?
Event-time windowing causes state explosion because Flink must retain per-key window state—buffered events, or per-window accumulators when using incremental aggregation such as an AggregateFunction—until the watermark passes the window end time plus the configured allowedLateness duration. For high-cardinality keys like unique device identifiers, this accumulates millions of concurrent window states in RocksDB, eventually consuming all available disk and memory resources. Mitigation requires enabling State TTL (Time-To-Live) configurations to automatically expire stale state, configuring RocksDB's managed memory budget to limit off-heap usage, and using incremental checkpoints to reduce snapshot overhead. Candidates frequently overlook that without explicit window eviction or TTL settings, the state backend grows indefinitely until a task manager runs out of disk or memory, especially when processing late-arriving historical data.
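The TTL mitigation can be sketched in pure Python (Flink's actual State TTL is configured via StateTtlConfig; this is only an illustration of the expire-on-access idea): entries expire a fixed duration after their last update and are cleaned up lazily on read.

```python
import time

class TtlKeyedState:
    """Minimal TTL map: entries expire ttl_s after their last write."""
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.store = {}  # key -> (value, last_update_timestamp)

    def update(self, key, value, now=None):
        self.store[key] = (value, time.monotonic() if now is None else now)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry is None:
            return None
        value, ts = entry
        if now - ts > self.ttl_s:  # expired: clean up lazily on read
            del self.store[key]
            return None
        return value

state = TtlKeyedState(ttl_s=3600.0)          # one-hour TTL
state.update("device-42", {"count": 7}, now=0.0)
live = state.get("device-42", now=1800.0)    # within TTL -> still present
stale = state.get("device-42", now=7200.0)   # past TTL -> expired and removed
```

Flink additionally offers background cleanup (e.g., during RocksDB compaction), which matters for keys that are never read again after a device goes offline.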
Question: How do you resolve hot key skew when a single malfunctioning IoT device generates 100x the normal event volume, overwhelming a specific Flink subtask?
Hot key skew occurs when partition hashing concentrates high-volume keys onto single task instances, creating backpressure and latency spikes throughout the pipeline. The solution involves key salting—appending a random suffix (e.g., 0-9) to hot keys during the initial shuffle to distribute processing across multiple subtasks, then stripping the suffix and re-aggregating the partial results in a subsequent global stage. Alternatively, implement pre-aggregation with Flink's AggregateFunction before the shuffle to reduce network traffic, or apply Kafka producer quotas to throttle the misbehaving device at the source. Candidates often miss that salting increases shuffle volume and state size, requiring careful balancing between parallelism gains and the overhead of managing synthetic keys in RocksDB.
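The two-stage salting pattern can be sketched as follows (pure Python, not Flink operators; the bucket count of 10 is an illustrative assumption): stage one counts per salted subkey, stage two strips the salt and merges partials.

```python
import random
from collections import defaultdict

SALT_BUCKETS = 10  # assumed fan-out for the hot key

def salted_key(device_id: str) -> str:
    """Stage 1 key: spread one hot key across SALT_BUCKETS synthetic subkeys."""
    return f"{device_id}#{random.randrange(SALT_BUCKETS)}"

def two_stage_count(events):
    # Stage 1: pre-aggregate per salted key (would run on many parallel subtasks)
    partial = defaultdict(int)
    for device_id in events:
        partial[salted_key(device_id)] += 1
    # Stage 2: strip the salt and merge partial counts in a global aggregation
    final = defaultdict(int)
    for key, count in partial.items():
        final[key.rsplit("#", 1)[0]] += count
    return dict(final)

events = ["hot-device"] * 1000 + ["dev-a"] * 3
counts = two_stage_count(events)
```

The final counts are identical to an unsalted aggregation; only the intermediate parallelism changes, which is why the synthetic suffix must be removed before results are emitted.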