System Architecture

How would you architect a globally distributed, multi-tenant time-series database platform that ingests high-cardinality telemetry from millions of heterogeneous IoT devices, maintains sub-second query latency across petabytes of tiered storage, and enforces geographically fragmented data sovereignty policies without centralized coordination bottlenecks?


Answer to the question.

History of the question.

The proliferation of Industry 4.0 and smart city infrastructure has transformed time-series data management from a niche Ops concern into a foundational layer of modern digital economies. Early solutions like Graphite or single-node InfluxDB served monolithic applications adequately, but the modern landscape involves millions of heterogeneous IoT endpoints emitting high-cardinality telemetry across fragmented geopolitical boundaries. The convergence of exponential data growth with stringent data sovereignty mandates—such as the Schrems II ruling in the European Union—has rendered centralized cloud architectures legally untenable, necessitating novel approaches to distributed storage that respect physical jurisdictional borders while preserving analytical coherence.

The problem.

The architectural challenge centers on the fundamental impedance mismatch between write-optimized ingestion paths and read-optimized analytical queries within a multi-tenant environment. High-cardinality dimensions—such as unique device identifiers or millisecond-precision timestamps—create explosive index growth that degrades performance in traditional B-tree or LSM-tree storage engines. Furthermore, enforcing strict tenant isolation without compromising resource utilization requires solving the "noisy neighbor" problem at planetary scale: a single tenant's flash flood of sensor data must not be allowed to degrade query performance for others, all while the system maintains durability and consistency guarantees across regions subject to unpredictable network partitions.

The solution.

A Lambda-architecture pattern provides the theoretical foundation, separating the velocity layer (hot, recent data) from the batch layer (cold, historical data). The ingestion tier employs Apache Kafka or Apache Pulsar partitioned by geographic region to satisfy data residency requirements, with Kafka Streams performing real-time downsampling to mitigate cardinality pressure. Hot storage utilizes Apache Cassandra or ScyllaDB with composite primary keys (time_bucket, device_hash) to distribute write load, while cold storage leverages Apache Parquet files in S3-compatible object stores with Apache Iceberg table formats for schema evolution. Query federation through Trino or Presto aggregates across these heterogeneous tiers, with Envoy proxies enforcing geo-fencing logic at the network edge to prevent cross-border data leakage.
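The region-aware routing and composite-key design described above can be sketched in plain Python. This is an illustrative stand-in, not the platform's actual code: the topic names, the one-hour bucket size, and the truncated SHA-256 device hash are all assumptions made for the example.

```python
import hashlib

# Hypothetical per-region topics; names are illustrative only.
REGION_TOPICS = {"us": "telemetry.us", "br": "telemetry.br", "de": "telemetry.de"}

BUCKET_SECONDS = 3600  # assumed one-hour time buckets


def route(device_id: str, region: str, ts: float) -> dict:
    """Pick a residency-compliant topic and build a composite partition key.

    The key pairs a coarse time bucket with a stable hash of the device ID,
    so concurrent writes scatter across nodes instead of hot-spotting on the
    partition that owns the newest time range.
    """
    if region not in REGION_TOPICS:
        raise ValueError(f"no in-region cluster for {region!r}")
    time_bucket = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
    device_hash = hashlib.sha256(device_id.encode()).hexdigest()[:8]
    return {
        "topic": REGION_TOPICS[region],  # raw data never leaves its region
        "partition_key": (time_bucket, device_hash),
    }


msg = route("sensor-42", "de", 1_700_000_000.0)
```

A producer would then publish to `msg["topic"]` with `msg["partition_key"]` as the message key; the geo-fencing proxies described above enforce that the regional topic is only reachable from within its jurisdiction.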

A situation from real life

In late 2023, a multinational agricultural technology firm deployed soil sensors and drone imagery systems across 40,000 farms located in the United States, Brazil, and Germany. Each farm generated 2,000 distinct time-series metrics every 30 seconds—from pH levels to multispectral imaging data—roughly 2.7 million data points per second in aggregate, batched into a sustained load of 80,000 write requests per second with extremely high cardinality due to unique sensor UUIDs. Their initial monolithic TimescaleDB deployment in AWS us-east-1 suffered catastrophic performance degradation during harvest seasons, with query latency for three-month yield trend analysis exceeding 60 seconds. Compounding the technical failure, GDPR compliance officers discovered that German farm data was being replicated to American availability zones for redundancy, creating immediate regulatory liability and potential fines of 4% of global revenue.

Solution A: Federated regional clusters with cross-region read replicas.

This approach proposed deploying independent InfluxDB clusters in each sovereign region, utilizing Kafka MirrorMaker to asynchronously replicate only aggregated, anonymized statistics to a global reporting cluster. The primary advantage was strict compliance with data residency laws, as raw telemetry never crossed borders. However, the asynchronous replication introduced significant latency in global analytics, with data staleness exceeding 15 minutes. Furthermore, the solution created a single point of failure in the global cluster, which would lose all query capabilities if network partitions isolated it from regional replicas, violating the availability requirements for real-time crop monitoring.

Solution B: Centralized cloud-native TSDB with client-side encryption and key escrow.

This strategy suggested adopting Amazon Timestream with AES-256 client-side encryption where European devices retained decryption keys locally, theoretically satisfying GDPR Article 44 regarding data transfers. The benefits included managed infrastructure and automatic scaling without operational overhead. The critical flaw was legal rather than technical: European courts have ruled that encrypted data still constitutes personal data if the controller holds the potential means of decryption, creating regulatory ambiguity. Additionally, Timestream's query engine struggled with high-cardinality joins across millions of unique sensor IDs, often timing out on complex agricultural queries involving geospatial overlays.

Solution C: Tiered storage architecture with edge pre-aggregation and CRDT-based reconciliation.

This solution implemented Telegraf agents on farm gateways to pre-aggregate 5-minute windows of telemetry, reducing cardinality by 95% through statistical summarization (mean, max, min, count) before ingestion. Regional Cassandra clusters stored hot data (30 days) with Time-To-Live compaction, while Apache Spark jobs compressed historical data into Parquet format in regional S3 buckets with Snappy compression. Trino federated queries across these tiers using Iceberg table abstractions, while Istio service mesh enforced strict geo-fencing at the network layer. The trade-off was increased architectural complexity and the need for sophisticated CRDT logic to merge edge-buffered data during network partitions, but this uniquely satisfied all technical and legal constraints.
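The gateway-side rollup logic can be sketched as a minimal, dependency-free Python stand-in for what the Telegraf aggregation stage would produce. The 5-minute window and the min/max/sum/count summary come from the description above; the function and field names are illustrative.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute pre-aggregation windows, per the design above


def preaggregate(samples):
    """Collapse raw (metric, ts, value) samples into per-window rollups.

    Returns {(metric, window_start): {"min", "max", "sum", "count"}} —
    the statistical summary (mean is derivable as sum/count) that the
    gateways ship upstream instead of raw high-cardinality points.
    """
    windows = defaultdict(
        lambda: {"min": float("inf"), "max": float("-inf"), "sum": 0.0, "count": 0}
    )
    for metric, ts, value in samples:
        window_start = int(ts) // WINDOW_SECONDS * WINDOW_SECONDS
        w = windows[(metric, window_start)]
        w["min"] = min(w["min"], value)
        w["max"] = max(w["max"], value)
        w["sum"] += value
        w["count"] += 1
    return dict(windows)


rollups = preaggregate([
    ("soil_ph", 0, 6.1), ("soil_ph", 120, 6.5), ("soil_ph", 360, 6.3),
])
# Two windows: [0, 300) holds two samples, [300, 600) holds one.
```

Note the trade-off the text alludes to: the rollup is lossy by design, which is why the dual-path strategy discussed later keeps raw points in a short-TTL hot tier for drill-down.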

Which solution was chosen (and why).

The engineering team selected Solution C after a six-week proof-of-concept, prioritizing legal certainty and query performance over operational simplicity. The CRDT-based conflict resolution proved essential for agricultural environments where network connectivity was intermittent, allowing tractors and drones to buffer metrics locally and merge states seamlessly upon reconnection without data loss. The cost savings from Parquet compression and S3 Glacier archival—estimated at 82% reduction in storage spend compared to hot-only storage—provided executive sponsorship for the increased engineering investment.

The result.

The production system now sustains 120,000 writes per second with P99 ingestion latency under 30ms and analytical query latency under 800ms for 12-month trend analysis across all 40,000 farms. The architecture successfully passed independent GDPR and LGPD (Brazil) compliance audits, confirming that raw telemetry remained physically within respective jurisdictions. During the 2024 harvest season, the system survived a three-hour complete outage of the us-east-1 region without data loss, automatically rerouting traffic to us-west-2 while maintaining strict data residency for German farms through the geo-federated query layer.

What candidates often miss

How do you prevent cardinality explosions from unique device IDs or high-frequency timestamps without losing the ability to drill down into individual device telemetry?

Many junior architects incorrectly suggest simply adding more Kafka partitions or scaling Cassandra nodes horizontally to absorb the write pressure. The sophisticated answer involves implementing a hierarchical aggregation strategy using Apache Flink or Kafka Streams to maintain "dual paths": raw high-cardinality data resides in the hot tier (SSD-backed ScyllaDB) for 24-48 hours with aggressive TTL policies, while pre-aggregated, low-cardinality rollups (by farm zone or equipment type) are simultaneously written to the warm tier. This requires Bloom filters to prevent duplicate processing during windowed aggregations. It also requires recognizing that cardinality is fundamentally a storage problem, not merely a throughput issue: the choice between Size-Tiered and Leveled LSM-tree compaction strategies should be driven by the update frequency of the specific metric dimensions involved.
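A minimal sketch of the dual-path idea follows, with a toy Bloom filter for deduplication. The filter size, key format, and zone-prefix convention are illustrative assumptions, and the sketch exposes the real trade-off of Bloom-based dedup: a false positive can occasionally drop a genuinely new sample from the rollup path.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter used to skip samples already folded into a rollup."""

    def __init__(self, size_bits=8192, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a big int used as a bit set

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str):
        # May return a false positive, never a false negative.
        return all(self.bits >> pos & 1 for pos in self._positions(key))


def dual_path_write(sample, raw_tier, rollups, seen):
    """Write the raw point to the hot tier; fold it into a rollup at most once."""
    metric, device_id, ts, value = sample
    raw_tier.append(sample)  # hot tier keeps full cardinality under a short TTL
    key = f"{metric}:{device_id}:{ts}"
    if key in seen:  # duplicate delivery: do not double-count in the rollup
        return
    seen.add(key)
    zone = device_id.split("-")[0]  # rollup key drops the unique device ID
    agg = rollups.setdefault((metric, zone), {"sum": 0.0, "count": 0})
    agg["sum"] += value
    agg["count"] += 1
```

Processing the same sample twice lands two rows in the raw tier (idempotent upserts handle that downstream) but only one increment in the low-cardinality rollup.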

What are the specific trade-offs between using time-based partitioning versus hash-based partitioning for the primary key in a distributed time-series store like Cassandra or ScyllaDB?

Candidates frequently default to time-based partitioning (e.g., partitioning by day) because it aligns logically with time-range queries and simplifies TTL-based data expiration. However, this creates severe hot-spotting on the node owning the latest partition during high-velocity ingestion, violating the uniform-distribution principle of distributed systems. The correct approach employs composite partition keys that combine time buckets (e.g., hour) with a hash of the device ID to scatter writes, while clustering columns on the precise timestamp preserve time-range scan efficiency within each partition. One must also account for the "wide row" problem in Cassandra, where excessive clustering columns cause heap pressure during compaction. Certain query patterns then demand a secondary index or SASI (SSTable Attached Secondary Index) strategy, which introduces write amplification that should be modeled with the Universal Scalability Law (USL) to predict concurrency limits.
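The contrast between the two partitioning schemes can be demonstrated in a few lines of Python. MD5 stands in here for Cassandra's actual Murmur3 partitioner, and the vnode count of 16 is an arbitrary illustration.

```python
import hashlib


def time_only_key(ts: int) -> int:
    """Partition by day: every live write lands on the same partition."""
    return int(ts) // 86400


def composite_key(device_id: str, ts: int, vnodes: int = 16) -> tuple:
    """Hour bucket plus a device-hash shard: live writes scatter across nodes."""
    hour_bucket = int(ts) // 3600
    shard = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % vnodes
    return (hour_bucket, shard)


now = 1_700_000_000
devices = [f"sensor-{i}" for i in range(1000)]

hot = {time_only_key(now) for _ in devices}        # collapses to one partition
spread = {composite_key(d, now) for d in devices}  # fans out across 16 shards
```

Within each composite partition, the precise timestamp would be a clustering column, so time-range scans stay sequential on disk even though writes are scattered.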

How do you maintain causal consistency and total ordering of events across geographically distributed time-series replicas when network partitions occur and system clocks are unreliable?

This question probes deep understanding of distributed consensus in temporal contexts. Most candidates incorrectly propose NTP synchronization or Vector Clocks without understanding their limitations: NTP cannot guarantee millisecond precision across continents, and Vector Clocks scale poorly with node count in large clusters. The architecturally sound solution embeds Hybrid Logical Clocks (HLC) in the metric payload, combining physical timestamps with logical counters to capture happens-before relationships without tight clock synchronization. During partitions, the system must implement Multi-Version Concurrency Control (MVCC) with conflict-free replicated data types (CRDTs) suited to time-series, such as G-Counters for monotonic sums or PN-Counters for bidirectional metrics. This allows divergent regional replicas to merge automatically upon reconnection, without administrative intervention or data loss, while preserving the causal chain of agricultural events like "irrigation stopped before soil moisture dropped."
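A compact sketch of the two primitives named above (an HLC for causal timestamps, a G-Counter for mergeable sums), assuming integer wall-clock readings; real implementations add overflow handling and bounded drift checks.

```python
class HLC:
    """Hybrid Logical Clock: (physical_time, logical_counter) pairs."""

    def __init__(self):
        self.pt = 0  # highest physical time observed so far
        self.lc = 0  # logical counter breaking ties within one tick

    def now(self, wall_clock: int) -> tuple:
        """Timestamp a local event."""
        if wall_clock > self.pt:
            self.pt, self.lc = wall_clock, 0
        else:
            self.lc += 1
        return (self.pt, self.lc)

    def recv(self, wall_clock: int, remote: tuple) -> tuple:
        """Fold in a remote timestamp so happens-before is preserved."""
        rpt, rlc = remote
        m = max(wall_clock, self.pt, rpt)
        if m == self.pt == rpt:
            self.lc = max(self.lc, rlc) + 1
        elif m == self.pt:
            self.lc += 1
        elif m == rpt:
            self.lc = rlc + 1
        else:
            self.lc = 0
        self.pt = m
        return (self.pt, self.lc)


class GCounter:
    """Grow-only counter: each replica owns a slot; merge is element-wise max."""

    def __init__(self, replica_id: str):
        self.id = replica_id
        self.counts = {}

    def inc(self, n: int = 1):
        self.counts[self.id] = self.counts.get(self.id, 0) + n

    def merge(self, other: "GCounter"):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

Because merge is commutative, associative, and idempotent, two regional replicas that diverged during a partition converge to the same total regardless of merge order, which is exactly the reconnection behavior the agricultural deployment relied on.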