System Architecture

Devise a fault-tolerant architecture for a globally distributed, highly available ID generation service. The service must produce strictly monotonic, k-sortable identifiers across geographically dispersed data centers without relying on synchronized physical clocks, handle burst traffic of millions of IDs per second per region, and guarantee global uniqueness during arbitrary network partitions and regional outages.

Pass interviews with Hintsage AI assistant

Answer to the question

History of the question

The evolution of distributed ID generation traces from centralized database sequences, which became bottlenecks in microservices architectures, through UUID variants, to Twitter's Snowflake. Early approaches relied heavily on NTP-synchronized clocks, which proved fragile during leap seconds, clock drift, and network partitions. Modern requirements for event sourcing and globally consistent logging demanded strictly monotonic sequences that respect causality without coordination overhead.

The problem

Traditional approaches face the clock skew dilemma between availability and ordering. Pure physical timestamps require tight synchronization, violating partition tolerance per the CAP theorem, while pure logical clocks such as Lamport timestamps or vector clocks sacrifice temporal locality and database compression efficiency. The challenge intensifies when requiring k-sortability for database indexing efficiency: this rough time-ordering must coexist with strict monotonicity, ensuring no backward movement during failover scenarios. Additionally, regional isolation during submarine cable cuts must not cause ID collisions or availability loss.

The solution

Implement a Hybrid Logical Clock (HLC) architecture combining physical time (millisecond component) with logical counters, augmented by node ID partitioning. Each regional cluster receives a node ID (10-16 bits) from a consensus service like etcd or ZooKeeper only at startup or membership change. Within each node, the HLC increments its logical component when physical time hasn't advanced, ensuring monotonicity despite clock adjustments.

The ID structure combines epoch milliseconds (41 bits) + logical counter (12 bits) + node ID (10 bits), 63 bits in total, fitting a signed 64-bit integer. During partitions, nodes continue allocating from their local logical counter space. Upon partition healing, the HLC's max-plus-one merging rule ensures causality preservation without central coordination.
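A sketch of the bit layout above, assuming a hypothetical custom epoch (the 2020-01-01 constant is illustrative, not from the source):

```python
# Hypothetical custom epoch: 2020-01-01T00:00:00Z, in milliseconds.
CUSTOM_EPOCH_MS = 1_577_836_800_000

def pack_id(epoch_ms: int, counter: int, node_id: int) -> int:
    """Pack 41-bit milliseconds + 12-bit counter + 10-bit node ID into 63 bits."""
    assert 0 <= epoch_ms < (1 << 41)
    assert 0 <= counter < (1 << 12)
    assert 0 <= node_id < (1 << 10)
    return (epoch_ms << 22) | (counter << 10) | node_id

def unpack_id(id64: int) -> tuple:
    """Recover (epoch_ms, counter, node_id) from a packed ID."""
    return (id64 >> 22, (id64 >> 10) & 0xFFF, id64 & 0x3FF)
```

Placing the timestamp in the high bits is what makes the IDs k-sortable: integer comparison orders them by time first, then by counter and node.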


Real-world scenario

A global cryptocurrency exchange required transaction ID generation across AWS us-east-1, eu-west-1, and ap-southeast-1. The system needed to process 8 million orders per second during market volatility while maintaining strict temporal ordering for regulatory audit trails. Network partitions during undersea cable maintenance had previously caused UUIDv4 collision risks in their legacy system, resulting in database unique constraint violations and trading halts.

Solution 1: Centralized PostgreSQL sequence with caching

Deploying a PostgreSQL sequence with application-level batch allocation (fetching 10,000 IDs at once) reduced database round-trips. However, during the Asia-Pacific network partition, the caching nodes exhausted their allocated ranges within 90 seconds, forcing fallback to UUID generation which broke audit trail ordering. The single RDS instance also created a 140ms latency penalty for cross-region writes, violating the sub-50ms generation requirement.

Solution 2: Snowflake-inspired Twitter algorithm

Implementing Snowflake with ZooKeeper-managed node IDs achieved 22,000 IDs/sec per node and excellent sortability with compact 64-bit IDs. However, when NTP daemons on European nodes experienced leap-second smearing while US nodes used immediate stepping, the system generated duplicate millisecond timestamps, requiring expensive database constraint checks that degraded throughput by 40%.

Solution 3: Hybrid Logical Clock with CRDT convergence

Adopting CockroachDB's HLC pattern, each regional leader maintained a local logical counter allowing 4096 IDs per millisecond per node, with the node ID space partitioned by region. During the Singapore cable cut, isolated nodes continued generating IDs using their logical counters, and upon reconnection, the HLC comparison function ensured no duplicates while preserving causality. This approach traded the compact 64-bit format for a 128-bit ID width in exchange for correctness guarantees, and maintained availability during partitions.
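The max-plus-one merge applied on reconnection might look like the following simplified sketch (it ignores the local wall clock, which a full HLC receive rule also folds in):

```python
def hlc_receive(local, remote):
    """Merge a remote HLC timestamp into the local one after a partition heals.

    Takes the max physical component and bumps the logical counter so the
    merged timestamp dominates everything either side has already issued.
    """
    lp, ll = local
    rp, rl = remote
    if lp == rp:
        return (lp, max(ll, rl) + 1)
    if lp > rp:
        return (lp, ll + 1)
    return (rp, rl + 1)
```

The returned timestamp is strictly greater than both inputs, which is exactly the causality (happens-before) guarantee the merge needs.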

Chosen Solution and Result

Solution 3 was selected due to its partition tolerance and monotonicity guarantees. The system successfully endured a 4-hour partition during a South China Sea cable maintenance, processing 12 million IDs/sec in the isolated Tokyo region without duplication. Post-reconciliation required zero ID rewrites due to the HLC's happens-before tracking, and storage costs decreased 15% compared to UUID due to lexicographic ordering reducing RocksDB compactions.


What candidates often miss

How do you handle clock skew when the physical component of an HLC jumps backward due to NTP corrections?

Most candidates assume NTP always moves time forward. In reality, aggressive clock skew correction can set time backward by hundreds of milliseconds. The solution requires maintaining a persistent monotonic clock (similar to CockroachDB's "synthetic" time): when the OS reports a timestamp less than the last allocated ID's physical component, the system ignores the physical regression and continues incrementing only the logical counter until real time catches up.

Additionally, implement clock bound propagation, where nodes gossip their maximum drift confidence intervals and reject generation requests if local uncertainty exceeds 10ms. This detects desynchronized nodes before they issue IDs, preventing the "rewind" anomalies that violate external consistency.
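Both mechanisms, the backward-jump guard and the uncertainty check, can be combined in one sketch (the class name and the explicit `wall_ms`/`uncertainty_ms` parameters are illustrative; a real node would read these from the OS and an NTP daemon):

```python
class GuardedGenerator:
    """Sketch: refuse to issue timestamps when clock uncertainty is too large,
    and never let the physical component move backward."""
    MAX_UNCERTAINTY_MS = 10

    def __init__(self):
        self.last_physical = 0  # would be persisted across restarts
        self.logical = 0

    def timestamp(self, wall_ms: int, uncertainty_ms: int):
        if uncertainty_ms > self.MAX_UNCERTAINTY_MS:
            raise RuntimeError("clock uncertainty too high; refusing to generate")
        if wall_ms <= self.last_physical:
            # NTP stepped time backward (or it stalled): ignore the regression
            # and keep counting logically until real time catches up.
            self.logical += 1
        else:
            self.last_physical, self.logical = wall_ms, 0
        return (self.last_physical, self.logical)
```

Persisting `last_physical` is what makes the guard survive process restarts; an in-memory value alone would allow a rewind after a crash.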

What are the operational implications of node ID exhaustion in a long-running cluster with ephemeral containers?

Candidates frequently overlook that 10-bit node IDs allow only 1,024 unique generators. In a Kubernetes environment with frequent pod restarts, naive ID allocation exhausts the namespace within weeks. The resolution is epoch-based recycling: node IDs are leased with TTLs (time-to-live) in etcd, and recycled IDs enter a "tombstone" quarantine period exceeding the maximum clock skew (typically 24 hours).

During redeployment, the system checks the HLC of the last ID issued by that node ID. If the current global time minus that timestamp exceeds the quarantine, the ID is safe to reallocate. This requires a graveyard service tracking retired node metadata.
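The lease-and-quarantine lifecycle can be sketched with an in-memory stand-in for the etcd-backed registry (the class and method names are hypothetical, and real TTL renewal via etcd leases is omitted):

```python
QUARANTINE_S = 24 * 3600  # must exceed the maximum possible clock skew

class NodeIdRegistry:
    """In-memory stand-in for an etcd-backed lease table: node IDs are
    leased, and after release sit in quarantine before reuse."""

    def __init__(self, id_bits=10):
        self.free = set(range(1 << id_bits))
        self.retired_at = {}  # node_id -> wall time it was released

    def acquire(self, now):
        # Move IDs whose quarantine has elapsed back into the free pool.
        for nid, t in list(self.retired_at.items()):
            if now - t >= QUARANTINE_S:
                del self.retired_at[nid]
                self.free.add(nid)
        if not self.free:
            raise RuntimeError("node ID space exhausted")
        return self.free.pop()

    def release(self, node_id, now):
        self.retired_at[node_id] = now
```

In production the `retired_at` map would live in the graveyard service, keyed by the HLC of the last ID each retired node issued rather than by wall time.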

Why does k-sortability cause write hotspotting in distributed databases, and how do you mitigate it?

K-sortable IDs (like Snowflake) concentrate writes at the "hot end" of LSM-tree or B-tree structures, overwhelming the latest SSTable or rightmost leaf page. Candidates often miss that while k-sortability improves read locality, it creates write amplification in Cassandra or TiKV. The mitigation introduces entropy through shard prefixes: prepend a 4-bit hash of the node ID or client session to the ID, distributing writes across 16 RocksDB memtables while preserving rough temporal order.
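A minimal sketch of the shard-prefix technique, assuming 64-bit packed IDs and using the low node-ID bits as a cheap 4-bit hash (a real system might hash a client session instead):

```python
def shard_prefixed_key(id64: int, node_id: int) -> bytes:
    """Prepend a 4-bit shard to the big-endian ID, spreading sequential
    writes across 16 key ranges while keeping each range time-ordered."""
    shard = node_id & 0xF
    return bytes([shard]) + id64.to_bytes(8, "big")
```

Within one shard, keys still sort by time, so range scans over recent data stay cheap; across shards, the write load fans out over 16 distinct key ranges.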

For CockroachDB, use hash-sharded indexes on top of the ID column. Alternatively, employ write buffering, where recent IDs are staged in Redis Streams before batch insertion into cold storage, decoupling ingestion from compaction cycles.