System Architecture

How would you engineer a fault-tolerant, strongly consistent metadata control plane for a globally distributed object storage system that indexes billions of objects across heterogeneous storage tiers, handles concurrent namespace modifications with optimistic concurrency control, and guarantees linearizable consistency with sub-millisecond read latency for hot paths?


Answer to the question

The evolution of object storage has shifted from centralized metadata databases, such as early Ceph and Swift implementations, toward distributed architectures capable of operating at hyperscale. This transition introduced a fundamental tension between the need for POSIX-like semantics (atomic renames, strict serializability) and the horizontal scalability required to manage billions of keys across SSD, HDD, and tape tiers. The core problem lies in coordinating concurrent mutations across distributed nodes without incurring the latency penalties of traditional Two-Phase Commit (2PC) protocols or Paxos-based global consensus.

The solution requires a sharded consensus architecture where each shard governs a specific namespace partition using the Raft protocol to ensure linearizable consistency within that boundary. A Metadata Router layer directs client requests based on directory prefixes, utilizing consistent hashing to maintain locality for range queries while enabling horizontal scaling. For performance, hot metadata resides in a Tiered Cache comprising Redis for L1 and local RocksDB for L2 persistence, while cold metadata is compacted into Apache Parquet files on S3 to reduce storage costs without sacrificing durability.
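The prefix-based routing with consistent hashing can be sketched as a hash ring keyed on the top-level directory prefix, so every entry under one directory lands on the same shard. A minimal sketch, assuming illustrative shard names, a vnode count of 64, and MD5 as the ring function (none of which are specified in the original design):

```python
import bisect
import hashlib


class PrefixRouter:
    """Consistent-hash ring that routes keys by directory prefix."""

    def __init__(self, shards, vnodes=64):
        # Each shard gets `vnodes` points on the ring for even spread.
        self._ring = []
        for shard in shards:
            for v in range(vnodes):
                h = int(hashlib.md5(f"{shard}:{v}".encode()).hexdigest(), 16)
                self._ring.append((h, shard))
        self._ring.sort()

    def route(self, key):
        # Hash only the top-level directory so all entries under one
        # directory map to the same Raft group (locality for range scans).
        prefix = key.split("/", 1)[0]
        h = int(hashlib.md5(prefix.encode()).hexdigest(), 16)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Because routing depends only on the prefix, a range query over one directory touches a single shard, while distinct directories spread across the ring.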

A real-world situation

A media company migrating from AWS S3 to a private hybrid cloud needed to manage 2 billion video segments with metadata tracking encoding profiles, DRM keys, and lifecycle states. Their initial architecture used MongoDB with automatic sharding, which suffered from unpredictable latency spikes (100-500ms) during chunk migrations and lacked atomic cross-shard transactions, causing data corruption during concurrent folder moves.

Solution 1: CockroachDB (Distributed SQL)

This approach offered native horizontal scaling and serializable isolation through a Google Spanner-like architecture. The primary advantage was the familiar SQL interface for complex analytics queries on metadata. However, the system exhibited high write amplification (3-5x) due to multi-region consensus replication, and latency consistently exceeded 20ms during viral content uploads when write contention peaked. The licensing costs for petabyte-scale metadata storage proved economically infeasible for the organization.

Solution 2: Apache Cassandra with Lightweight Transactions (LWT)

Cassandra provided massive write throughput and tunable consistency, with Paxos-based LWTs offering linearizable operations. The technology excelled at ingesting high-velocity metadata streams without single points of failure. Unfortunately, Paxos latency averaged 15ms and degraded significantly under concurrent access, while secondary indexing for "list by upload date" queries required expensive full table scans, making it unsuitable for interactive user experiences.

Solution 3: Custom Shard-per-Directory with Raft

This design mapped each user directory to a dedicated Raft consensus group, ensuring that operations within a channel (the primary unit of access) were linearizable and fast due to local disk access. The architecture supported atomic renames via shard-local transactions without cross-network coordination. While this introduced complexity in resharding logic for viral directories (hotspots) and required a sophisticated client-side routing library, it perfectly matched the workload pattern where video content naturally partitioned by creator.
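The shard-local atomic rename with an optimistic-concurrency check can be illustrated with a single-node stand-in for one Raft group's state machine. This is a sketch only: the lock below stands in for Raft's serialized apply loop, and all method and key names are hypothetical:

```python
import threading


class DirectoryShard:
    """One directory's Raft group, sketched as a locked dict of versioned entries."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> (value, version)

    def put(self, key, value):
        with self._lock:
            _, ver = self._entries.get(key, (None, 0))
            self._entries[key] = (value, ver + 1)
            return ver + 1  # caller holds this for later OCC checks

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            return entry[0] if entry else None

    def rename(self, src, dst, expected_version):
        """Atomic shard-local rename; fails if src changed since it was read."""
        with self._lock:
            if src not in self._entries:
                return False
            value, ver = self._entries[src]
            if ver != expected_version:
                return False  # OCC conflict: caller re-reads and retries
            del self._entries[src]
            self._entries[dst] = (value, ver + 1)
            return True
```

Because both the delete and the insert happen under one shard's serialized apply path, the rename never needs cross-network coordination, which is exactly why the shard-per-directory layout keeps renames fast.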

Result: The system successfully sustained 80,000 metadata operations per second during viral events with P99 latency under 3ms. Automated tiering policies moved 90% of aging content to cold storage, reducing total infrastructure costs by 60% while maintaining strict consistency guarantees for active content.

What candidates often miss

How do you prevent thundering herd issues on the metadata cache when a popular object expires or is updated?

Candidates often suggest simple TTL-based expiration without considering stampede protection. The correct approach implements lease-based caching where cache entries carry short-lived lease tokens, ensuring only the lease holder refreshes from the backend while others wait or serve stale data briefly. Combine this with probabilistic early expiration (adding random jitter to TTLs) and the singleflight pattern (as implemented in Go's singleflight package) to collapse concurrent identical requests into a single backend query, preventing database overload during cache misses.
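The singleflight-plus-jitter combination described above can be sketched in a few lines. This is a minimal single-process sketch, assuming an arbitrary `loader` callable and illustrative TTL and jitter values; the original mentions Go's singleflight package, and this mirrors the same idea in Python:

```python
import random
import threading
import time


class SingleFlightCache:
    """Collapses concurrent misses per key into one backend call,
    with jittered TTLs to avoid synchronized expiry."""

    def __init__(self, loader, ttl=30.0, jitter=0.2):
        self._loader = loader
        self._ttl, self._jitter = ttl, jitter
        self._lock = threading.Lock()
        self._cache = {}     # key -> (value, expires_at)
        self._inflight = {}  # key -> Event the waiters block on

    def get(self, key):
        while True:
            with self._lock:
                hit = self._cache.get(key)
                if hit and hit[1] > time.monotonic():
                    return hit[0]
                ev = self._inflight.get(key)
                leader = ev is None
                if leader:
                    ev = self._inflight[key] = threading.Event()
            if leader:
                try:
                    value = self._loader(key)
                    # Probabilistic early expiration: randomize the TTL
                    # so hot keys don't all expire in the same instant.
                    ttl = self._ttl * (1 - self._jitter * random.random())
                    with self._lock:
                        self._cache[key] = (value, time.monotonic() + ttl)
                finally:
                    with self._lock:
                        del self._inflight[key]
                    ev.set()
                return value
            ev.wait()  # loop back and read the freshly populated entry
```

Only the first miss (the "lease holder") reaches the backend; every concurrent caller blocks on the event and then reads from the cache, which is the stampede protection the answer describes.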

What strategy ensures metadata consistency during a live shard split (resharding) operation without cluster downtime?

Many candidates propose stopping writes during migration, which violates availability requirements. The proper technique uses shadow indexing and double-write protocols. First, instantiate the new target shard as a lagging replica of the source shard using Raft log shipping. Once synchronized, switch the write path to the new shard while maintaining a tombstone log in the old shard for a grace period to handle straggler reads. A coordination service like etcd atomically updates the routing table, while MVCC timestamps ensure reads during the transition see consistent snapshots, rejecting requests that span the split boundary until atomic cutover completes.
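The double-write window and atomic cutover reduce to a small state machine. In this sketch the dict-backed shards stand in for real Raft groups, and the etcd compare-and-swap on the routing table is elided; all names are hypothetical:

```python
class SplitCoordinator:
    """Drives one shard split: double-write while migrating, then cut over."""

    MIGRATING, CUT_OVER = "migrating", "cut_over"

    def __init__(self, source, target):
        self.source, self.target = source, target  # dict stand-ins for shards
        self.state = self.MIGRATING

    def write(self, key, value):
        if self.state == self.MIGRATING:
            # Double-write: source stays authoritative, target catches up.
            self.source[key] = value
            self.target[key] = value
        else:
            self.target[key] = value

    def read(self, key):
        if self.state == self.MIGRATING:
            return self.source.get(key)  # stragglers still read the source
        return self.target.get(key)

    def cut_over(self):
        # In the real design this flip is an atomic etcd routing-table
        # update; here we just copy the remainder and switch state.
        self.target.update(self.source)
        self.state = self.CUT_OVER
```

The key property is that no write is ever lost across the flip: during migration every write lands in both shards, and the cutover copies anything the target is still missing before reads switch over.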

How do you reconcile the metadata index with the physical storage layer when asynchronous garbage collection or tiering migrations fail silently?

This requires an event-sourced approach with Saga patterns for distributed transactions. The metadata service emits domain events (e.g., "TieringInitiated") to an Apache Kafka log, with the storage consumer acknowledging successful processing via idempotent callbacks. If the storage layer fails to migrate the object within a specified timeout, the metadata service receives a Saga timeout event and triggers a compensating transaction to revert the metadata state to "HOT". Additionally, implement a background reconciliation scanner using Merkle trees to efficiently identify divergences between metadata and physical storage HEAD requests, repairing inconsistencies without requiring full table scans.