System Architecture

How would you engineer a globally distributed, latency-sensitive feature store that serves pre-computed ML features to real-time inference endpoints across heterogeneous cloud regions, ensuring microsecond-level read latency for hot features through tiered caching, maintaining strong consistency between online and offline feature values during backfilling operations, and implementing automated feature drift detection with cross-region model retraining triggers?


Answer to the question

The architecture employs a dual-store pattern that strictly separates online serving from offline training concerns. The online tier utilizes Redis Cluster deployed on NVMe-backed instances within each region, fronted by Envoy Proxy for local load balancing and TLS termination. Feature updates flow through Apache Kafka acting as the immutable changelog, with Debezium CDC connectors capturing mutations from operational databases and streaming them to regional Redis consumers.

For offline storage, historical features are compacted into Apache Iceberg tables on S3, enabling time-travel queries and efficient batch processing via Apache Spark. Consistency during backfilling is achieved through vector clock versioning: each feature value carries a logical timestamp, and Redis Lua scripts perform atomic compare-and-swap operations to reject out-of-order writes, ensuring the serving path never observes partial backfill states.
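The compare-and-swap logic that the Redis Lua scripts perform server-side can be sketched in plain Python against an in-memory dict, assuming a single integer logical timestamp per feature value:

```python
# In-memory sketch of the versioned compare-and-swap: a write is applied
# only if its logical timestamp is strictly newer than the stored one,
# so out-of-order backfill writes are rejected atomically.

store: dict[str, tuple[int, float]] = {}  # key -> (version, value)

def versioned_set(key: str, version: int, value: float) -> bool:
    """Apply the write iff `version` is strictly newer; return True if applied."""
    current = store.get(key)
    if current is not None and current[0] >= version:
        return False              # stale or duplicate write: reject
    store[key] = (version, value)
    return True

assert versioned_set("f1", 10, 1.0) is True
assert versioned_set("f1", 9, 0.5) is False   # out-of-order write rejected
assert store["f1"] == (10, 1.0)
```

In production this check runs inside a Lua script via EVAL, so the read-compare-write sequence executes atomically on the Redis server rather than in the client.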

Drift detection instruments feature distributions as Prometheus histograms for observability, while an Apache Flink job consumes the same feature stream and performs real-time statistical analysis on the distributions. When KL-divergence or the population stability index exceeds its threshold, Flink triggers Argo Workflows to orchestrate cross-region model retraining and canary deployments.
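The population stability index mentioned above is straightforward to compute from two binned distributions. A minimal sketch, assuming per-bin proportions that each sum to 1 and the commonly cited 0.2 retraining threshold (the bin values below are invented for illustration):

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population stability index between two binned distributions.

    Inputs are per-bin proportions summing to 1; a small epsilon guards
    against empty bins. PSI > 0.2 is a common retraining trigger.
    """
    eps = 1e-6
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time distribution
serving  = [0.40, 0.30, 0.20, 0.10]   # live distribution
drifted = psi(baseline, serving) > 0.2
```

The Flink job would evaluate this per feature per window, emitting a retraining trigger event whenever `drifted` holds across consecutive windows.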

Situation from life

A multinational fintech firm required real-time fraud detection capabilities across AWS, Azure, and on-premise data centers. The critical challenge involved serving rolling aggregation features—such as user transaction velocity over the last hour—to inference endpoints with sub-5ms latency. Their existing PostgreSQL read replicas suffered from replication lag exceeding 200ms during peak loads, causing fraud scoring models to operate on stale data and miss coordinated attacks.
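The rolling aggregation feature described above, transaction velocity over a trailing hour, can be sketched as a single-process sliding window. This is a simplification under the assumption of in-order timestamps; the production pipeline computes it in the streaming layer, not in the serving process:

```python
from collections import deque

class TxnVelocity:
    """Sliding-window count of transactions in the last `window_s` seconds."""

    def __init__(self, window_s: int = 3600):
        self.window_s = window_s
        self.events: deque[float] = deque()  # transaction timestamps, in order

    def record(self, ts: float) -> None:
        self.events.append(ts)

    def velocity(self, now: float) -> int:
        # Evict timestamps that have fallen out of the trailing window.
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        return len(self.events)

v = TxnVelocity()
for t in (0, 300, 3500, 3700):
    v.record(t)
count = v.velocity(now=3800)   # 300, 3500, 3700 fall within the trailing hour
```

Each evaluation of `velocity` is O(evicted events), which keeps hot-path reads cheap at the cost of holding raw timestamps in memory.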

Solution 1: Global Active-Active Database

Deploying CockroachDB or Google Spanner promised serializable isolation and automatic global replication. This approach eliminated consistency concerns but introduced cross-region write latency exceeding 100ms due to consensus overhead (Paxos in Spanner, Raft in CockroachDB). For high-velocity features requiring immediate visibility of new transactions, this latency proved unacceptable. Furthermore, operational costs scaled steeply with read throughput, making the approach economically unfeasible for millisecond-level serving requirements.

Solution 2: Eventual Consistency with Regional Caches

Implementing independent Redis clusters per region with asynchronous replication via Kafka MirrorMaker provided excellent read performance and linear scalability. However, this created critical consistency vulnerabilities during backfilling operations, when data scientists recomputed historical features to correct data quality issues. Without strict versioning guarantees, the system served stale aggregates alongside fresh ones, leading to model inference skew and erroneous risk scores that falsely flagged legitimate transactions.

Solution 3: Tiered Caching with Vector Clocks (Chosen)

We architected a tiered system using Redis as the hot tier and Kafka as the immutable source of truth. Each feature value carried a vector clock timestamp derived from the ingestion pipeline. During backfilling, Spark jobs wrote to S3 while emitting versioned events to Kafka. Regional consumers applied updates using Redis Lua scripts that performed server-side vector clock comparison, atomically rejecting out-of-order writes while accepting newer versions. For drift detection, we instrumented feature distributions via Prometheus histograms, feeding Flink for real-time statistical comparison against training baselines.

The result reduced P99 serving latency to 1.2ms globally, eliminated consistency violations during backfills, and reduced model degradation incidents by 94% through automated drift-triggered retraining pipelines.

What candidates often miss

How do you prevent cache poisoning during bulk historical feature backfills when the online serving layer must remain available?

Many candidates suggest simply pausing serving during backfills or using distributed transactions spanning the cache and database. The correct approach implements logical timestamps and shadow keyspaces. Backfill data streams through a separate Kafka topic with monotonically increasing version IDs. Online serving clusters maintain two Redis keyspaces: "current" and "staging". The backfill populates staging while serving reads from current. Upon completion, an atomic pointer switch (a single SET on a namespace-pointer key that clients resolve on each read) promotes staging to current in microseconds; Redis RENAME operates on individual keys rather than whole keyspaces, so per-key renames only suit small feature sets. Alternatively, the application layer queries both keyspaces and selects the higher-versioned value. This ensures zero downtime and prevents serving of partial backfill states without complex coordination protocols.

What consistency model should govern the relationship between online and offline feature stores, and why does strong consistency fail at scale?

Candidates often mistakenly advocate for ACID transactions spanning both Redis and S3 using two-phase commit protocols. The offline store optimizes for throughput and batch immutability, while the online store optimizes for low-latency point reads. Strong consistency requires consensus overhead that introduces unacceptable latency in the serving path. Instead, adopt eventual consistency with bounded staleness guarantees. Use Kafka log compaction with a retention-based reconciliation window to ensure the online store converges to the offline store's state within a defined time boundary. For features requiring stricter guarantees, implement write-through caching where the online write acknowledgment waits for Kafka commit confirmation, accepting slightly higher latency for critical features while maintaining high throughput for others through asynchronous replication.
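A bounded-staleness read can be enforced client-side with a simple timestamp check. This is a sketch; the 30-second bound and the `(last_synced_ts, value)` entry shape are assumptions for illustration, not values from the source:

```python
STALENESS_BOUND_S = 30.0   # assumed reconciliation window

def read_with_staleness_bound(entry: tuple[float, float],
                              now: float,
                              bound_s: float = STALENESS_BOUND_S):
    """Return the cached value only if it is within the staleness bound.

    `entry` is (last_synced_ts, value); returning None signals the caller
    to fall back to the authoritative path (e.g., replay from the Kafka log).
    """
    last_synced, value = entry
    if now - last_synced > bound_s:
        return None            # too stale: force reconciliation
    return value

fresh = read_with_staleness_bound((100.0, 3.14), now=110.0)   # within bound
stale = read_with_staleness_bound((100.0, 3.14), now=200.0)   # beyond bound
```

Features configured for write-through caching never trip this check in practice, since their acknowledgment path already waits on the Kafka commit.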

How do you handle feature versioning during A/B testing of models requiring incompatible transformations of the same raw data?

A common error is versioning only the model artifact while ignoring feature schema evolution, leading to training-serving skew. The solution implements feature namespaces and lineage tracking using DataHub or Apache Atlas. Each feature transformation receives a semantic version. The feature store maintains multiple versions simultaneously in Redis using prefixed keys. Model serving configurations specify required feature versions via Consul or etcd. When promoting a model from shadow to production, the orchestration layer pre-warms caches for the new feature version using historical replay from Kafka before traffic shifts. This allows concurrent A/B tests using incompatible feature computations without data leakage between experiment cohorts or cold-start latency spikes.
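The version-prefixed key scheme can be sketched as follows. The key layout, version strings, and model names are hypothetical; in production the version pin would come from a Consul or etcd lookup rather than a local dict:

```python
# Sketch of version-prefixed feature keys: each model pins the feature
# version it was trained against, so incompatible transformations of the
# same raw data coexist without training-serving skew.

feature_store = {
    "v1:user:42:spend_norm": 0.80,   # e.g., z-score normalization
    "v2:user:42:spend_norm": 0.63,   # e.g., min-max normalization, incompatible
}

model_config = {"fraud-model-a": "v1", "fraud-model-b": "v2"}

def fetch_feature(model: str, entity: str, feature: str) -> float:
    version = model_config[model]    # production: resolved via Consul/etcd
    return feature_store[f"{version}:{entity}:{feature}"]

a = fetch_feature("fraud-model-a", "user:42", "spend_norm")
b = fetch_feature("fraud-model-b", "user:42", "spend_norm")
```

Because each experiment cohort resolves its own prefix, the A/B split never mixes feature computations, and pre-warming the `v2:` prefix from Kafka replay avoids cold starts at promotion time.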