System Architecture

Draft the architecture for a globally distributed, privacy-preserving observability pipeline that ingests petabyte-scale distributed traces from thousands of microservices across multiple tenants, enforces field-level encryption for sensitive attributes before data leaves the service boundary, maintains sub-second query latency for complex trace aggregations, and implements real-time anomaly detection on encrypted telemetry without decrypting sensitive fields at the aggregation layer.


Answer to the question

The architecture centers on a zero-trust telemetry pipeline in which OpenTelemetry agents, deployed as sidecars, capture traces at the service level. The agents apply field-level encryption with tenant-specific keys from HashiCorp Vault before transmission, ensuring that sensitive personally identifiable information (PII) never traverses the network in plaintext. Regional Apache Kafka clusters act as encrypted buffers, feeding stream processors (Apache Flink) that perform privacy-preserving analytics via homomorphic encryption or tokenization. A federated query layer built on ClickHouse or Apache Pinot maintains separate logical shards per tenant on shared infrastructure, enabling sub-second lookups through intelligent indexing and predicate pushdown. Anomaly detection operates on aggregated, differentially private metrics rather than raw spans, using Apache Spark for batch pattern recognition without centralizing decrypted sensitive data.
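As a minimal sketch of the collector-side step, assuming the tenant key has already been leased from Vault: the attribute names are hypothetical, and the HMAC-derived XOR keystream is a stdlib stand-in for AES-256-GCM, not production cryptography.

```python
import hashlib
import hmac
import os

SENSITIVE_KEYS = {"patient_id", "payload", "diagnosis"}  # hypothetical field names

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy CTR-style keystream from HMAC-SHA256; a real deployment would use AES-256-GCM.
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encrypt_field(key: bytes, value: str) -> dict:
    nonce = os.urandom(12)
    plaintext = value.encode()
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    return {"nonce": nonce.hex(), "ct": ciphertext.hex()}

def decrypt_field(key: bytes, blob: dict) -> str:
    nonce, ciphertext = bytes.fromhex(blob["nonce"]), bytes.fromhex(blob["ct"])
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, nonce, len(ciphertext)))).decode()

def scrub_span(span_attrs: dict, tenant_key: bytes) -> dict:
    # Encrypt only the sensitive attributes; timestamps, service names,
    # durations and other metadata stay plaintext for indexing.
    return {k: (encrypt_field(tenant_key, v) if k in SENSITIVE_KEYS else v)
            for k, v in span_attrs.items()}
```

The key point the sketch illustrates is the split: everything needed for indexing and aggregation leaves the sidecar readable, while PII is opaque from the moment it crosses the service boundary.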

A real-world situation

A global telehealth platform serving ten million patients daily faced a critical compliance gap. Their existing Jaeger tracing infrastructure captured full request payloads including medical records and PHI. This violated HIPAA and GDPR requirements while creating a massive security liability for the organization.

Solution A: Per-Tenant Isolated Observability Stacks

Each healthcare provider client would receive dedicated Kubernetes clusters running isolated Prometheus and Jaeger instances with separate storage backends. This approach guaranteed complete data segregation and simplified compliance audits. However, operational overhead proved prohibitive—managing 500+ separate clusters required a team of thirty engineers, and cross-tenant performance comparisons became impossible. Capital expenditure increased by 400% due to duplicated infrastructure and stranded capacity.

Solution B: Centralized Plaintext Aggregation with Role-Based Access Control

This approach consolidated telemetry into a single, massive Elasticsearch cluster with field-level RBAC and data masking applied at query time. It reduced infrastructure costs significantly and provided unified querying. The fatal flaw emerged during security audits: the aggregation layer held decrypted PHI in memory and storage, creating a high-value attack target. Any compromise of the Elasticsearch cluster or of privileged credentials would expose millions of records, failing both zero-trust requirements and regulatory standards.

Solution C: Zero-Trust Field-Level Encryption with Federated Query Plane

OpenTelemetry collectors deployed as sidecars encrypt sensitive fields with deterministic AES-256 encryption under tenant-scoped keys before emission. Non-sensitive trace metadata (timestamps, service names, durations) remains plaintext for indexing, while payloads and tags containing PHI stay encrypted. A custom query proxy intercepts requests, routes them to regional ClickHouse clusters, and orchestrates decryption only at the edge, within the requesting service's memory space, using temporary key leases from Vault. Anomaly detection uses Flink to analyze patterns in metadata and encrypted feature vectors without decryption.
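The equality-search property that makes deterministic encryption queryable can be sketched as follows; the `patient_tok` column, table name, and keyed-hash tokenization (standing in for deterministic AES) are illustrative assumptions, not the platform's actual schema:

```python
import hashlib
import hmac

def det_token(key: bytes, field: str, value: str) -> str:
    # Deterministic keyed hash: the same plaintext under the same tenant key
    # always yields the same token, so equality filters work on ciphertext.
    return hmac.new(key, f"{field}\x00{value}".encode(), hashlib.sha256).hexdigest()

def build_query(key: bytes, patient_id: str) -> tuple:
    # Hypothetical proxy step: rewrite a plaintext predicate into its tokenized
    # form before forwarding to the regional ClickHouse shard. The backend never
    # sees the raw identifier, only its deterministic token.
    sql = "SELECT trace_id, duration_ms FROM spans WHERE patient_tok = %(tok)s"
    return sql, {"tok": det_token(key, "patient_id", patient_id)}
```

Determinism is the trade-off: equality lookups work without decryption, but identical plaintexts are linkable, which is why the design reserves it for fields that genuinely need exact-match search.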

Chosen Solution and Result

The team selected Solution C after a six-month proof of concept. The architecture achieved a 99th-percentile latency of 650 ms for complex trace lookups, well within the sub-second requirement. The platform passed HIPAA and GDPR audits with zero critical findings on telemetry handling. Operational costs came in 60% lower than Solution A, while the blast radius of any potential breach remained confined to individual service instances rather than the entire dataset. In its first month, the anomaly detection system identified three critical performance regressions in production without exposing patient data to the platform engineering team.

What candidates often miss

Question 1: How do you handle key rotation for field-level encrypted telemetry without losing the ability to query historical traces that were encrypted with previous key versions?

Candidates often propose decrypting and re-encrypting the entire dataset during rotation, which is computationally prohibitive at petabyte scale. The correct approach is a key hierarchy using envelope encryption, where data encryption keys (DEKs) encrypt the telemetry fields and key encryption keys (KEKs) protect the DEKs. Store the DEK ID as unencrypted metadata alongside each span. During rotation, only re-encrypt the DEKs with the new KEK, keeping historical DEKs accessible but protected by the new master key. For the deterministic encryption used in querying (to enable equality searches on encrypted fields like patient_id), use a synthetic initialization vector (SIV) derived from a hash of the plaintext, allowing consistent ciphertext generation across key rotations for specific fields while maintaining semantic security through key versioning.
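A toy model of the DEK/KEK hierarchy makes the cost argument concrete: rotation touches only the small DEK table, never the petabytes of ciphertext. The XOR "wrap" below stands in for AES-KW or a KMS wrap call, and all names are hypothetical:

```python
import hashlib
import hmac
import os

def _wrap(kek: bytes, dek: bytes) -> bytes:
    # Toy wrap: XOR against an HMAC-derived pad (its own inverse).
    # A real system would use AES-KW or a KMS wrap/unwrap API.
    pad = hmac.new(kek, b"wrap", hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(dek, pad))

class KeyHierarchy:
    def __init__(self, kek: bytes):
        self.kek_version = 1
        self._kek = kek
        self.wrapped_deks = {}  # dek_id -> DEK wrapped under the current KEK

    def new_dek(self) -> str:
        dek = os.urandom(32)
        dek_id = f"dek-{len(self.wrapped_deks) + 1}"  # stored as span metadata
        self.wrapped_deks[dek_id] = _wrap(self._kek, dek)
        return dek_id

    def unwrap(self, dek_id: str) -> bytes:
        return _wrap(self._kek, self.wrapped_deks[dek_id])

    def rotate_kek(self, new_kek: bytes) -> None:
        # Re-wrap every historical DEK under the new KEK; the field-level
        # ciphertext encrypted under those DEKs is never touched.
        self.wrapped_deks = {dek_id: _wrap(new_kek, _wrap(self._kek, wrapped))
                             for dek_id, wrapped in self.wrapped_deks.items()}
        self._kek = new_kek
        self.kek_version += 1
```

After `rotate_kek`, every historical DEK still unwraps to the same bytes, so spans tagged with old DEK IDs remain queryable.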

Question 2: How do you prevent cardinality explosion in high-cardinality fields (such as user IDs or session tokens) within the observability backend while maintaining the ability to debug specific user journeys?

Many candidates suggest simply blocking high-cardinality fields entirely, which destroys debugging capability. The sophisticated solution employs Tokenization combined with Bloom Filters. High-cardinality identifiers get replaced with deterministic tokens at the collector level, while a separate, highly restricted sidecar maintains a mapping of hash(token) -> user_id for the last 24 hours only. For historical queries, engineers submit requests through a privacy gateway that validates business justification and temporarily rehydrates the specific token-to-user mapping for that query session. In the storage layer (ClickHouse), utilize LowCardinality data types for service names and operations, while storing tokens in sparse secondary indexes rather than primary sorting keys. This approach keeps the index size manageable (preventing the "too many parts" error in ClickHouse) while preserving the ability to reconstruct specific user traces when necessary through audited, time-bounded rehydration workflows.
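The tokenization and time-bounded rehydration flow can be sketched as follows; the in-memory mapping stands in for the restricted sidecar, the `justification` check for the privacy gateway, and the key and TTL values are illustrative assumptions:

```python
import hashlib
import hmac
import time

TOKEN_KEY = b"tenant-token-key-0123456789abcdef"  # hypothetical tenant-scoped key
TTL_SECONDS = 24 * 3600  # mappings age out after 24 hours by design

_mapping = {}  # token -> (user_id, inserted_at); restricted-sidecar stand-in

def tokenize(user_id: str) -> str:
    # Deterministic token replaces the raw high-cardinality ID in every span,
    # so the same user always maps to the same token within a tenant.
    token = hmac.new(TOKEN_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]
    _mapping[token] = (user_id, time.time())
    return token

def rehydrate(token: str, justification: str):
    # Privacy-gateway stand-in: audited, time-bounded lookups only.
    assert justification, "business justification required"
    entry = _mapping.get(token)
    if entry is None or time.time() - entry[1] > TTL_SECONDS:
        return None  # mapping aged out; the raw ID is unrecoverable by design
    return entry[0]
```

Because the token is deterministic, an engineer can still filter a trace query on one user's journey; only the final token-to-identity step requires the audited gateway.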

Question 3: How do you implement differential privacy in real-time anomaly detection without destroying the statistical utility required for detecting micro-latency regressions?

Beginners often apply global noise addition uniformly, which either masks real anomalies (low epsilon, heavy noise) or leaks privacy (high epsilon, light noise). The architectural solution requires a two-tiered aggregation strategy. First, use Local Differential Privacy (LDP) at the OpenTelemetry agent level, where each service adds calibrated Laplace noise to its own histogram buckets before transmission. This protects individual traces while preserving aggregate distributions. Second, implement Secure Multi-Party Computation (SMPC) within the Flink cluster, where regional aggregators compute global statistics on encrypted counters without learning individual contributions. For latency detection specifically, employ the Sparse Vector Technique (SVT), which only expends privacy budget when anomalies exceed adaptive thresholds, rather than on every measurement. Configure epsilon budget splitting using a privacy-accounting library such as Google's Privacy on Beam, allocating 70% of the budget to rare critical alerts and 30% to routine health checks. This maintains sufficient signal-to-noise ratio to detect 5 ms latency shifts while guaranteeing mathematical privacy bounds for individual user activities.
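The two techniques can be sketched in a few lines; the noise scales and per-bucket sensitivity of 1 are illustrative assumptions, and a real deployment would use a vetted DP library rather than hand-rolled sampling:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling of Laplace(0, scale); a stdlib stand-in for a
    # vetted DP library's noise primitive.
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return scale * math.log(2.0 * u) if u < 0.5 else -scale * math.log(2.0 * (1.0 - u))

def noisy_histogram(buckets, epsilon: float, rng: random.Random):
    # Local DP at the agent: each service perturbs its own latency-histogram
    # counts (sensitivity 1 per bucket in this toy model) before transmission.
    return [count + laplace_noise(1.0 / epsilon, rng) for count in buckets]

def sparse_vector(stream, threshold: float, epsilon: float, rng: random.Random,
                  max_alerts: int = 1):
    # Sparse Vector Technique: budget is effectively spent only when a noisy
    # measurement clears a noisy threshold, not on every comparison.
    noisy_threshold = threshold + laplace_noise(2.0 / epsilon, rng)
    alerts = []
    for i, value in enumerate(stream):
        if len(alerts) >= max_alerts:
            break
        if value + laplace_noise(4.0 / epsilon, rng) >= noisy_threshold:
            alerts.append(i)
    return alerts
```

The agent-side `noisy_histogram` call implements the LDP tier, while `sparse_vector` shows why SVT preserves signal: a clear regression still crosses the perturbed threshold, but quiet measurements never consume budget.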