System Architecture

How would you orchestrate a planetary-scale, real-time, multi-tenant analytics ingestion pipeline that collapses high-velocity telemetry from heterogeneous observability agents into a unified lakehouse architecture, while ensuring sub-second query latency for ad-hoc SQL over petabytes of compressed columnar data, maintaining strict tenant isolation through row-level security policies, and implementing automated data-retention tiering based on query-pattern heatmaps, all without centralized catalog bottlenecks?

Answer to the question

History of the question

The evolution of observability platforms has shifted from siloed data warehouses and expensive proprietary indices to unified lakehouse architectures that combine the flexibility of data lakes with the performance of warehouses. Early SaaS observability providers relied on Elasticsearch or Splunk clusters, which faced exponential cost curves at petabyte scale and struggled with true multi-tenant isolation. The emergence of open table formats like Apache Iceberg and Delta Lake enabled atomic transactions and time travel on object storage, while query engines like Trino matured to provide interactive SQL over cloud storage. This convergence created the possibility of serving thousands of tenants from a single shared infrastructure, but introduced novel challenges in maintaining sub-second latency while enforcing strict security boundaries and optimizing storage costs through intelligent tiering.

The problem

The core challenge lies in reconciling conflicting requirements: ingesting millions of events per second from diverse agents (Fluentd, Prometheus, OpenTelemetry) while providing interactive query performance over petabytes of historical data. Traditional shared-nothing databases collapse under cross-tenant query noise, while per-tenant silos create prohibitive operational overhead. Strict isolation mandates that tenant A’s queries cannot physically scan tenant B’s data, yet row-level security filters often introduce performance cliffs. Furthermore, storing all data on hot SSDs is economically infeasible, but moving cold data to Amazon S3 Glacier risks violating the sub-second SLA when archived data is suddenly queried. The catalog service, which tracks partitions and schema evolution, must remain decentralized to avoid becoming a single point of failure or a throughput bottleneck during high-velocity ingestion.

The solution

Architect a tiered lakehouse using Apache Iceberg as the table format sitting atop Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. Ingest streams via Apache Kafka or Amazon Kinesis, processing through Apache Flink or Spark Streaming to land data in the appropriate tier: hot (local NVMe SSD on query nodes), warm (S3 Standard), or cold (S3 Glacier Instant Retrieval). Deploy Trino or Presto as the distributed query engine, configured with Apache Ranger or AWS Lake Formation for row-level and column-level security policies that enforce tenant boundaries at the scan level. Implement a federated catalog using Hive Metastore federation or AWS Glue with regional replicas to avoid centralized bottlenecks. Automated tiering is driven by an ML-based heatmap analyzer that monitors query logs, promoting frequently accessed cold data back to warm storage and demoting stale hot data, while maintaining metadata pointers in Iceberg to ensure query transparency across tiers.
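The heatmap-driven tier assignment can be sketched as follows. This is a toy policy, not a production algorithm: the thresholds, tier names, and the `PartitionStats` fields are illustrative assumptions layered on top of the hot/warm/cold split described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical tier names mirroring the architecture above:
# hot = local NVMe, warm = S3 Standard, cold = S3 Glacier Instant Retrieval.
TIERS = ("hot", "warm", "cold")

@dataclass
class PartitionStats:
    partition: str          # e.g. "tenant=42/day=2024-06-01"
    reads_last_24h: int     # aggregated from query-log heatmaps
    last_access: datetime

def choose_tier(stats: PartitionStats, now: datetime) -> str:
    """Toy tiering policy: thresholds are illustrative, not tuned values."""
    age = now - stats.last_access
    if stats.reads_last_24h >= 100 or age <= timedelta(hours=72):
        return "hot"        # heavily read or recent: keep on NVMe
    if stats.reads_last_24h >= 1 or age <= timedelta(days=30):
        return "warm"       # occasionally read: S3 Standard
    return "cold"           # stale: demote to Glacier
```

In a real deployment this decision would run periodically over aggregated query logs, and the resulting moves would be applied as S3 storage-class transitions with the Iceberg metadata pointers updated so queries stay tier-transparent.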

Situation from life

Detailed example:

NebulaObservability, a SaaS provider serving 12,000 enterprise customers, needed to replace its aging Elasticsearch cluster, which cost $2M/month at 8PB of storage. Each customer generates 2-10TB/day of logs and metrics, requiring ad-hoc SQL analysis with sub-second dashboard load times. Regulatory requirements mandate strict isolation, such that Customer A cannot infer Customer B’s data existence through timing attacks or query errors. Data retention is mandated at 13 months, but 95% of queries hit only the last 72 hours. The previous architecture suffered from "noisy neighbor" issues, where one customer’s large aggregation query would degrade performance for others.

Solution 1: Sharded ClickHouse Clusters

Deploying massive ClickHouse clusters with tenant-based sharding was considered. The pros included exceptional single-query performance and mature SQL support with vectorized execution. However, the cons were severe: operational complexity of managing petabyte-scale clusters, the difficulty of enforcing row-level security without performance degradation, and the inability to independently scale storage versus compute. Additionally, resharding ClickHouse clusters during tenant onboarding required hours of downtime and manual intervention.

Solution 2: Per-Tenant PostgreSQL with TimescaleDB

Provisioning isolated PostgreSQL instances with TimescaleDB extensions for each tenant offered perfect security isolation and simple backup strategies. The pros were straightforward: native row-level security, easy tenant deletion for GDPR compliance, and no cross-tenant interference. The cons were disqualifying: the operational nightmare of managing 12,000 database instances, endless patching cycles, and connection pool exhaustion. Storage costs would explode due to the lack of compression relative to columnar formats, and cross-tenant analytics for the provider’s own usage insights became infeasible.

Solution 3: Federated Lakehouse with Tiered Storage

Implementing the Apache Iceberg-based lakehouse with Trino and automated tiering provided the optimal balance. The pros included shared infrastructure economies of scale, Iceberg’s hidden partitioning preventing user errors, and S3’s infinite scalability. Row-level security via Apache Ranger allowed fine-grained policies without modifying queries. The automated tiering reduced storage costs by 70% by moving cold data to S3 Glacier while keeping metadata hot. The cons involved significant tuning complexity: query planning required careful partition pruning, and the tiering algorithm needed training data to avoid thrashing.
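The "policies without modifying queries" idea can be illustrated with a minimal sketch of scan-level row filtering. The policy format, table names, and the `secure_scan_sql` helper are all hypothetical; a real deployment would rely on Apache Ranger or Lake Formation enforcing the filter inside the engine, not on string templating.

```python
# Hypothetical row-filter policies in the spirit of Apache Ranger: every scan
# of a governed table is wrapped with a predicate the tenant cannot remove.
POLICIES = {
    "logs": "tenant_id = '{tenant}'",
    "metrics": "tenant_id = '{tenant}'",
}

def secure_scan_sql(table: str, user_where: str, tenant: str) -> str:
    """Compose the user's WHERE clause with the mandatory tenant filter."""
    if table not in POLICIES:
        # Fail closed: an unknown table raises the same error as a missing
        # one, so tenants cannot probe for other tenants' tables.
        raise LookupError(f"table not found: {table}")
    rls = POLICIES[table].format(tenant=tenant)
    where = f"({rls}) AND ({user_where})" if user_where else rls
    return f"SELECT * FROM {table} WHERE {where}"
```

The key property is that the tenant predicate is ANDed in below the user's control, so even a query with no WHERE clause only ever scans that tenant's rows.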

Chosen solution and why:

Solution 3 was selected because it uniquely satisfied the planetary scale requirement while maintaining strict isolation. The Iceberg format’s ability to atomically update table metadata allowed schema evolution without locking, critical for zero-downtime deployments. Trino’s connector architecture enabled pushing down predicates to S3, reducing scanned data. The automated tiering, using AWS Lambda functions triggered by Athena query logs, ensured cost optimization without manual intervention. This approach decoupled storage from compute, allowing independent scaling during traffic spikes.

Result:

The system achieved 650ms p99 query latency across 12PB of active data, supporting 50,000 concurrent queries during peak hours. Storage costs dropped by 68% compared to the previous Elasticsearch architecture, saving $1.36M monthly. The automated tiering correctly predicted 94% of data access patterns, with "cache misses" to cold storage occurring only 0.3% of the time. Zero security incidents related to cross-tenant data leakage were recorded during the first 18 months of operation, validated through quarterly penetration testing. Onboarding a new tenant became a purely metadata operation taking under 30 seconds.

What candidates often miss

How do you prevent query latency explosions when the automated tiering algorithm incorrectly demotes "warm" data that is suddenly accessed by a scheduled batch report?

Candidates often suggest reactive caching without considering the prediction mechanism. The detailed answer requires implementing a predictive tiering system using exponential smoothing over query access logs, and maintaining a "lukewarm" intermediate tier on S3 Standard-IA, which offers millisecond first-byte latency, before demotion to Glacier. Additionally, deploying Alluxio as a distributed caching layer between Trino and S3 absorbs unexpected access spikes. The critical detail is implementing "promotion on read": when cold data is accessed, the system asynchronously copies it back to the warm tier while serving the current request from S3 Glacier Instant Retrieval, ensuring that subsequent queries hit faster storage.
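The promotion-on-read pattern can be sketched with an in-memory toy model. The two dicts stand in for S3 storage classes and the promotion is done synchronously here for determinism; in production the copy would run asynchronously and finish with an Iceberg metadata update, as described above.

```python
class TieredStore:
    """Toy two-tier store: 'warm' stands in for S3 Standard, 'cold' for
    S3 Glacier Instant Retrieval. Names and structure are illustrative."""

    def __init__(self) -> None:
        self.warm: dict[str, bytes] = {}
        self.cold: dict[str, bytes] = {}
        self.cold_reads = 0  # observability: how often the slow path is hit

    def read(self, key: str) -> bytes:
        if key in self.warm:          # fast path: warm tier
            return self.warm[key]
        self.cold_reads += 1
        data = self.cold[key]         # slow path: serve from cold storage
        # Promotion on read: copy back to the warm tier so subsequent
        # queries hit faster storage. In production this would be an
        # asynchronous S3 copy followed by a metadata pointer update.
        self.warm[key] = data
        return data
```

The current request is still served from the cold tier (no extra latency waiting for the copy), but every later read of the same key takes the fast path.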

How do you maintain ACID consistency for schema evolution (adding columns) across thousands of tenant tables without a global transaction coordinator becoming a bottleneck?

Most candidates propose distributed locking, which violates the "no centralized bottlenecks" requirement. The correct approach leverages Apache Iceberg’s optimistic concurrency control and metadata layering. Each tenant table has an independent metadata.json file lineage. Schema changes append a new metadata file with an incremented sequence number; the catalog (e.g., AWS Glue) only stores the pointer to the current metadata file. During commit, the writer checks if the pointer has changed (conflict) and retries if necessary. This eliminates the need for global locks because tenant tables are independent namespaces. For rare cross-tenant schema updates (e.g., adding a universal column), use a saga pattern with idempotent DDL operations rather than atomic transactions.
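The pointer-swap commit protocol can be approximated with a toy model. This is not Iceberg's actual API: `Catalog`, `swap`, and `commit` are invented names that capture only the compare-and-swap shape of the real protocol, where the catalog holds one pointer per table and a commit succeeds only if the base metadata is unchanged.

```python
from typing import Callable, Optional

class Catalog:
    """Toy stand-in for a catalog (e.g. AWS Glue) that stores only the
    pointer to each table's current metadata file."""

    def __init__(self) -> None:
        self.pointer: dict[str, Optional[str]] = {}

    def swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Compare-and-swap: succeed only if no one committed in between."""
        if self.pointer.get(table) != expected:
            return False              # concurrent commit won -> conflict
        self.pointer[table] = new
        return True

def commit(catalog: Catalog, table: str,
           write_metadata: Callable[[Optional[str]], str],
           retries: int = 5) -> str:
    """Optimistic commit: write a new metadata file, then try to swing the
    pointer; on conflict, re-read the base and retry."""
    for _ in range(retries):
        current = catalog.pointer.get(table)
        new = write_metadata(current)  # writes the next metadata.json
        if catalog.swap(table, current, new):
            return new
    raise RuntimeError("commit failed after retries")
```

Because each tenant table has its own pointer, commits to different tables never contend, which is exactly why no global coordinator is needed.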

How do you architect the row-level security layer to prevent a "super-tenant" from performing a full table scan that starves CPU resources for other tenants, violating the sub-second SLA?

Candidates frequently miss resource governance mechanisms. The solution involves hierarchical resource isolation using Trino’s resource groups with hard CPU limits and memory quotas per tenant class (premium vs. standard). Implement admission control that estimates query cost using the Trino cost-based optimizer; queries exceeding tenant-specific thresholds are queued or rejected rather than executed. Use Kubernetes resource quotas to isolate query engine pods into tenant-specific node pools, preventing CPU starvation. Finally, implement query killing policies for long-running scans that exceed predicted costs, coupled with materialized views for common expensive aggregations, ensuring that even malicious or accidental full scans cannot impact other tenants’ latency.
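The admission-control step can be sketched as a small decision function. The limits, tenant classes, and the run/queue/reject thresholds are invented for illustration; in a real system the estimate would come from the Trino cost-based optimizer and the limits from resource-group configuration.

```python
# Illustrative per-class budgets (max estimated rows scanned per query).
LIMITS = {"premium": 1_000_000, "standard": 100_000}

def admit(estimated_rows: int, tenant_class: str,
          queue_depth: int, max_queue: int = 10) -> str:
    """Return 'run', 'queue', or 'reject' for an incoming query.

    estimated_rows would come from the cost-based optimizer; queries far
    over budget fail fast instead of starving other tenants' CPU."""
    limit = LIMITS[tenant_class]
    if estimated_rows <= limit:
        return "run"                  # within budget: execute immediately
    if estimated_rows <= 10 * limit and queue_depth < max_queue:
        return "queue"                # over budget but tolerable: wait
    return "reject"                   # far over budget or queue full
```

Rejecting before execution is the key property: a full-table scan by a "super-tenant" is stopped at planning time, so it never consumes the cluster CPU that the sub-second SLA depends on.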