System Architecture

Construct a planetary-scale, real-time data lineage and impact analysis substrate that captures field-level provenance across heterogeneous streaming and batch pipelines within a federated Data Mesh, orchestrates automated GDPR right-to-erasure propagation without full table scans, and predicts cross-domain schema change impacts while maintaining sub-second query latency over petabytes of metadata history.


Answer to the question

The architecture centers on a Hybrid Metadata Collection Layer that instruments data pipelines without modifying application code. Change Data Capture (CDC) agents intercept Apache Kafka topic schemas, Apache Spark execution plans, and JDBC query logs from legacy Oracle databases, emitting structured lineage events to a regional Apache Pulsar bus.
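The lineage events emitted by these agents need a stable envelope that the downstream Flink tier can parse. The sketch below shows one plausible shape for such an event; the field names and JSON wire format are assumptions for illustration, not a published spec, and a real agent would publish the encoded bytes to a regional Pulsar topic (e.g. via the pulsar-client library).

```python
import json
from dataclasses import dataclass, asdict

# Illustrative lineage-event envelope; field names are assumptions.
@dataclass
class LineageEvent:
    event_type: str       # e.g. "SCHEMA_CAPTURE", "TRANSFORM", "RENAME"
    source_dataset: str   # fully qualified physical name
    target_dataset: str
    columns: dict         # target column -> list of source columns
    captured_at_ms: int   # capture timestamp, epoch millis

def encode(event: LineageEvent) -> bytes:
    """Serialize to the JSON wire format the stream tier would parse."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")

evt = LineageEvent(
    event_type="TRANSFORM",
    source_dataset="oracle.crm.customers",
    target_dataset="kafka.topic.customers_enriched",
    columns={"email_hash": ["email"], "region": ["country_code"]},
    captured_at_ms=1700000000000,
)
payload = encode(evt)
```

Keeping the envelope schema-first (and versioned in a schema registry) is what lets heterogeneous agents — Debezium hooks, Spark listeners, JDBC log parsers — feed one parser.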

A Stream Processing tier using Apache Flink parses these events to construct a dynamic property graph in JanusGraph, where vertices represent datasets (tables, topics, files) and edges capture transformation logic with column-granularity cardinality. For GDPR automation, the system maintains an inverted index mapping PII signatures (e.g., email hashes, tokenized SSNs) to graph edges using Apache Lucene.
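The inverted index has a simple logical shape even though production backs it with Lucene: a PII signature maps to the set of lineage-edge identifiers it touches. A minimal dict-based sketch, with bare SHA-256 standing in for whatever salted tokenization scheme production uses:

```python
import hashlib
from collections import defaultdict

# Sketch of the PII-signature inverted index. Production would use
# Lucene; a plain dict shows the shape: signature -> set of edge IDs.
class PiiIndex:
    def __init__(self):
        self._index = defaultdict(set)

    @staticmethod
    def signature(pii_value: str) -> str:
        # Salted/tokenized in production; bare SHA-256 here for brevity.
        return hashlib.sha256(pii_value.lower().encode()).hexdigest()

    def add(self, pii_value: str, edge_id: str) -> None:
        self._index[self.signature(pii_value)].add(edge_id)

    def edges_for(self, pii_value: str) -> set:
        return self._index.get(self.signature(pii_value), set())

idx = PiiIndex()
idx.add("alice@example.com", "edge:crm->warehouse.customers")
idx.add("alice@example.com", "edge:warehouse->s3.exports")
```

Note that only signatures are stored, never raw PII values — the lineage store itself must not become a GDPR liability.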

When a deletion request arrives, a Saga Orchestrator traverses the graph to identify impacted datasets, generates compensating Delta Lake vacuum commands and Kafka tombstone events, and executes them via Apache Airflow workflows with exactly-once semantics. Schema Impact Prediction leverages Graph Neural Networks (GNNs) trained on historical lineage patterns to simulate the blast radius of proposed Avro schema modifications, querying the graph via Gremlin with aggressive Redis caching for sub-second latency.
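The saga's first two steps — finding the blast radius and generating compensating commands — can be sketched as a plain breadth-first traversal. The adjacency map below stands in for JanusGraph, and the dataset kinds and command strings are illustrative assumptions, not the real orchestrator's API:

```python
from collections import deque

# Minimal sketch of the deletion saga's graph traversal.
def blast_radius(graph: dict, start: str) -> list:
    """BFS downstream from the dataset holding the subject's PII."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def compensating_commands(datasets: list, kinds: dict) -> list:
    """Map each impacted dataset to a store-appropriate deletion action."""
    cmds = []
    for ds in datasets:
        if kinds[ds] == "delta":
            cmds.append(f"DELETE + VACUUM {ds}")  # Delta Lake table
        elif kinds[ds] == "kafka":
            cmds.append(f"TOMBSTONE {ds}")        # null-value record by key
    return cmds

graph = {"crm.customers": ["warehouse.customers", "kafka.pii_events"],
         "warehouse.customers": ["s3.exports"]}
kinds = {"crm.customers": "delta", "warehouse.customers": "delta",
         "kafka.pii_events": "kafka", "s3.exports": "delta"}
impacted = blast_radius(graph, "crm.customers")
cmds = compensating_commands(impacted, kinds)
```

In the real system each command would be an idempotent Airflow task, so the saga can retry failed steps without double-applying deletions.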

Situation from life

A multinational financial institution operating in EU, APAC, and US regions struggled with GDPR Article 17 compliance during its Data Mesh migration. Customer PII propagated through 500+ microservices, legacy mainframe extracts, and Snowflake analytics warehouses.

When a customer requested data deletion, manual audits required three weeks of SQL tracing across domains, often missing derived datasets in S3 data lakes. Simultaneously, schema changes in the Payments domain frequently broke Fraud Detection dashboards in the Analytics domain, causing six production incidents in one quarter.

Option A proposed a centralized Apache Hive Metastore with nightly Spark batch scans of all table schemas. This offered simplicity and strong consistency but introduced 24-hour staleness, violating GDPR's "without undue delay" requirement and failing to capture streaming transformations in Apache Flink jobs.

Option B suggested deploying eBPF kernel probes on all Kubernetes nodes to capture raw TCP payloads for deep packet inspection. While this provided real-time accuracy, it created severe privacy risks by potentially logging sensitive PII in the lineage store, incurred 40% CPU overhead, and violated data minimization principles.

Option C, which was ultimately selected, implemented Log-CDC agents that hook into Debezium connectors for databases and Kafka Interceptors for streaming pipelines. This captured only schema metadata and transformation logic without inspecting row values, achieving sub-minute lineage propagation while maintaining zero application code changes. Post-deployment, GDPR deletion latency dropped to under 5 minutes, schema change impact analysis became proactive with 85% prediction accuracy, and the bank passed its SOC 2 audit with zero findings regarding data provenance.

What candidates often miss

How do you handle lineage tracking for non-deterministic transformations, such as User-Defined Functions (UDFs) in Spark or Python transformations that dynamically alter column schemas based on external API calls?

Most candidates assume all transformations are static and declarative. In reality, UDFs are black boxes. The solution requires static analysis of Python bytecode or Scala abstract syntax trees (ASTs) during the CI/CD pipeline to extract column references before deployment.
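For Python UDFs, the CI-time analysis can be as simple as walking the function's AST and collecting the literal column keys it subscripts. A sketch using the standard-library ast module (only literal keys like row["email"] are caught; dynamically computed keys fall through to the sampling approach described next):

```python
import ast

# Sketch of CI-time static analysis: collect the string literals a UDF
# uses as subscripts (row["col"]). Dynamic keys are not detectable here.
def referenced_columns(udf_source: str) -> set:
    cols = set()
    for node in ast.walk(ast.parse(udf_source)):
        if isinstance(node, ast.Subscript):
            key = node.slice
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                cols.add(key.value)
    return cols

udf = '''
def mask(row):
    domain = row["email"].split("@")[1]
    return {"email_domain": domain, "age_band": row["age"] // 10}
'''
```

Here referenced_columns(udf) yields {"email", "age"}, which the CI pipeline would record as the candidate input columns for this transformation's lineage edges.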

For truly dynamic schemas (e.g., JSON blob parsing with varying keys), the system must implement Schema Inference Sampling, where the lineage collector samples a subset of records to probabilistically map potential output fields to input fields, tagging these edges with confidence scores in the graph.
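A minimal sketch of Schema Inference Sampling, assuming the sampled records are already parsed JSON objects: each observed key gets a confidence score equal to its occurrence ratio in the sample. The sample size and the record shapes below are illustrative.

```python
import random

# Sketch of Schema Inference Sampling: sample records from a dynamic
# JSON source and score each observed key by its occurrence ratio.
def infer_fields(records: list, sample_size: int = 100, seed: int = 0) -> dict:
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    counts = {}
    for rec in sample:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1
    return {key: count / len(sample) for key, count in counts.items()}

records = ([{"user_id": 1, "email": "a@x"}] * 90
           + [{"user_id": 2, "promo": "y"}] * 10)
scores = infer_fields(records)
```

A lineage edge tagged with confidence 0.1 signals to downstream consumers (and to the GDPR saga) that the field is rare but real, rather than being silently dropped from the graph.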

Additionally, Runtime Schema Registry checks using Confluent Schema Registry can validate actual output schemas against inferred lineage, flagging drift when UDFs change behavior unexpectedly.
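The drift check itself reduces to a set comparison between the fields the inferred lineage claims and the fields actually registered for the output. The registry lookup is mocked here with plain sets; in production the registered schema would be fetched from Confluent Schema Registry's REST API:

```python
# Sketch of the runtime drift check between inferred lineage and the
# schema actually registered for the output topic (registry mocked).
def schema_drift(inferred_fields: set, registered_fields: set) -> dict:
    return {
        # Fields the UDF now emits that lineage never predicted:
        "unexpected": sorted(registered_fields - inferred_fields),
        # Fields lineage predicted that the UDF no longer emits:
        "missing": sorted(inferred_fields - registered_fields),
    }

drift = schema_drift({"email_domain", "age_band"},
                     {"email_domain", "age_band", "marketing_opt_in"})
```

Any non-empty "unexpected" list is the signal that a UDF changed behavior and its lineage edges need re-derivation.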

How do you reconcile lineage accuracy when stream processing jobs handle late-arriving data with event-time watermarks that cause retroactive updates to windowed aggregations?

Candidates often model lineage as immutable DAGs, but Apache Flink and Kafka Streams allow window recalculation. The architecture must implement Temporal Versioning on graph edges, where each lineage relationship is timestamped with the event-time watermark and processing-time version.

When late data triggers a recalculation, the system creates a new temporal edge while preserving the historical one, using Valid-From/Valid-To timestamps. The Gremlin queries must then default to the latest temporal slice but support historical audits.
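The temporal versioning scheme can be sketched with tuples standing in for JanusGraph edge properties: a recalculation closes the currently valid edge (sets its Valid-To) and appends a successor, so both the latest slice and any historical slice remain queryable. The tuple layout is an assumption for illustration.

```python
# Sketch of temporal edge versioning with Valid-From/Valid-To bounds.
INF = float("inf")  # open-ended Valid-To means "currently valid"

def recalculate(edges: list, key: str, new_props: dict, now: int) -> list:
    """Close the currently valid edge for `key` and append a successor."""
    out = []
    for (k, props, vf, vt) in edges:
        if k == key and vt == INF:
            out.append((k, props, vf, now))  # close the historical edge
        else:
            out.append((k, props, vf, vt))
    out.append((key, new_props, now, INF))   # new current edge
    return out

def current_slice(edges: list) -> list:
    """Default view: only edges whose Valid-To is still open."""
    return [(k, p) for (k, p, vf, vt) in edges if vt == INF]

def as_of(edges: list, t: int) -> list:
    """Historical audit view: edges valid at time t."""
    return [(k, p) for (k, p, vf, vt) in edges if vf <= t < vt]

edges = [("orders->daily_agg", {"version": 1}, 0, INF)]
edges = recalculate(edges, "orders->daily_agg", {"version": 2}, 10)
```

The same as-of predicate maps directly onto a Gremlin has()-filter over the edge's timestamp properties.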

Furthermore, the GDPR deletion saga must use Lookback Windows that account for these late arrivals, ensuring deletions propagate to reprocessed aggregates even if they occur hours after the initial window closed.
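The lookback logic amounts to a horizon check: a deletion must be scheduled for re-application against any windowed aggregate whose allowed-lateness horizon is still open when the request arrives, since late events could recompute the window and resurrect the subject's data. A sketch with illustrative millisecond timestamps:

```python
# Sketch of the saga's lookback-window check (timestamps in ms).
def needs_redelete(window_end_ms: int, allowed_lateness_ms: int,
                   deletion_ts_ms: int) -> bool:
    """True while late events can still recompute this window."""
    return deletion_ts_ms < window_end_ms + allowed_lateness_ms

def schedule_followups(windows: list, deletion_ts_ms: int) -> list:
    """Return (dataset, fire_at_ms) follow-up deletions for open windows."""
    return [(dataset, end + lateness)
            for (dataset, end, lateness) in windows
            if needs_redelete(end, lateness, deletion_ts_ms)]

windows = [("agg.daily_spend", 1_000_000, 600_000),
           ("agg.hourly_txn", 100_000, 60_000)]
followups = schedule_followups(windows, deletion_ts_ms=1_200_000)
```

Each follow-up fires only after the horizon closes, guaranteeing it sees the final recomputed aggregate.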

How do you maintain lineage graph consistency during blue-green deployments where physical table names or Kafka topic names change, but logical domain entities remain constant?

Candidates frequently conflate physical and logical identifiers. The solution requires a Logical Entity Resolution Layer using Persistent Identifiers (PIDs) assigned at the domain level via UUID generation during infrastructure provisioning.

When a blue-green swap occurs (e.g., table orders_v1 is replaced by orders_v2), the CDC agent emits a Rename Event to the lineage bus rather than creating a disconnected subgraph. The JanusGraph model must support Supernodes representing logical datasets with edges to physical incarnations tagged with deployment labels.

The Saga Orchestrator uses these logical pointers to ensure GDPR deletions follow the active physical incarnation while preserving historical lineage for the retired version, preventing orphaned metadata during rapid deployment cycles.
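The resolution layer described above can be sketched as a logical dataset carrying a stable PID and a history of physical incarnations, where a blue-green deploy retires the old incarnation instead of orphaning it. The class shape is an assumption standing in for the JanusGraph supernode model:

```python
import uuid

# Sketch of the Logical Entity Resolution Layer: one stable PID per
# logical dataset, with physical incarnations tracked as history.
class LogicalDataset:
    def __init__(self, domain: str, name: str):
        self.pid = str(uuid.uuid4())   # persistent identifier
        self.domain, self.name = domain, name
        self.incarnations = []         # [(physical_name, active)]

    def deploy(self, physical_name: str) -> None:
        """Blue-green swap: retire the current incarnation, keep history."""
        self.incarnations = [(p, False) for (p, _) in self.incarnations]
        self.incarnations.append((physical_name, True))

    def active(self) -> str:
        """The incarnation GDPR deletions should target right now."""
        return next(p for (p, a) in self.incarnations if a)

orders = LogicalDataset("payments", "orders")
orders.deploy("orders_v1")
orders.deploy("orders_v2")
```

After the swap, orders.active() resolves to orders_v2 for live deletions, while orders_v1 stays in the history list for lineage audits of the retired version.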