Manual Testing (IT): Senior Manual QA Engineer

How would you establish a comprehensive manual testing methodology to validate accurate state representation in a **Kubernetes** orchestration dashboard that uses **Server-Sent Events** (SSE) for real-time **Pod** lifecycle updates, specifically targeting **RollingUpdate** deployment verification under **maxSurge** constraints, **OOMKilled** event propagation, and graceful degradation during **etcd** quorum loss scenarios?


Answer to the question

Manual validation of a Kubernetes orchestration dashboard requires treating the UI as a distributed system observer rather than a simple visualization layer. The methodology begins with establishing a controlled cluster environment using Minikube or Kind, deploying a sample application with explicitly configured RollingUpdate strategies including varying maxSurge percentages and maxUnavailable thresholds. Testers must monitor Server-Sent Events (SSE) streams through browser DevTools, verifying that pod state transitions propagate within defined SLA timeframes while simultaneously validating that Prometheus metric scraping intervals align with dashboard refresh cycles.
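As a concrete starting point, the controlled environment described above might use a manifest along these lines; the names and image are illustrative sketches, not prescribed values, and the memory limit is deliberately low so OOMKilled events are easy to induce in the later test threads:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dashboard-target-app      # illustrative name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%               # at most one extra pod during rollout (4 replicas)
      maxUnavailable: 1
  selector:
    matchLabels:
      app: dashboard-target-app
  template:
    metadata:
      labels:
        app: dashboard-target-app
    spec:
      containers:
        - name: app
          image: nginx:1.25       # stand-in workload
          resources:
            limits:
              memory: "64Mi"      # low limit makes OOMKilled scenarios easy to trigger
```

Varying `maxSurge` and `maxUnavailable` across test runs then gives the dashboard a range of rollout shapes to render, from conservative one-at-a-time replacements to aggressive surges.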

The testing process involves three concurrent validation threads. First, manipulate deployment replicas through kubectl while observing dashboard synchronization latency. Second, artificially induce resource pressure to trigger OOMKilled scenarios using memory-limit stress containers. Third, simulate control plane degradation by network-partitioning etcd nodes to observe graceful error handling. Critical checkpoints include verifying that pods removed during a rollout pass through a visible Terminating state while their replacements progress through Pending → ContainerCreating → Running, confirming that HorizontalPodAutoscaler events generate distinct notification badges, and ensuring that dashboard session persistence survives API Server failovers through proper JWT token refresh mechanisms.
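The first checkpoint can be made mechanical rather than judged by eye: record the sequence of state badges the dashboard shows for each pod and assert that only legal transitions occur. A minimal sketch, where the state names and transition table are assumptions based on standard Pod lifecycle behavior:

```python
# Validate that an observed sequence of dashboard pod states uses only
# legal transitions: a replacement pod moves Pending -> ContainerCreating
# -> Running, while an old pod moves Running -> Terminating and never back.
ALLOWED = {
    "Pending": {"Pending", "ContainerCreating"},
    "ContainerCreating": {"ContainerCreating", "Running"},
    "Running": {"Running", "Terminating"},
    "Terminating": {"Terminating"},  # a terminating pod never comes back
}

def validate_transitions(observed: list[str]) -> list[str]:
    """Return every illegal transition found in the observed sequence."""
    errors = []
    for prev, cur in zip(observed, observed[1:]):
        if cur not in ALLOWED.get(prev, set()):
            errors.append(f"{prev} -> {cur}")
    return errors

# A healthy replacement pod produces no errors:
assert validate_transitions(["Pending", "ContainerCreating", "Running"]) == []
# The stale-cache bug class: a Terminating pod shown as Running again.
assert validate_transitions(["Running", "Terminating", "Running"]) == [
    "Terminating -> Running"
]
```

Feeding the checker the badge history captured during each kubectl manipulation turns "the UI looked right" into a reproducible pass/fail result.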

A situation from practice

During a critical migration for a logistics company moving from monolithic Java EE applications to containerized microservices, the operations team relied on a custom dashboard to monitor a RabbitMQ-backed order processing system. The problem manifested when the dashboard displayed a deployment as "100% Complete" with all pods showing green status indicators, yet customer orders were failing with connection timeouts. Investigation revealed that while the Deployment controller had created new pods, the ReadinessProbe configurations were misaligned with actual service dependencies, causing pods to receive traffic before completing Flyway database migrations.

Three distinct solutions were considered for detecting such synchronization failures in future releases. The first approach proposed implementing manual kubectl get pods verification before signing off any deployment, which provided absolute certainty about actual cluster state but required fifteen minutes per deployment and created dangerous toil that would inevitably be skipped during high-pressure releases.

The second solution suggested automated screenshot comparison testing using Selenium. While this caught visual regressions in pod status colors, it failed to detect temporal misalignments where the UI briefly showed correct states before caching stale data during SSE reconnections.

The third methodology involved structured chaos engineering with controlled failure injection. This approach created NetworkPolicy objects to simulate etcd leader elections while monitoring dashboard behavior through browser DevTools' EventStream inspection.

The team chose the third solution because it addressed the root cause. The dashboard's React frontend was caching Pod objects in Redux state without invalidation during connection drops. This caused the UI to show "Ready" pods that had actually been OOMKilled and rescheduled.

By systematically blocking SSE connections for thirty-second intervals while simultaneously killing pods via kubectl delete, testers discovered that the reconnection logic replayed cached state before receiving fresh updates from the Kubernetes API Server. The result was a critical bug fix: the development team implemented ETag-based cache invalidation headers, reducing incident response time by 80% and eliminating the false-positive deployment confirmations that had previously plagued production releases.
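The shape of that fix can be approximated in a few lines: on reconnect, the client discards any cached object whose version tag no longer matches what a fresh list from the API server reports. This sketch uses Kubernetes resourceVersion strings as the validator; the class and field names are illustrative, not the team's actual code:

```python
class PodCache:
    """Toy client-side cache keyed by pod name, revalidated on reconnect."""

    def __init__(self):
        self._entries = {}  # name -> (resource_version, state)

    def store(self, name, resource_version, state):
        self._entries[name] = (resource_version, state)

    def get_state(self, name):
        entry = self._entries.get(name)
        return entry[1] if entry else None

    def revalidate(self, fresh_versions):
        """Drop entries whose resourceVersion is stale or whose pod is gone.

        fresh_versions maps pod name -> current resourceVersion, as returned
        by a full re-list after the SSE stream reconnects.
        """
        stale = [
            name for name, (rv, _) in self._entries.items()
            if fresh_versions.get(name) != rv
        ]
        for name in stale:
            del self._entries[name]
        return stale  # names the UI must re-render from fresh data


cache = PodCache()
cache.store("orders-7d4f9", "1001", "Ready")
cache.store("orders-8c2a1", "1002", "Ready")
# After reconnect: orders-7d4f9 was OOMKilled and rescheduled (new version),
# orders-8c2a1 is unchanged.
evicted = cache.revalidate({"orders-7d4f9": "1044", "orders-8c2a1": "1002"})
assert evicted == ["orders-7d4f9"]
assert cache.get_state("orders-7d4f9") is None
assert cache.get_state("orders-8c2a1") == "Ready"
```

The key property under test is that a pod killed during the SSE outage can never be rendered from the pre-outage cache entry.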

What candidates often miss

How do you accurately verify that Server-Sent Events (SSE) are delivering real-time updates without having access to server-side logs or the ability to modify backend code?

Many candidates suggest simply "waiting to see if the UI updates," but this fails to distinguish between polling mechanisms and true event-driven architectures. The correct approach involves opening browser DevTools and navigating to the Network tab's EventStream section, where you can inspect individual message payloads and their timestamps.

You should verify that the Content-Type header reads text/event-stream and that messages arrive as discrete events rather than batched HTTP responses. Additionally, test for reconnection resilience by using Chrome DevTools to simulate "Offline" mode for thirty seconds, then restoring connectivity while monitoring whether the client sends a Last-Event-ID header to request missed events, ensuring no state transitions were lost during the interruption.
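The reconnection contract can also be checked offline by parsing captured text/event-stream frames and confirming which id the client would echo back. A minimal parser sketch, following the SSE wire format (fields on their own lines, events separated by a blank line); the payload values are invented examples:

```python
def parse_sse(raw: str):
    """Parse a text/event-stream payload into (last_event_id, events)."""
    events, last_id, data_lines = [], None, []
    for line in raw.splitlines() + [""]:  # trailing "" flushes the last event
        if line.startswith("id:"):
            last_id = line[3:].strip()
        elif line.startswith("data:"):
            data_lines.append(line[5:].strip())
        elif line == "" and data_lines:
            events.append("\n".join(data_lines))
            data_lines = []
    return last_id, events


stream = (
    "id: 41\n"
    'data: {"pod": "orders-7d4f9", "phase": "Terminating"}\n'
    "\n"
    "id: 42\n"
    'data: {"pod": "orders-9b3e0", "phase": "Running"}\n'
    "\n"
)
last_id, events = parse_sse(stream)
assert last_id == "42"  # the value the client must send as Last-Event-ID
assert len(events) == 2
reconnect_headers = {"Last-Event-ID": last_id}
```

If the header the browser actually sends after the simulated offline window differs from the last id observed in the stream, events were dropped somewhere between the server and the client's reconnect logic.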

What is the critical distinction between ReadinessProbe failures and LivenessProbe failures in a Kubernetes dashboard context, and why does confusing them lead to false-positive deployment validations?

Candidates frequently miss that ReadinessProbe failures remove pods from Service endpoints (stopping traffic) without killing containers, while LivenessProbe failures trigger container restarts. In dashboard testing, this distinction manifests visually: a failing readiness probe should show a yellow "Not Ready" badge while the pod remains running, whereas liveness failures should show red "CrashLoopBackOff" states.

To test this properly, you must deploy a "flaky" application whose endpoints can toggle probe responses via environment variables, then verify that the dashboard accurately reflects the resulting membership changes in the corresponding Endpoints or EndpointSlice objects, cross-checked against kubectl output. Confusing these states causes testers to approve deployments where applications are running but not serving traffic, leading to silent production outages.

When testing dashboard resilience during etcd quorum loss, how do you distinguish between acceptable API Server degradation and critical UI failures that would mislead operators during incident response?

Most testers focus only on happy-path scenarios and miss that Kubernetes remains partially functional during etcd disruptions: reads often succeed while writes fail. A sophisticated testing methodology involves establishing a Kind cluster with three control plane nodes, then using iptables rules to block etcd peer traffic on 2380/tcp between the nodes to simulate network partitions and forced leader elections.

While the API Server returns HTTP 500 errors for write operations, the dashboard should display a clear "Control Plane Degraded" banner while continuing to show cached pod states with explicit "Last Updated" timestamps. Candidates often fail to verify that the UI disables destructive actions (such as scaling deployments or deleting pods) during these windows rather than merely showing spinners. The correct validation includes attempting dashboard operations and confirming they surface user-friendly error messages derived from the API Server's Status objects rather than generic JavaScript console errors.
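That last check can itself be made mechanical: feed the dashboard's error-handling path a real API server Status body and assert that the surfaced text comes from its message field, with a generic banner only as a fallback. A sketch of the expected translation, where the banner wording is an assumption about the dashboard, not a Kubernetes convention:

```python
import json

def surface_error(status_body: str) -> str:
    """Map a Kubernetes Status object to an operator-facing banner,
    preferring the API server's own message over a generic fallback."""
    try:
        status = json.loads(status_body)
    except json.JSONDecodeError:
        return "Control plane returned an unreadable response"
    if status.get("kind") != "Status":
        return "Control plane returned an unexpected object"
    msg = status.get("message") or status.get("reason") or "unknown error"
    return f"Control plane degraded: {msg}"


# Shape of a write-failure body during etcd quorum loss (abridged).
body = json.dumps({
    "kind": "Status",
    "apiVersion": "v1",
    "status": "Failure",
    "message": "etcdserver: request timed out",
    "reason": "Timeout",
    "code": 500,
})
assert surface_error(body) == (
    "Control plane degraded: etcdserver: request timed out"
)
# A raw proxy error page must not leak into the banner as-is.
assert "unreadable" in surface_error("<html>502</html>")
```

Running the scaling and deletion attempts against the partitioned cluster and comparing what the UI shows to this translation gives a pass/fail answer to "would this banner help an operator at 3 a.m.?"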