How to analyze and agree on failure scenarios and error handling in complex distributed IT systems?

Answer.

For a long time, error handling and failure scenarios played a secondary role in the design of distributed IT systems and took second place to business logic. As infrastructure grew in scale and complexity, it became clear that inadequate error handling leads to large-scale outages and data loss.

The problem is that complex systems fail in many ways, from the unavailability of individual services to data inconsistencies and partial failures of communication channels. Clients often treat only the obvious outages as "failures" (for example, the server is unavailable) and overlook chains of inter-service errors or degradation of the user experience.

An effective solution is built on a systematic approach:

  • Identification of all possible failure points.
  • Development of detailed scenarios for how each failure occurs, together with architects, QA, designers, and operations engineers.
  • Agreement with the business on how the system should behave (for example, whether orders can be delayed or whether operations need to be cached).
  • Clear documentation of all error message types and error-handling routes.
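
The steps above can be captured in a simple failure scenario catalog. The following is a minimal sketch, not a prescribed format: the `FailureScenario` fields, the `unresolved` helper, and the sample entries are all hypothetical and only illustrate how an analyst might track which failure points still lack an agreed behavior or a documented message.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    TECHNICAL = "technical"  # e.g. a dependency is unavailable
    BUSINESS = "business"    # e.g. a business rule rejects the request


@dataclass(frozen=True)
class FailureScenario:
    failure_point: str    # where the system can break
    trigger: str          # how the failure occurs
    kind: FailureKind
    agreed_behavior: str  # behavior agreed with the business; "" if still open
    user_message: str     # documented message shown to the user; "" if still open


def unresolved(scenarios):
    """Return failure points that still lack an agreed behavior or a message."""
    return [
        s.failure_point
        for s in scenarios
        if not s.agreed_behavior.strip() or not s.user_message.strip()
    ]


# Hypothetical sample entries, for illustration only.
catalog = [
    FailureScenario(
        "payment gateway", "request timeout", FailureKind.TECHNICAL,
        "keep the order as pending and retry the payment later",
        "Your payment is being processed. We will confirm it shortly.",
    ),
    FailureScenario(
        "mailing queue", "queue unavailable", FailureKind.TECHNICAL, "", "",
    ),
]

print(unresolved(catalog))  # failure points that still need agreement
```

A check like `unresolved` could be run before a requirements review so that open failure points are discussed explicitly rather than discovered in production.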

Key features:

  • Handling not only fatal but also soft/temporary failures (for example, temporary unavailability of an external service).
  • Inclusion of UI and functionality degradation scenarios.
  • Distinction between business errors and technical failures at all stages of requirement development.
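
Handling of soft failures with a degraded fallback can be sketched as follows. This is an illustration under stated assumptions: `TemporaryFailure` and `call_with_retry` are hypothetical names, and the retry count and delays are placeholder values that would be agreed with the business and operations.

```python
import time


class TemporaryFailure(Exception):
    """A soft failure that may succeed on retry (hypothetical type)."""


def call_with_retry(operation, fallback, attempts=3, base_delay=0.1,
                    sleep=time.sleep):
    """Retry a temporarily failing operation, then degrade to a fallback.

    The delay doubles after each failed attempt (exponential backoff).
    Other exception types are not caught and propagate to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TemporaryFailure:
            if attempt < attempts - 1:  # no need to wait after the last try
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: return the degraded result instead of an error.
    return fallback()


def recommendations_down():
    # Stand-in for a call to an external service that is currently down.
    raise TemporaryFailure("recommendation service timed out")


# Degraded behavior: show an empty recommendations block instead of failing.
result = call_with_retry(recommendations_down, fallback=lambda: [],
                         sleep=lambda seconds: None)
print(result)
```

The `sleep` parameter is injectable only so the example and its tests run without real waiting; production code would normally keep the default.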

Tricky Questions.

What is the difference between an exception at the application level and at the infrastructure level?

Candidates often confuse business errors at the application level (for example, "user not found") with technical failures at the infrastructure level (for example, "database unavailable"). The application must always distinguish clearly between the two types of exceptions and apply a different handling strategy to each (rollback, notifications, alerting).
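
One way to make this distinction explicit in code is to use separate exception types. The sketch below assumes two hypothetical base classes, `BusinessError` and `TechnicalFailure`, and a hypothetical `handle` wrapper; the response fields are illustrative, not a standard.

```python
class BusinessError(Exception):
    """Expected outcome of a business rule, e.g. "user not found"."""


class TechnicalFailure(Exception):
    """Infrastructure problem, e.g. "database unavailable"."""


def handle(operation, alerts):
    """Run an operation and apply a different strategy per error type."""
    try:
        return {"status": "ok", "result": operation()}
    except BusinessError as err:
        # Business error: tell the user what happened, no alert is needed.
        return {"status": "rejected", "message": str(err)}
    except TechnicalFailure as err:
        # Technical failure: alert operations, show the user a generic message.
        alerts.append(repr(err))
        return {"status": "unavailable", "message": "Please try again later."}


def find_user():
    raise BusinessError("user not found")


def query_database():
    raise TechnicalFailure("database unavailable")


alerts = []
print(handle(find_user, alerts))
print(handle(query_database, alerts))
print(alerts)
```

Only the technical failure reaches the `alerts` list, and the user never sees its internal description.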

What failure scenarios should be modeled for an internal API if it is not public?

Failure scenarios are relevant for any API. Even an internal API can fail (even within a single automation boundary), so its failures need to be modeled explicitly in order to handle unreliable or missing data properly.
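
As an illustration, a consumer of an internal API can classify each response instead of trusting it blindly. The `parse_stock_response` function, the `quantity` field, and the outcome names below are hypothetical and stand in for whatever contract the internal API actually has.

```python
def parse_stock_response(payload):
    """Classify a response from a hypothetical internal stock API.

    Returns a pair (outcome, quantity), where outcome is one of
    "ok", "no_response", or "invalid_data".
    """
    if payload is None:
        # The call timed out or the service returned nothing.
        return ("no_response", None)
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity < 0:
        return ("invalid_data", None)
    return ("ok", quantity)


print(parse_stock_response({"quantity": 5}))
print(parse_stock_response({"sku": "A-1"}))  # the field is missing
print(parse_stock_response(None))            # no response at all
```

Each outcome can then be mapped to a separately agreed scenario, for example showing "availability unknown" when the data is missing.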

Should the system hide all errors from the user for maximum UX?

No. Hiding every error misleads the user. It is important to find a balance between informativeness (so that the user understands what to do next) and security (without exposing implementation details).
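
This balance can be sketched as a mapping from internal error codes to documented user messages. Everything here is an assumption for illustration: the error codes, the message texts, and the `to_user_error` helper are hypothetical.

```python
import uuid

# Hypothetical catalog of documented user-facing messages.
USER_MESSAGES = {
    "PAYMENT_GATEWAY_DOWN": "We could not process your payment. "
                            "Please try again in a few minutes.",
    "SEARCH_DEGRADED": "Search results may be incomplete right now.",
}
DEFAULT_MESSAGE = "Something went wrong. Please try again or contact support."


def to_user_error(code, internal_detail, log):
    """Build a user-facing error and keep the technical detail internal."""
    reference = uuid.uuid4().hex[:8]  # lets support find the log entry
    log.append((reference, code, internal_detail))
    return {
        "message": USER_MESSAGES.get(code, DEFAULT_MESSAGE),
        "reference": reference,
    }


log = []
error = to_user_error("PAYMENT_GATEWAY_DOWN",
                      "ConnectionError: 10.0.0.5:443 refused", log)
print(error["message"])
```

The user learns what to do next and gets a reference for support, while the host address and exception text stay in the internal log.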

Typical Mistakes and Anti-Patterns

  • Unformalized failure handling (left to "default" catch blocks).
  • Lack of degradation scenarios for partial failures (for example, in a microservice system, a non-functioning cart completely blocks order checkout).
  • Ignoring the accumulation of "silent" failures (no alerting or monitoring for exceptional situations).
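
The last two anti-patterns can be countered together: an optional component degrades to a default value, and every swallowed failure is still counted so that it does not stay silent. The sketch below is only an illustration; `FailureMonitor`, `load_optional`, and the threshold of 3 are hypothetical.

```python
from collections import Counter


class FailureMonitor:
    """Count failures per source and alert once a threshold is reached."""

    def __init__(self, threshold, alert):
        self.threshold = threshold
        self.alert = alert  # callable that receives the alert text
        self.counts = Counter()

    def record(self, source):
        self.counts[source] += 1
        # Alert exactly once, when the count first reaches the threshold.
        if self.counts[source] == self.threshold:
            self.alert(f"{source}: {self.threshold} failures recorded")


def load_optional(name, loader, monitor, default):
    """Load a non-critical block; degrade to a default but record the failure."""
    try:
        return loader()
    except Exception:
        monitor.record(name)  # the failure is handled, but not silently
        return default


def broken_reviews():
    raise RuntimeError("review service unavailable")


alerts = []
monitor = FailureMonitor(threshold=3, alert=alerts.append)
for _ in range(4):
    reviews = load_optional("reviews", broken_reviews, monitor, default=[])

print(reviews)
print(alerts)
```

The page keeps working without reviews, and operations receive one alert instead of never learning about the problem.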

Real-Life Example

Negative Case: In a large e-commerce project, a systems analyst left the handling of all network errors to the architecture. During emergency updates and a failure of the mailing service, the system did not send order notifications, and users could not tell whether their orders had been created.

Pros:

  • Simplification of requirements description.

Cons:

  • Data loss (order creation could not be proven).
  • Increased support costs after product launch.

Positive Case: The systems analyst, together with the architect, modeled separate scenarios for each critical service: unavailability of the mailing queue, payment gateway failures, and degradation of the search service. They also prepared user-friendly messages for clients.

Pros:

  • Improved customer trust in the platform.
  • Minimization of operational risks.

Cons:

  • Increased documentation volume and testing complexity.