Throughout the development of distributed IT systems, error handling and failure scenarios long played a secondary role to business logic. As infrastructure grew in scale and complexity, it became clear that inadequate error handling leads to large-scale outages and data loss.
The problem is that complex systems fail in many ways, from the unavailability of individual services to data inconsistencies and partial failures of communication channels. Clients often take "failures" to mean only the obvious outages (for example, the server is unavailable) and overlook chains of inter-service errors or degraded user experience.
An effective solution is built on a systematic approach:
Key features:
What is the difference between an exception at the application level and at the infrastructure level?
Candidates often confuse business errors (for example, "user not found") with real failures (for example, "database unavailable"). The application must clearly distinguish between these two types of exceptions and handle each with its own strategy (rollback, notifications, alerting).
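The distinction can be sketched in code. The example below is a minimal illustration, not a reference implementation: the class names `BusinessError`, `InfrastructureError`, `UserNotFound`, `DatabaseUnavailable` and the `handle` function are hypothetical names chosen for this sketch, and the response dictionaries stand in for whatever rollback, notification, or alerting mechanism a real system would use.

```python
class BusinessError(Exception):
    """Expected domain outcome, e.g. "user not found" (hypothetical base class)."""


class InfrastructureError(Exception):
    """Real failure, e.g. "database unavailable" (hypothetical base class)."""


class UserNotFound(BusinessError):
    pass


class DatabaseUnavailable(InfrastructureError):
    pass


def handle(operation):
    """Run the operation and apply a different strategy per exception type."""
    try:
        return {"status": "ok", "result": operation()}
    except BusinessError as exc:
        # Business error: a normal outcome. Report it to the caller, no alert.
        return {"status": "rejected", "reason": str(exc), "alert": False}
    except InfrastructureError:
        # Infrastructure failure: hide internal details from the caller and
        # flag the event for alerting (and, in a real system, for rollback).
        return {
            "status": "failed",
            "reason": "service temporarily unavailable",
            "alert": True,
        }
```

Keeping two separate base classes means each new error type inherits the right strategy by default, so a forgotten case falls into an explicit category instead of a generic catch-all.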
What failure scenarios should be modeled for an internal API if it is not public?
Failure scenarios matter for any API. Even an internal API can fail (even within a single automation environment), so its failures must be modeled explicitly to ensure that the system handles unreliable or missing data correctly.
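As an illustration, a caller of an internal API might model each failure scenario as an explicit outcome. The sketch below assumes a hypothetical internal stock service: `fetch`, the `quantity` field, and the outcome names `ok`, `unavailable`, and `invalid_data` are all invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StockResult:
    quantity: Optional[int]
    outcome: str  # "ok", "unavailable", or "invalid_data"


def get_stock(fetch: Callable[[str], object], sku: str) -> StockResult:
    """Call a (hypothetical) internal API and classify every failure scenario."""
    try:
        payload = fetch(sku)
    except (TimeoutError, ConnectionError):
        # Scenario 1: the internal service is unreachable or too slow.
        return StockResult(None, "unavailable")

    # Scenario 2: the service answered, but the data is missing or malformed.
    quantity = payload.get("quantity") if isinstance(payload, dict) else None
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 0:
        return StockResult(None, "invalid_data")

    return StockResult(quantity, "ok")
```

Because the outcome is part of the return value, the calling code has to decide what each scenario means for the business process instead of treating missing data as a valid answer.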
Should the system hide all errors from the user for maximum UX?
No. Hiding every error misinforms the user. The goal is a balance between informativeness (the user understands what to do next) and safety (implementation details are not exposed).
Negative Case: In a large e-commerce project, a systems analyst left the handling of all network errors to the architecture. During emergency updates and a failure of the mailing service, the system sent no order notifications, and users could not tell whether their orders had been created.
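One possible way to strike this balance is to map internal error codes to prepared user messages while logging the technical detail separately. The sketch below is an assumption-laden example: the error codes, message texts, and the `to_user_response` function are hypothetical, and the short reference ID is just one way to let support staff link a user report to the log entry.

```python
import logging
import uuid

logger = logging.getLogger("errors")

# Hypothetical error codes and messages: each one tells the user what to do next.
USER_MESSAGES = {
    "payment_gateway_down": "Payment is temporarily unavailable. Please try again in a few minutes.",
    "search_degraded": "Search results may be incomplete right now. Try browsing by category.",
}
DEFAULT_MESSAGE = "Something went wrong on our side. Please try again later."


def to_user_response(error_code: str, internal_detail: str) -> dict:
    """Build a safe user-facing response; keep implementation details in the log."""
    reference = uuid.uuid4().hex[:8]
    # Full technical detail goes to the log only, never to the user.
    logger.error("ref=%s code=%s detail=%s", reference, error_code, internal_detail)
    return {
        "message": USER_MESSAGES.get(error_code, DEFAULT_MESSAGE),
        "reference": reference,
    }
```

The user sees an actionable message and a reference, while stack traces, host names, and similar details stay in the logs.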
Pros:
Cons:
Positive Case: The systems analyst, together with the architect, modeled separate scenarios for each critical service: unavailability of the mailing queue, payment gateway failures, and degradation of the search service. They also prepared user-friendly messages for clients.
Pros:
Cons: