In distributed architectures, error handling should be centralized, predictable, and resilient to various types of failures that are inevitable when working with network services. Recommended patterns include Retry, Circuit Breaker, Timeout, Fallback, and centralized logging/monitoring.
Principles:
Example of Circuit Breaker in Python using the pybreaker library:
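import pybreaker
import requests

# Open the circuit after 3 consecutive failures; allow a trial call after 60 seconds.
breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=60)

@breaker
def get_data():
    response = requests.get('http://service/api/data', timeout=3)
    response.raise_for_status()  # count HTTP 4xx/5xx responses as failures too
    return response.text

try:
    data = get_data()
except (pybreaker.CircuitBreakerError, requests.RequestException):
    # fallback: return a stub or error
    data = 'Fallback data'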
Key features:
Is it acceptable to give clients all the details of an exception during errors?
No. Disclosing exception details is a security risk: stack traces and error messages can reveal internal structure, queries, or credentials. Return only general information in responses, and log the technical details in internal systems.
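The split between a generic response and a detailed internal log can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the names to_client_error and handle, the logger name, and the response shape are all assumptions made for the example. The error_id is a correlation id, so support staff can find the full traceback in the logs without exposing it to the client.

```python
import logging
import uuid

logger = logging.getLogger("api.errors")  # hypothetical logger name


def to_client_error(exc: Exception) -> dict:
    """Log full details internally and return a generic payload for the client."""
    error_id = uuid.uuid4().hex  # correlation id linking the response to the log entry
    # exc_info=exc attaches the traceback to the internal log record only
    logger.error("Unhandled error %s", error_id, exc_info=exc)
    return {"error": "Internal server error", "error_id": error_id}


def handle(request_fn):
    """Run a handler (any zero-argument callable) and sanitize any failure."""
    try:
        return 200, request_fn()
    except Exception as exc:
        return 500, to_client_error(exc)
```

The client sees only the generic message and the id; the exception text never leaves the service.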
Is it enough to simply implement "retry" for network errors between services?
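No. A naive retry can worsen the problem: immediate repeated requests add load to a service that is already struggling. It is better to retry with backoff (an increasing delay between attempts) and a limit on the number of attempts, rather than retrying at a fixed rate.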
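One common way to implement backoff is exponential delay with random jitter. The sketch below uses only the standard library; the function name retry_with_backoff, its default values, and the choice of retryable exceptions are assumptions for illustration and should be tuned per service.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the listed exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # the delay ceiling doubles each attempt: base, 2*base, 4*base, ... up to max_delay
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # a random delay below the ceiling spreads out retries from many clients
            sleep(random.uniform(0, cap))
```

The sleep parameter exists so that tests can replace the real delay. Only errors that are likely to be transient should be listed in retry_on; retrying a validation error, for example, will never succeed.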
Is it better to store logs on the local disk of each microservice?
No. The best option is centralized log collection (e.g., the ELK stack, or Loki with Grafana), so that logs from all services can be searched and analyzed in one place.
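Centralized collection is easier when each service writes structured logs to standard output and a separate agent ships them to the log store. The sketch below shows one possible setup with the standard logging module; JsonFormatter, configure_logging, and the field names are assumptions for the example, not a format required by any particular collector.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for a log shipper to parse."""

    def __init__(self, service):
        super().__init__()
        self.service = service  # hypothetical service name used to filter logs centrally

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(service, stream=sys.stdout):
    """Send all log records as JSON lines to the given stream (stdout by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    root = logging.getLogger()
    root.handlers = [handler]  # replace existing handlers so output is not duplicated
    root.setLevel(logging.INFO)
```

A service would call configure_logging("orders") once at startup; after that, ordinary logging calls produce one JSON line each, tagged with the service name.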