Organizing monitoring and logging in distributed IT architectures is key to ensuring resilience, identifying and analyzing errors, and evaluating service performance. It is important to implement centralized logging and a metrics system for all services to obtain a complete picture of what is happening in the system.
General steps:
- Set up centralized, structured (e.g., JSON) logging for all services.
- Collect metrics at both the infrastructure level (CPU, RAM) and the application level (response time, error counts).
- Implement distributed tracing with a unique Trace-Id to follow requests across service boundaries.
Example of logging setup in Python:
```python
import logging

from pythonjsonlogger import jsonlogger

# Configure the root logger to emit structured JSON records
logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

# Extra fields become keys in the JSON output
logger.info('Service started', extra={'service': 'orders', 'env': 'prod'})
```
Common questions:
Should we always log all actions of the service at the INFO level?
Answer: No. Doing so quickly inflates log volume, degrades performance, and makes real errors harder to find. It is better to follow the semantics of log levels: DEBUG for debugging details, INFO for important events, ERROR for critical issues.
Is it enough to collect metrics only at the infrastructure level (CPU, RAM) and ignore metrics at the application level?
Answer: No. Application-level metrics (for example, response time or error counts) are critical for business analytics and operational response: they reveal bottlenecks in service logic rather than only in the hardware.
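A minimal pure-Python sketch of such application-level metrics (in production a metrics library would normally be used instead; the `AppMetrics` class and endpoint names are illustrative):

```python
from collections import defaultdict

class AppMetrics:
    """Tracks request counts, error counts, and response times per endpoint."""

    def __init__(self):
        self.request_count = defaultdict(int)
        self.error_count = defaultdict(int)
        self.latencies = defaultdict(list)

    def observe(self, endpoint, duration, error=False):
        # Record one handled request: count it, note errors, keep the latency sample
        self.request_count[endpoint] += 1
        if error:
            self.error_count[endpoint] += 1
        self.latencies[endpoint].append(duration)

    def avg_latency(self, endpoint):
        samples = self.latencies[endpoint]
        return sum(samples) / len(samples) if samples else 0.0

metrics = AppMetrics()
metrics.observe('/orders', 0.120)
metrics.observe('/orders', 0.480, error=True)
print(metrics.request_count['/orders'])  # 2
print(metrics.avg_latency('/orders'))    # 0.3
```

Even this simple aggregation exposes problems (rising latency, error spikes on one endpoint) that CPU and RAM graphs alone would never show.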
Is standard HTTP request tracing by the web server sufficient in a distributed system?
Answer: No. For complex scenarios spanning chains of services, full distributed tracing with a unique Trace-Id is needed to see the path and processing time of a request at every stage.
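Trace-Id propagation can be sketched as follows; the `X-Trace-Id` header name and the helper functions are illustrative assumptions, not a standard API:

```python
import uuid

def handle_incoming(headers):
    """Reuse the caller's Trace-Id, or start a new trace at the edge."""
    return headers.get('X-Trace-Id') or uuid.uuid4().hex

def outgoing_headers(trace_id):
    """Attach the same Trace-Id to every downstream call."""
    return {'X-Trace-Id': trace_id}

# Service A receives an external request (no trace yet), so it mints an id ...
trace_id = handle_incoming({})
# ... and service B receives A's call carrying that same Trace-Id,
# so every log line and span along the chain can be correlated.
assert handle_incoming(outgoing_headers(trace_id)) == trace_id
```

Logging this id with every record (for example via the `extra` fields shown earlier) is what lets a log aggregator reassemble the full path of one request across services.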