An SLA (Service Level Agreement) is a formal agreement between the client and the IT team that defines the parameters of service quality.
At the architecture level, SLA compliance is ensured through technical means, processes, monitoring, and automation. This requires a clear understanding of which parts of the system are critical, how fault-tolerant the system is, and how it scales under load.
# Example alert configuration for API response delay
- alert: HighResponseLatency
  expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Response time exceeds SLA (99% > 1 sec)"
What are operational metrics and why are they needed?
Operational metrics are indicators characterizing the actual performance parameters of the system, such as availability, latency, and error rate. They are needed to measure how well the system meets the SLA and to quickly respond to deviations.
Code example:
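# Example of exporting metrics through the Prometheus client
from prometheus_client import start_http_server, Summary

# Summary metric that tracks how long request processing takes
REQUEST_TIME = Summary('request_processing_seconds', 'Request processing time')

# Expose the metrics over HTTP for scraping (port 8000 is an assumed example value)
start_http_server(8000)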
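To make the three metrics named above more concrete, here is a minimal sketch of how availability, error rate, and a latency percentile could be derived from raw request records. The `RequestRecord` type, the request-based definition of availability, and the sample data are assumptions made for illustration; a real system would normally compute these in its monitoring backend.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical record of one handled request."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def error_rate(records):
    """Share of requests that ended in a server error (5xx)."""
    errors = sum(1 for r in records if r.status >= 500)
    return errors / len(records)


def availability(records):
    """Request-based availability: share of requests served without a 5xx."""
    return 1.0 - error_rate(records)


def latency_percentile(records, q):
    """Nearest-rank percentile of request duration, with q in (0, 1]."""
    durations = sorted(r.duration_s for r in records)
    rank = math.ceil(q * len(durations))  # 1-based rank
    return durations[rank - 1]


# Made-up sample data
records = [
    RequestRecord(0.12, 200),
    RequestRecord(0.30, 200),
    RequestRecord(1.40, 200),
    RequestRecord(0.25, 503),
]

print(availability(records), error_rate(records), latency_percentile(records, 0.99))
```

With so few samples the 99th percentile is simply the slowest request; percentiles only become meaningful on larger windows of traffic.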
SLA, SLO, and SLI: what’s the difference?
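In common usage, an SLI (Service Level Indicator) is a measured value such as the percentage of successful requests, an SLO (Service Level Objective) is the internal target set for that indicator, and the SLA is the external agreement built on top of it, usually with a looser threshold and consequences if it is breached. The sketch below illustrates that relationship under stated assumptions: the function names and both threshold values are hypothetical.

```python
def measure_sli(successful_requests, total_requests):
    """SLI: the measured value, here the percentage of successful requests."""
    return 100.0 * successful_requests / total_requests


SLO_TARGET_PCT = 99.9  # hypothetical internal objective for the SLI
SLA_COMMIT_PCT = 99.5  # hypothetical contractual commitment to the client


def classify(sli_pct):
    """Compare a measured SLI against the SLO target and the SLA commitment."""
    if sli_pct < SLA_COMMIT_PCT:
        return "SLA breached"
    if sli_pct < SLO_TARGET_PCT:
        return "SLO missed, SLA still met"
    return "within SLO"


print(classify(measure_sli(99_700, 100_000)))
```

Keeping the SLO stricter than the SLA gives the team a warning zone: the objective is missed before the contractual commitment is at risk.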
Does high availability alone ensure SLA compliance?
No. An SLA covers not only availability but also performance (latency), stability (error rate), and correctness of operation. A service can be reachable all the time and still violate the SLA if its responses are too slow, fail too often, or return wrong results.
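The point can be illustrated with a small compliance check that evaluates each SLA dimension separately. This is a sketch only: the threshold values and dictionary keys are hypothetical, except that the 1-second limit on 99th percentile latency mirrors the alert rule shown earlier. Correctness of operation is left out because it cannot be reduced to a single number this simply.

```python
# Hypothetical SLA thresholds, chosen only for illustration
SLA = {
    "availability_min": 0.999,  # minimum share of time the service is reachable
    "p99_latency_max_s": 1.0,   # maximum 99th percentile response time, seconds
    "error_rate_max": 0.01,     # maximum share of failed requests
}


def sla_violations(measured):
    """Return the names of the SLA dimensions that the measurements violate."""
    violations = []
    if measured["availability"] < SLA["availability_min"]:
        violations.append("availability")
    if measured["p99_latency_s"] > SLA["p99_latency_max_s"]:
        violations.append("latency")
    if measured["error_rate"] > SLA["error_rate_max"]:
        violations.append("error_rate")
    return violations


# Perfect availability, but responses are too slow
measured = {"availability": 1.0, "p99_latency_s": 2.3, "error_rate": 0.002}
print(sla_violations(measured))  # prints ['latency']
```

Here the service is never down, yet the check still reports a latency violation, so the SLA as a whole is not met.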