Answer.

An SLA (Service Level Agreement) is a formal agreement between the client and the IT team that defines measurable service-quality parameters, such as availability and response time.

At the architecture level, SLA compliance is ensured through technical measures, processes, monitoring, and automation. This requires a clear understanding of which parts of the system are critical, how fault-tolerant they are, and how they scale.
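
To make an availability target concrete, it can help to translate it into the downtime it allows. The following is a minimal sketch; the helper name `allowed_downtime_minutes` and the 30-day period are assumptions for illustration, not part of any specific SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Hypothetical targets, printed to compare how quickly the budget shrinks
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} min of downtime")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the target should be agreed before the redundancy design is chosen.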

Code example (monitoring SLA with Prometheus and Alertmanager):

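# Example Prometheus alerting rule for API response latency
groups:
  - name: sla-alerts  # hypothetical group name
    rules:
      - alert: HighResponseLatency
        # p99 latency per service over the last 5 minutes, in seconds
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p99 response time exceeds the SLA threshold (> 1 s)"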

Key features:

Identify the business-critical code paths and define SLA metrics for them.
The architecture should include a system for collecting and storing metrics, an alerting mechanism, and redundancy for critical components.
Implement automated monitoring and centralized logging so that deviations can be detected and investigated quickly.

Tricky questions.

What are operational metrics and why are they needed?

Operational metrics are indicators characterizing the actual performance parameters of the system, such as availability, latency, and error rate. They are needed to measure how well the system meets the SLA and to quickly respond to deviations.

Code example:

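# Example of exporting metrics through the Prometheus Python client
from prometheus_client import start_http_server, Summary

# Summary tracks the count and total duration of observed requests
REQUEST_TIME = Summary('request_processing_seconds', 'Request processing time')
start_http_server(8000)  # expose /metrics over HTTP (port 8000 is an example)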
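
The metrics named above (latency and error rate) can also be derived directly from raw request data. The sketch below is illustrative only: the `RequestSample` type, the 5xx-means-failure rule, and the nearest-rank percentile method are assumptions, not something a particular monitoring stack prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    duration_seconds: float
    status_code: int


def error_rate(samples: list[RequestSample]) -> float:
    """Fraction of requests that failed (assumed here: any 5xx status)."""
    if not samples:
        return 0.0
    failed = sum(1 for s in samples if s.status_code >= 500)
    return failed / len(samples)


def latency_percentile(samples: list[RequestSample], percentile: float) -> float:
    """Nearest-rank percentile of request duration, in seconds."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(s.duration_seconds for s in samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical data: 100 requests, the two slowest of which failed
    data = [RequestSample(float(i), 500 if i > 98 else 200) for i in range(1, 101)]
    print("error rate:", error_rate(data))
    print("p99 latency:", latency_percentile(data, 99))
```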

SLA, SLO, and SLI: what’s the difference?

SLA: the agreement between the client and the service on the quality to be delivered.
SLO: a specific target or threshold for one metric, for example "99.9% of requests succeed" (an SLA may include multiple SLOs).
SLI: the actual measured value of that metric (e.g., % of successful requests per hour).
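
The relationship between an SLI and an SLO can be sketched in a few lines: the SLI is computed from measurements, and the SLO is the threshold it is compared against. The function names and the numbers in the usage section are hypothetical.

```python
def success_rate_sli(successful: int, total: int) -> float:
    """SLI: percentage of successful requests in the measurement window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def meets_slo(sli_percent: float, slo_target_percent: float) -> bool:
    """SLO check: does the measured SLI reach the agreed target?"""
    return sli_percent >= slo_target_percent


if __name__ == "__main__":
    # Hypothetical hour of traffic checked against a 99.9% success SLO
    sli = success_rate_sli(successful=9_995, total=10_000)
    print(f"SLI = {sli:.2f}%, SLO met: {meets_slo(sli, 99.9)}")
```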

Does high availability alone ensure SLA compliance?

No, SLA includes not only availability but also performance (latency), stability (error rate), and correctness of operation. High availability alone does not guarantee that other SLA requirements are met.
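
As an illustration of this point, the sketch below checks several SLA dimensions at once. The `MeasuredSlis` and `SlaTargets` types and all thresholds are assumptions made for the example; a real SLA would define its own dimensions and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasuredSlis:
    availability_percent: float
    p99_latency_seconds: float
    error_rate_percent: float


@dataclass(frozen=True)
class SlaTargets:
    min_availability_percent: float
    max_p99_latency_seconds: float
    max_error_rate_percent: float


def sla_violations(measured: MeasuredSlis, targets: SlaTargets) -> list[str]:
    """Return the names of the SLA dimensions that are currently violated."""
    violations = []
    if measured.availability_percent < targets.min_availability_percent:
        violations.append("availability")
    if measured.p99_latency_seconds > targets.max_p99_latency_seconds:
        violations.append("latency")
    if measured.error_rate_percent > targets.max_error_rate_percent:
        violations.append("error_rate")
    return violations


if __name__ == "__main__":
    # Hypothetical service: highly available, but too slow
    measured = MeasuredSlis(availability_percent=99.99, p99_latency_seconds=2.5, error_rate_percent=0.1)
    targets = SlaTargets(min_availability_percent=99.9, max_p99_latency_seconds=1.0, max_error_rate_percent=1.0)
    print(sla_violations(measured, targets))
```

With these inputs the service passes the availability check yet still fails on latency, which is exactly the case the answer describes.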

How to organize SLA (Service Level Agreement) at the architecture level of IT systems and what metrics are important to consider?
