How is SLA (Service Level Agreement) designed at the IT architecture level and what metrics should be considered?

Answer.

Designing an SLA (Service Level Agreement) at the architecture level means defining measurable, monitorable quality indicators for a service and the targets they must meet. During architectural design, the key SLA parameters are chosen together with the technical mechanisms that will measure them.

Basic steps:

  1. Identify the business-critical metrics: response time, availability, error rate, and recovery time.
  2. Build monitoring into the architecture so that these metrics are collected automatically.
  3. Agree on the SLA targets with the client; the agreed values then drive the monitoring thresholds and alerts.
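
The steps above can be sketched as a small data model: one record for the agreed targets, one for the values that monitoring reports, and a check that compares them. This is only an illustration; the class names, field names, and the 30-minute recovery target are assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    """Hypothetical SLA targets agreed with the client (names are illustrative)."""
    min_availability_pct: float
    max_p95_latency_ms: float
    max_error_rate_pct: float
    max_recovery_minutes: float


@dataclass(frozen=True)
class MeasuredMetrics:
    """Values a monitoring system would report for one reporting period."""
    availability_pct: float
    p95_latency_ms: float
    error_rate_pct: float
    longest_recovery_minutes: float


def find_violations(targets: SlaTargets, measured: MeasuredMetrics) -> list[str]:
    """Return the names of the SLA metrics whose targets were missed."""
    violations = []
    if measured.availability_pct < targets.min_availability_pct:
        violations.append("availability")
    if measured.p95_latency_ms > targets.max_p95_latency_ms:
        violations.append("response_time")
    if measured.error_rate_pct > targets.max_error_rate_pct:
        violations.append("error_rate")
    if measured.longest_recovery_minutes > targets.max_recovery_minutes:
        violations.append("recovery_time")
    return violations


# Assumed targets: 99.9% availability, 200 ms p95, 0.5% errors, 30 min recovery.
targets = SlaTargets(99.9, 200.0, 0.5, 30.0)
measured = MeasuredMetrics(99.95, 240.0, 0.2, 12.0)
print(find_violations(targets, measured))  # ['response_time']
```

In a real system the list of violations would typically feed alerts and periodic compliance reports.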

Example of defining SLA for a web service:

  • Availability: 99.9% (no more than about 43 minutes of downtime in a 30-day month)
  • API response time: no more than 200ms for 95% of requests
  • Error rate: no more than 0.5%
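
The numbers in this example can be derived and checked with a few lines of arithmetic. The sketch below assumes a 30-day month and uses the nearest-rank definition of a percentile (monitoring tools may interpolate differently); the function names and sample data are hypothetical.

```python
import math


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime in minutes over a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def error_rate_pct(failed: int, total: int) -> float:
    """Share of failed requests as a percentage (0.0 when there was no traffic)."""
    return 100.0 * failed / total if total else 0.0


# 99.9% availability over 30 days leaves roughly 43.2 minutes of downtime.
print(round(downtime_budget_minutes(99.9), 1))

# Made-up latency samples in milliseconds: 19 fast requests and one slow outlier.
latencies_ms = [120.0] * 19 + [900.0]
p95 = percentile_nearest_rank(latencies_ms, 95.0)
print(p95 <= 200.0)  # the single outlier falls outside the 95th percentile

print(error_rate_pct(failed=4, total=1000) <= 0.5)
```

The percentile target tolerates a small share of slow requests, which is why the one 900 ms outlier does not break the 200 ms limit here.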

Key features:

  • An SLA affects not only the technical architecture but also operations and support processes.
  • The SLA is usually documented in specifications and contracts.
  • Automatic monitoring, alerts, and reports make it possible to track SLA compliance reliably.

Tricky questions.

Can an SLA be built solely on technical metrics (e.g., error rate and response time)?

Answer: No. Business metrics (e.g., the success rate of business operations) must also be considered so that the SLA reflects business expectations.
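
A short sketch of why the two kinds of metric can diverge: a request may return a successful HTTP status while the business operation behind it still fails. The `OrderAttempt` record and the sample data are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderAttempt:
    http_status: int         # what a technical metric sees
    payment_confirmed: bool  # what the business cares about


def technical_success_pct(attempts: list[OrderAttempt]) -> float:
    """Share of requests that did not end in a server error (status below 500)."""
    if not attempts:
        raise ValueError("attempts must not be empty")
    ok = sum(1 for a in attempts if a.http_status < 500)
    return 100.0 * ok / len(attempts)


def business_success_pct(attempts: list[OrderAttempt]) -> float:
    """Share of order attempts that ended with a confirmed payment."""
    if not attempts:
        raise ValueError("attempts must not be empty")
    ok = sum(1 for a in attempts if a.payment_confirmed)
    return 100.0 * ok / len(attempts)


# Every request returns HTTP 200, but two orders never get a confirmed payment.
attempts = [OrderAttempt(200, True)] * 8 + [OrderAttempt(200, False)] * 2
print(technical_success_pct(attempts))  # 100.0
print(business_success_pct(attempts))   # 80.0
```

An SLA that tracked only the first number would report full compliance while a fifth of the orders were failing.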


Is meeting an SLA a one-time effort that needs no adjustment after the system is launched?

Answer: No. The SLA is revised as the business changes, load grows, and new requirements appear.


Can SLA monitoring rely solely on external checks (ping, HTTP check) without agents inside the services?

Answer: Not recommended. External monitoring is important, but internal collection (agents gathering metrics inside the services) can detect hidden issues before they become visible externally.
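
As a rough illustration of internal collection, the sketch below wraps a function so that every call records its latency and whether it failed, which an external ping cannot see. `InProcessMetrics` and `lookup_price` are hypothetical names; a real service would export such counters to its monitoring system, and this minimal version is not thread-safe.

```python
import time
from functools import wraps


class InProcessMetrics:
    """Hypothetical in-process collector for call counts, errors, and latencies."""

    def __init__(self):
        self.calls = 0
        self.errors = 0
        self.latencies_ms = []

    def track(self, func):
        """Decorator that records one latency sample per call and counts exceptions."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            self.calls += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise  # the caller still sees the original error
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                self.latencies_ms.append(elapsed_ms)
        return wrapper


metrics = InProcessMetrics()


@metrics.track
def lookup_price(item_id: int) -> float:
    """Stand-in for a real handler; negative ids simulate a failure."""
    if item_id < 0:
        raise ValueError("unknown item")
    return 9.99


lookup_price(1)
try:
    lookup_price(-1)
except ValueError:
    pass

print(metrics.calls, metrics.errors, len(metrics.latencies_ms))  # 2 1 2
```

Combining such internal counters with external checks covers both views: what the service experiences inside and what the client sees from outside.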