Service level agreements (SLAs) first became popular with fixed-line telecom companies in the late 1980s. For the last 20 years, billboards with five nines (99.999%) have peppered every interstate in major US metros. But are numbers of nines in an SLA the right metrics for how reliability should be communicated within an organization and externally to customers today?
SLAs exist for a reason: lawyers. Anyone entering into a services contract needs a way out if the provider does not perform.
We all know the dance that happens with SLAs in contract cycles. The customer’s legal team (along with the procurement team) wants as many guaranteed nines as possible, and the service provider’s operations staff want to commit firmly to as few nines as possible. Usually, customers negotiate a clawback or credit for missed SLAs.
If the service provider achieves all of the nines, they get to keep all of the revenue, even if the customer isn’t truly satisfied with the service. If they miss by a little bit, the customer maybe gets 10 percent back. If they miss by a lot, maybe the customer gets 50 percent back, or they get to exit the contract and seek another provider. In any case, the customer would have preferred to have the service provider meet the SLA.
These contractual SLA obligations trickle down as performance, reliability, uptime, and responsiveness targets within the service provider’s organization. And as a consequence, the thought process around reliability has become so defensive that its most popular metrics (mean time between failures, mean time to resolution) are overwhelmingly focused on total avoidance of downtime and the fastest possible incident resolution at all costs.
The SLA doesn’t answer the question: At what point can you stop over-achieving the SLA because the customer is actually satisfied?
You can’t model user happiness with SLAs
There is a sweet spot in delivering cloud services: You want to find the perfect place where you are shipping new features (that attract and delight users) at a quick tempo while maintaining the reliability of service that keeps your existing users happy. SLAs do not allow you to divine this sweet spot.
When you are overly focused on an unrealistic, too-many-nines goal of SLA perfection, there are significant consequences in terms of time, cost, and engineering burnout. It’s expensive to try to be perfect! Users can even suffer from too much reliability, because the features they want arrive more slowly. And that can translate into customer churn.
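To make the cost of each extra nine concrete, here is a minimal sketch of the arithmetic behind an availability target. It assumes a 365-day year and counts every unavailable minute equally; the function name is illustrative, not taken from any particular tool.

```python
# Illustrative arithmetic: yearly downtime that an availability target permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that a target still permits."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Each added nine cuts the allowance by a factor of ten. At five nines, the whole year's allowance is a little over five minutes, which is why chasing extra nines gets expensive so quickly.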
On the other end of the spectrum, when you are shipping new features too fast and your software gets buggy, you may still be hitting your SLA target number of nines, but the 0.001% that you missed (at five nines) may have landed on your most important customer. Whether a service is down is one simple metric, but SLAs tell you nothing about how that outage actually affected your users.
SLAs also don’t hold up well in today’s distributed systems, where it’s much trickier to define user success across complex workflows. Even something as commonplace as a password reset traverses a web application, an API, third-party email providers, the public internet, and the user’s machine. Not only do a number of different systems need to work correctly, but the process is contingent on the user completing several steps. SLAs provide no way of modeling success rates for these types of integrated systems and nuanced workflows. (And password reset is one of the simpler examples.)
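A rough sketch shows why a chain of individually reliable steps can still disappoint users. The per-step success rates below are hypothetical, not measurements, and the calculation assumes the steps fail independently, which real systems often do not.

```python
from math import prod

# Hypothetical per-step success rates for a password reset; not measured values.
step_success_rates = {
    "web application": 0.999,
    "API": 0.999,
    "third-party email provider": 0.995,
    "public internet delivery": 0.99,
    "user completes the steps": 0.95,
}


def end_to_end_success(rates: dict[str, float]) -> float:
    """Multiply step success rates, assuming failures are independent."""
    return prod(rates.values())


overall = end_to_end_success(step_success_rates)
print(f"End-to-end success rate: {overall:.2%}")
```

Under these assumptions the workflow as a whole succeeds less often than its weakest step, even though most of the individual components look healthy on their own dashboards.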
Finer-grained reliability metrics with SLOs
Service level objectives (SLOs) are a math-based discipline that enables developers to model more granular reliability goals for cloud services. They give application owners a way to take the expected behavior of cloud services, and to codify outcomes in a way that can be measured (via service level indicators) and tuned over time.
SLOs feed into error budgets that allow engineering teams a specific amount of leeway in reliability goals. This gives developers and businesses a common ground for seeing the consequences of how reliability degradation is actually affecting user happiness, and more dials to turn to find the sweet spot of development speed vs. reliability.
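As a minimal sketch of how an error budget falls out of an SLO, consider a request-based objective over a fixed window. The target, the traffic numbers, and the "slow down when the budget is spent" policy are all assumptions for illustration; each team sets its own.

```python
def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """Events that may fail in the window before the SLO is breached."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = allowed_bad_events(slo_target, total_events)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - bad_events / budget


# Hypothetical window: 99.9% target, 1,000,000 requests, 250 failures so far.
remaining = budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"Error budget remaining: {remaining:.0%}")

# Assumed policy, not a rule from the article: spend budget on feature velocity
# while some is left, and shift effort to reliability once it runs out.
print("Keep shipping features" if remaining > 0 else "Prioritize reliability work")
```

The remaining budget is the shared number developers and the business can both read: plenty left means there is room to move faster, and an exhausted budget is the signal to slow down.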
Born out of SRE practices at Google, SLOs sit above application performance monitoring and logging tools, and put that telemetry data into the context of customer outcomes. Rather than freaking out over each abnormality detected by the monitoring systems, now you can make informed decisions with shared data in the context of the service health thresholds and objectives that you defined.
SLOs are a vehicle to work through a continuous process that makes reliability the centerpiece of your most critical customer-facing cloud services. You still need logs, metrics, traces, and everything you needed in the past—but SLOs augment those with the perspective of your team’s modeling of expected user experiences with your cloud services.
SLOs fill a critical gap between SLAs (overly specific), monitoring data (overly noisy), and the context that developers, operators, and business silos need to understand when it actually matters that a service’s reliability has dropped.
Getting started with your SLOs
The adoption of a new technology practice within a company doesn’t happen by magic. And it certainly doesn’t happen by talking about it in meetings. Some organizations have taken more governance-based approaches to encouraging SLO adoption, while others have driven adoption through socio-technical approaches.
You might be wondering where to start. Here’s an outline of how you could approach your first SLO-setting discussion with your development and operations teams:
- Share a user story. Suppose you have an e-commerce user story that says the user expects to be able to add things to their cart and immediately check out. Your user has a certain latency threshold for checkout, and when checkout takes longer than that, your user gets upset and abandons their cart.
- Phrase this customer experience issue more precisely as an SLO. What proportion of users should be able to add items to their cart and check out within x amount of time?
- Identify and quantify the risks. What happens if a customer isn’t able to check out within that time frame? What does it cost when the SLO is missed?
- Brainstorm the risk categories together. What are the things that can go wrong that would cause you not to be able to meet the SLO? Your team will respond with a wide variety of risks, likely including “Our underlying infrastructure might go down,” “Maybe we pushed a buggy update,” “We didn’t anticipate so much demand all at once,” etc.
- Ask “How could we mitigate these risks?” When considering the resources and costs required to mitigate the risk versus the cost of failure, what do you leave to chance and what do you try to address up front? Use this information to determine the service level indicators (SLIs) you will use to measure and track your ability to meet the SLO.
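The checkout example in the steps above can be sketched as a simple SLI calculation. The 2-second threshold and 99% target are hypothetical stand-ins for the "x amount of time" and proportion your team would agree on, and this sketch counts checkout attempts rather than distinct users.

```python
from dataclasses import dataclass

# Hypothetical values: the threshold ("x") and the target are for your team to choose.
LATENCY_THRESHOLD_SECONDS = 2.0
SLO_TARGET = 0.99  # 99% of checkouts should succeed within the threshold


@dataclass
class Checkout:
    latency_seconds: float
    succeeded: bool


def checkout_sli(checkouts: list[Checkout]) -> float:
    """Fraction of checkouts that succeeded within the latency threshold."""
    if not checkouts:
        raise ValueError("no checkout events in the window")
    good = sum(
        1
        for c in checkouts
        if c.succeeded and c.latency_seconds <= LATENCY_THRESHOLD_SECONDS
    )
    return good / len(checkouts)


sample = [
    Checkout(0.8, True),
    Checkout(1.9, True),
    Checkout(3.5, True),   # too slow: counts against the SLI
    Checkout(0.6, False),  # failed outright: counts against the SLI
]
sli = checkout_sli(sample)
print(f"SLI = {sli:.2%}, SLO met: {sli >= SLO_TARGET}")
```

Treating a slow checkout the same as a failed one is a deliberate choice here: from the shopper's point of view, both end in an abandoned cart.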
Someday I hope to see service providers touting reasonable SLOs on their billboards.
Alex Nauda is CTO of Nobl9. His career began as a database architect, all the way back in the days of magnetic storage and backplanes. His career-long experiences juggling product development pressures with the demands of service delivery to millions of users made him an instant fan of service level objectives and their potential to bring math discipline and quantitative metrics to site reliability. Alex lives in Boston where he grows vegetables under LEDs and teaches juggling at a non-profit community circus school.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.
Copyright © 2021 IDG Communications, Inc.