What is an error budget—and why does it matter? | Atlassian (2024)

Every development, operations, and IT team knows that sometimes incidents happen.

Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook, all of which have lost tens of millions to incidents in the past five years.

This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that’s a promise no company can keep.

It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month.

When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you consider that extra headroom (roughly half a percentage point in the examples above) an error budget: time that your team can use to take risks.

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences.

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is 52 minutes and 35 seconds per year.

If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds per year. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 45 minutes, and 57 seconds per year.
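The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `yearly_error_budget` is hypothetical, and the year length of 365.2425 days is an assumption. Depending on the year length and rounding you choose, results can differ from the figures quoted here by up to about ten seconds.

```python
from datetime import timedelta

# Assumption: an average year of 365.2425 days. A 365-day year gives
# slightly smaller budgets.
DAYS_PER_YEAR = 365.2425


def yearly_error_budget(uptime_percent: float) -> timedelta:
    """Return the downtime allowed per year for a given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    allowed_fraction = (100 - uptime_percent) / 100
    seconds_per_year = DAYS_PER_YEAR * 24 * 60 * 60
    return timedelta(seconds=allowed_fraction * seconds_per_year)


# Print the budget for the three targets discussed above.
for target in (99.99, 99.95, 99.9):
    print(f"{target}% uptime -> {yearly_error_budget(target)} of downtime per year")
```

Returning a `timedelta` keeps the fractional seconds visible, so the choice between rounding and truncating is left to whoever reports the number.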

Why do tech teams need error budgets?

At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?

The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks.

As we explain in our SRE article,

“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track.

To use this method effectively, you’ll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1%, 0.5%, or 0.1% of allowed downtime actually amounts to. Common targets include:

SLA target       Yearly allowed downtime             Monthly allowed downtime
99.99% uptime    52 minutes, 35 seconds              4 minutes, 23 seconds
99.95% uptime    4 hours, 22 minutes, 48 seconds     21 minutes, 54 seconds
99.9% uptime     8 hours, 45 minutes, 57 seconds     43 minutes, 50 seconds
99.5% uptime     43 hours, 49 minutes, 45 seconds    3 hours, 39 minutes
99% uptime       87 hours, 39 minutes                7 hours, 18 minutes
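The monthly check described in this section (release while availability is above target, halt when it is not) could be tracked with something like the sketch below. The class `MonthlyBudget` and its methods are hypothetical names introduced for illustration, and the month length of 365.2425 / 12 days is an assumption; a real setup would pull downtime from your monitoring system rather than record it by hand.

```python
from dataclasses import dataclass

# Assumption: an average month of 365.2425 / 12 days.
MINUTES_PER_MONTH = 365.2425 / 12 * 24 * 60


@dataclass
class MonthlyBudget:
    """Hypothetical helper that tracks downtime against an uptime target."""

    uptime_target_percent: float
    downtime_minutes_so_far: float = 0.0

    @property
    def allowed_minutes(self) -> float:
        # Total downtime the target permits in one month.
        return (100 - self.uptime_target_percent) / 100 * MINUTES_PER_MONTH

    @property
    def remaining_minutes(self) -> float:
        # Negative once the budget has been overspent.
        return self.allowed_minutes - self.downtime_minutes_so_far

    def record_outage(self, minutes: float) -> None:
        if minutes < 0:
            raise ValueError("outage duration cannot be negative")
        self.downtime_minutes_so_far += minutes

    def releases_allowed(self) -> bool:
        # Releases continue only while some budget is left.
        return self.remaining_minutes > 0


budget = MonthlyBudget(uptime_target_percent=99.9)
budget.record_outage(30)          # a 30-minute incident
print(budget.releases_allowed())  # budget left, releases continue
budget.record_outage(20)          # a second incident overspends the budget
print(budget.releases_allowed())  # releases halt until the next window
```

Whether releases freeze at exactly zero remaining budget or at some earlier warning threshold is a policy choice for the team; the strict comparison here is only one option.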

Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
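A request-based budget counts failed requests instead of minutes of downtime: if the SLO says a given percentage of requests must succeed, the remainder is the number of requests that may fail in the measurement window. The sketch below illustrates that idea under stated assumptions. The function names are hypothetical, and the target is passed as a string so that `Fraction` can parse it exactly and avoid floating-point rounding.

```python
import math
from fractions import Fraction


def allowed_failures(slo_percent: str, total_requests: int) -> int:
    """Requests that may fail in the window without missing the SLO.

    slo_percent is a decimal string such as "99.9" so it is parsed exactly.
    """
    allowed_fraction = (100 - Fraction(slo_percent)) / 100
    return math.floor(allowed_fraction * total_requests)


def budget_remaining(slo_percent: str, total_requests: int, failed_requests: int) -> int:
    """Failures still available; negative means the budget is overspent."""
    return allowed_failures(slo_percent, total_requests) - failed_requests


# With a 99.9% success target and one million requests in the window,
# up to 1,000 requests may fail.
print(allowed_failures("99.9", 1_000_000))
print(budget_remaining("99.9", 1_000_000, failed_requests=250))
```

What counts as a failed request (for example, server errors only, or also slow responses) is something your SLO has to define; the code simply assumes that count is available.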

Article information

Author: Reed Wilderman
