- March 31, 2021
You’ve heard of playbooks. But what about playbooks-as-code? How can playbooks be managed as code, and what does that mean for SREs and incident response teams?
The short answer is that by automating the processes that are defined in conventional playbooks, playbooks-as-code take incident response and reliability engineering to the next level.
For the longer answer, keep reading. This article offers an overview of playbooks-as-code, including how they work and the benefits they offer.
In incident response, a playbook is a set of rules or procedures that define how a team should operate when reacting to a given type of incident or event.
The purpose of playbooks is to provide consistency in the incident response process. Playbooks help ensure that all of the engineers on your team will take the same approach to resolving a given type of problem.
In addition, playbooks help remove some of the guesswork and manual troubleshooting that would otherwise occur during the incident response process. They bring greater foresight and planning to incident response by laying out the most likely root causes of various types of events and explaining exactly how to resolve them. In this way, they enable faster and more reliable resolutions.
Conventional playbooks come in the form of documents, or (at best) a set of procedures that is integrated into an incident response platform. In other words, they’re basically just lists.
Having this type of playbook on hand is better than nothing when you’re troubleshooting a problem. But traditional playbooks are subject to several major drawbacks:
Manual operations: There is no way to trigger or advance a traditional playbook automatically. Engineers have to open it and follow the steps one by one.
Lack of collaboration: Most playbooks are designed to help IT engineers or SREs solve specific problems. They are not designed to bring in other stakeholders (like developers) who may need to collaborate on a given issue. If that collaboration becomes necessary, it must be coordinated manually.
Difficulty of interpretation: Playbooks are designed to be easy to follow, but in practice they often aren’t. They can be hard to interpret, and they sometimes assume background knowledge or domain expertise that on-call engineers may lack.
Playbooks-as-code solve these issues by using software to define and enforce the processes within incident response workflows.
In other words, instead of just using words to describe what engineers should do, playbooks-as-code use code and an enforcement engine to automate the process as much as possible. They can trigger software tools to perform certain actions on their own, instead of waiting for engineers to do so. They can also collect data and monitor the status of the incident response process to ensure that the playbook is being followed properly.
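To make the idea concrete, here is a minimal sketch in Python of a playbook whose steps are functions and whose "enforcement engine" is a simple loop. Every name in it (`Step`, `Playbook`, `collect_pod_status`, `restart_pod`) is hypothetical and invented for illustration; it is not the API of any real incident response platform, and the actions only write to an in-memory dictionary rather than calling real tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not taken from any real SRE platform.
Context = Dict[str, object]


@dataclass
class Step:
    name: str
    action: Callable[[Context], None]  # automated action that reads/writes shared context


@dataclass
class Playbook:
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, context: Context) -> bool:
        """Run steps in order, record each outcome, and stop at the first failure."""
        for step in self.steps:
            try:
                step.action(context)
            except Exception as exc:
                self.log.append(f"{step.name}: failed ({exc})")
                return False
            self.log.append(f"{step.name}: done")
        return True


def collect_pod_status(context: Context) -> None:
    # Stand-in for a real monitoring query (assumption: the pod is crash-looping).
    context["pod_status"] = "CrashLoopBackOff"


def restart_pod(context: Context) -> None:
    # Stand-in for a real remediation call; only acts on the diagnosed condition.
    if context.get("pod_status") == "CrashLoopBackOff":
        context["restarted"] = True


playbook = Playbook(
    name="pod-crash",
    steps=[
        Step("collect_pod_status", collect_pod_status),
        Step("restart_pod", restart_pod),
    ],
)
context: Context = {}
completed = playbook.run(context)
print(completed, playbook.log)
```

The log is what lets the engine, or a human reviewer, check that the playbook was actually followed: each step leaves a record, and a failed step halts the run instead of being silently skipped.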
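The benefits of automating playbooks through code are numerous, and they follow directly from the drawbacks above: steps can be triggered and advanced automatically, tools can act without waiting for an engineer, and the process itself can be monitored to confirm that the playbook is being followed.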
The idea of automating incident response via playbooks-as-code may sound intimidating. Which coding framework do you use? What does a playbook-as-code file actually look like?
Some SRE platform providers offer a repository of playbooks-as-code which you can view for reference. They might even offer specific playbooks-as-code for automating workflows like setting up incident war rooms or troubleshooting Kubernetes pod issues.
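As a rough idea of what such a file might contain, the sketch below expresses a war-room playbook as plain Python data, with a small validator and a trigger check. The schema (`name`, `trigger`, `steps`), the action names, and the event fields are all assumptions made up for this example. Real platforms define their own formats, so treat this as an illustration of the shape rather than a template.

```python
from typing import Any, Dict, List

# Hypothetical playbook definition; every field and action name is an assumption.
WAR_ROOM_PLAYBOOK: Dict[str, Any] = {
    "name": "open-incident-war-room",
    "trigger": {"event": "incident.created", "severities": ["critical", "high"]},
    "steps": [
        {"action": "create_chat_channel", "params": {"prefix": "incident-"}},
        {"action": "invite_on_call", "params": {"teams": ["sre", "backend-dev"]}},
        {"action": "post_summary", "params": {}},
    ],
}

# Actions this imaginary engine knows how to execute.
KNOWN_ACTIONS = {"create_chat_channel", "invite_on_call", "post_summary"}


def validate(playbook: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the definition (empty if none)."""
    errors: List[str] = []
    for key in ("name", "trigger", "steps"):
        if key not in playbook:
            errors.append(f"missing field: {key}")
    for index, step in enumerate(playbook.get("steps", [])):
        if step.get("action") not in KNOWN_ACTIONS:
            errors.append(f"step {index}: unknown action {step.get('action')!r}")
    return errors


def should_trigger(playbook: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Decide whether an incoming event should start this playbook."""
    trigger = playbook["trigger"]
    return (
        event.get("type") == trigger["event"]
        and event.get("severity") in trigger["severities"]
    )


print(validate(WAR_ROOM_PLAYBOOK))
print(should_trigger(WAR_ROOM_PLAYBOOK, {"type": "incident.created", "severity": "critical"}))
```

Because the definition is ordinary data in a file, it can be version-controlled, reviewed, and validated before an incident ever happens, which is much of the appeal of managing playbooks as code.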
Ordinary playbooks can provide some level of consistency to incident response operations, but they fall short of delivering the fast and seamless experience that teams need to minimize MTTR and get the highest return on the resources the business invests in incident response. By automating incident response, playbooks-as-code address these shortcomings, enabling teams to work faster and more collaboratively, with more reliable results.