- March 31, 2021
Site Reliability Engineering (SRE) is an engineering discipline with the primary goal of improving the reliability of software services. Initially developed by Google, SRE is now an industry standard, practiced in companies of all sizes. The increasing adoption of SRE best practices shows the effectiveness of this methodology in ensuring consistent uptime and availability, improving risk management, and enabling early detection of issues.
While SRE is often leveraged by DevOps teams, the terms are neither identical nor interchangeable. This article provides an in-depth overview of SRE as a methodology and SRE practices, as well as a brief SRE and DevOps comparison.
SRE originated at Google, which later documented the approach in a publicly available SRE guide. The guide outlines how Google applies an SRE strategy and suggests a methodological blueprint for other organizations. In practice, however, SRE strategies and practices vary widely between organizations, which is why the site reliability engineer role may differ according to company size and industry. Facebook developed a similar discipline, called Production Engineering, in parallel. Today, the two disciplines seem to have merged into a single approach.
Historically, the goal of ensuring software reliability was met in different ways, primarily under the responsibility of Software Architects, DevOps, and old-school Operations engineers. But there are significant advantages dedicated site reliability engineers can bring to the table:
Site reliability engineering is often misperceived as an evolution of DevOps. In reality, it is a practical implementation of DevOps principles. Just as continuous integration and continuous delivery are applications of DevOps principles to software release, SRE is an application of these same principles to software reliability.
According to Google’s approach, you can use SRE to better adopt DevOps principles in the organization and measure your implementation’s success.
To better understand how to combine the two, consider the following principles:
When implementing SRE, it may take you some time to refine your strategy and customize practices to meet your operational needs. To help speed up this process, consider adopting the following SRE principles and best practices.
Teams should evaluate every change to understand how it may impact other systems or processes. This means identifying what depends on the changed component and how those dependencies may chain throughout your operations.
Additionally, teams should evaluate both short-term and long-term impacts. If a change causes short-term performance losses but long-term gains, your engineers need to weigh that trade-off accordingly.
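One way to picture how dependencies chain is to treat services as a graph and walk outward from the component being changed. The sketch below is illustrative only: the service names and the `DEPENDENTS` map are hypothetical, and a real inventory would likely come from a service catalog or similar source.

```python
from collections import deque

# Hypothetical dependency map: each key is a service, and its list names
# the services that depend on it directly. All names are illustrative.
DEPENDENTS = {
    "database": ["auth", "orders"],
    "auth": ["api-gateway"],
    "orders": ["api-gateway", "billing"],
    "api-gateway": [],
    "billing": [],
}


def impacted_services(changed, dependents):
    """Return every service that directly or indirectly depends on
    `changed`, in breadth-first order (closest dependents first)."""
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                impacted.append(service)
                queue.append(service)
    return impacted


print(impacted_services("database", DEPENDENTS))
# → ['auth', 'orders', 'api-gateway', 'billing']
```

Breadth-first order lists the closest dependents first, which can help a team decide where to look first when reviewing a change.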
Successful SRE implementations depend on engineers and architects with deep and varied skills. Additionally, because environments and operations are dynamic, you need engineers who are constantly expanding their skill sets and expertise.
This requirement means encouraging training and professional development. It also means that you may want to consider candidates with less traditional backgrounds so that your teams include hard-to-find expertise.
Implement automation early on and build from a stance that supports future automation. For example, develop basic infrastructure templates from the start that can be modified as needed. Your goal should be to reduce duplicated or redundant work as much as possible. While this requires extra work up front, it can save significant effort and time later on.
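As a rough illustration of a reusable template, the sketch below keeps one base definition and derives environment-specific variants by merging in overrides. The `BASE_TEMPLATE` keys and the `render_template` helper are hypothetical and not tied to any particular provisioning tool.

```python
import copy

# Hypothetical base template; keys and values are illustrative only.
BASE_TEMPLATE = {
    "instance_type": "small",
    "replicas": 2,
    "monitoring": {"enabled": True, "alert_channel": "oncall"},
}


def render_template(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively.
    Neither input is modified."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            result[key] = render_template(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


production = render_template(
    BASE_TEMPLATE,
    {"replicas": 6, "monitoring": {"alert_channel": "pager"}},
)
print(production)
```

Because each variant states only what differs from the base, a change to the shared defaults needs to be made in one place.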
Embrace postmortems as opportunities for learning and insight. When your teams can discuss and review incidents without blame, they are better able to assess issues objectively and to recognize where knowledge or skills are lacking. In turn, this helps teams pinpoint the gaps that need to be addressed to improve overall performance and quality.
To ensure high-quality service, you need to understand what your users want and need. One way to do this is to define SLOs from the perspective of the end user. For example, focus on request latency as measured on the client side rather than the server side. By focusing on the client's perspective, you reduce the chance that your improvement efforts will go unappreciated or unseen.
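A minimal sketch of checking a latency SLO against client-side measurements might look like the following. The sample values, the 300 ms threshold, and the 95% target are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests that completed within `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms, target):
    """True when the measured fraction of fast requests reaches `target`."""
    return latency_sli(latencies_ms, threshold_ms) >= target


# Hypothetical latencies in milliseconds, as observed by clients.
client_samples = [120, 180, 250, 90, 640, 210, 150, 300, 110, 95]

sli = latency_sli(client_samples, threshold_ms=300)
print(sli)                                                       # → 0.9
print(meets_slo(client_samples, threshold_ms=300, target=0.95))  # → False
```

Feeding the same functions server-side timings for the same requests would likely produce a higher figure, since those timings exclude network and rendering time, which is why the client-side view is the one to set targets against.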