The term Site Reliability Engineering (SRE) is frequently used in connection with collaborative models that focus on cooperation and greater agility in IT. But what is this approach to IT systems and operations all about?



As is so often the case with modern process models and methodologies, the origins of Site Reliability Engineering can be traced to one of the largest American tech corporations: Google, which began work in this field in 2003. Google has always recognised that the core of its business, and thus the foundation of its success, depends on the reliability of its internal IT infrastructure.


For example, Google has research and development teams investigating methods and process models that can ensure the existing infrastructure withstands the rapid growth the company is experiencing. Although a strict separation between software development and IT operations has existed at Google since 2003, the following key questions were asked at the time in relation to IT operations: to what extent should development and operations teams work together, and what processes are needed to ensure that any future collaboration is successful?


The above questions and the resulting answers gave rise to Site Reliability Engineering - the model was applied to the operation of IT systems at Google from this point forward. But what is SRE in reality?



Site Reliability Engineering incorporates various methods from software development and DevOps. First and foremost, SRE approaches IT operations as a task that can be solved through software engineering; specifically, the deployment and management of systems through the use of code. In other words: infrastructure, workflows and manual activities are automated through software, which in turn increases the overall reliability of the system.


In addition to the benefits of automation, systems monitoring plays a central role in Site Reliability Engineering. Key system metrics are monitored at all times and visualised via dashboards. The data collected is not only reviewed reactively but is analysed continuously, so that system errors and weaknesses can be identified and remedied proactively.


The ideal Site Reliability Engineer should have a background in software development and significant IT operations experience as well as a proven aptitude for system data analysis. This is a multi-faceted skill set that allows them to focus on automating operations. They also possess the capacity to consider how best to plan and design the infrastructure required to implement the concept. They often monitor systems during operation and analyse performance in real-time; they are focused on areas which they can enhance.


As a result, Site Reliability Engineers divide their time between operational tasks and developing software and optimisation tools. In the event that they encounter an error, they work to rectify the issue and usually lead detailed follow-up discussions with those involved to find out what worked well and what needs to be improved. In addition to the above, Site Reliability Engineers acquire important, undocumented knowledge. Thanks to their insights into development and support, this knowledge is no longer hidden and isolated to specific departments but can be used across the board by all.


In addition to the core technology-oriented elements of Site Reliability Engineering described above, the process model is also based on a number of basic methodological principles, which are described in more detail below.


Site Reliability Engineering relies on a number of established methodologies that are widely used in software development. SRE also shares methodologies previously used in DevOps. Site Reliability Engineering can be distinguished from DevOps in two key respects: firstly, SRE places the reliability of the system above all other priorities, which shapes how operations are run, and secondly, specifications must be strictly observed. What DevOps and SRE have in common is a primary focus on continuous monitoring and the continuous automation of processes and workflows.


A fundamental methodology that SRE follows is based on so-called positive feedback loops. As part of the process, goals are defined and metrics are put in place to measure performance against reference criteria. The positive feedback loop process also covers the management of errors. The principle of positive feedback loops is described below in more detail.



Site Reliability Engineering defines the acceptable reliability of each system against defined reference standards. The "Service Level Objective" (SLO) indicates the reliability required of a given system in order to meet internal specifications or client-specific requirements. This could mean, for example, that a specific system must maintain a performance level of 90 percent over a defined time frame.


In order to determine whether this goal, and thus the predefined reliability level, is being achieved by the system, SRE uses "Service Level Indicators" (SLI). SLIs are measurement indicators that provide information on, for example, how many requests have been successful and how many of these requests have been met within the specified time frame. The SLIs then allow a qualified statement to be made as to whether the SLO is being achieved or, if not, at which point there is need for further optimisation.
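The relationship between SLIs and an SLO can be illustrated with a minimal Python sketch. The `Request` record, the function names, the 300 ms threshold and the sample data are all hypothetical and chosen only for illustration; a real system would read these values from its monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool    # did the request complete without an error?
    latency_ms: float  # time taken to answer the request


def availability_sli(requests: list[Request]) -> float:
    """Share of all requests that succeeded."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of successful requests answered within the time threshold."""
    successful = [r for r in requests if r.succeeded]
    if not successful:
        return 0.0  # assumption: nothing succeeded, so nothing was on time
    on_time = sum(r.latency_ms <= threshold_ms for r in successful)
    return on_time / len(successful)


def meets_slo(sli: float, slo: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo


# Hypothetical sample of four requests.
requests = [
    Request(True, 120.0),
    Request(True, 480.0),
    Request(False, 30.0),
    Request(True, 95.0),
]

availability = availability_sli(requests)
print(f"availability SLI: {availability:.2%}")
print(f"latency SLI: {latency_sli(requests, 300.0):.2%}")
print(f"90 percent SLO met: {meets_slo(availability, 0.90)}")
```

In this sample, three of four requests succeed, so the availability SLI falls short of a 90 percent SLO and the system would be flagged for optimisation.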



Updates and new releases deliver new features, improve existing functions and patch previous security vulnerabilities. A basic problem inherent in software development is that infinite time cannot be allocated to fully test every conceivable scenario before each release. Apart from this constraint, the question that arises in this context is whether perfect system reliability is reasonable or even attainable.


Site Reliability Engineering works on the basic principle that there is an error budget for every system. The budget is calculated as a theoretical reliability of 100 percent minus the target reliability defined by the SLO, and it refers to a defined period, e.g. one month. So, to use the example from earlier, the error budget would be 10 percent (100 percent minus 90 percent). As a result, it would be acceptable for the corresponding functionality to have an error rate of up to 10 percent.
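The arithmetic can be written out as a short Python sketch. The function names and the request counts are hypothetical; the sketch assumes the budget is counted in failed requests over the period, which is only one of several possible ways to measure it.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction: 100 percent minus the target reliability."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the period's error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is used up,
    and a negative value means it has been overspent.
    """
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing was spent
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Example from the text: a 90 percent SLO leaves a 10 percent error budget.
print(f"error budget: {error_budget(0.90):.0%}")

# Hypothetical month: 10,000 requests, of which 400 failed.
print(f"budget remaining: {budget_remaining(0.90, 10_000, 400):.0%}")
```

With 10,000 requests and a 10 percent budget, about 1,000 failures are tolerable in the period, so 400 failures leave roughly 60 percent of the budget unspent.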


With this methodology, as long as the SLO is being met, there is no need to spend time and budget chasing a theoretical ideal of perfect reliability. Only once the error budget has been used up is further action required; engineers then pursue improvements, for example by holding back a release in order to improve reliability.
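This decision rule could be sketched as a simple release gate. The function name `may_release` and the figures are hypothetical, and the rule shown (hold releases whenever observed reliability drops below the SLO) is one possible policy rather than a prescribed one.

```python
def may_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Return True while observed reliability still satisfies the SLO.

    Assumption: reliability is measured as the share of successful
    requests in the current period.
    """
    if total_requests == 0:
        return True  # assumption: no traffic, so no budget has been spent
    observed = (total_requests - failed_requests) / total_requests
    return observed >= slo


# Hypothetical period with a 90 percent SLO and 10,000 requests.
for failed in (400, 1_200):
    decision = "ship the release" if may_release(0.90, 10_000, failed) else "hold the release"
    print(f"{failed} failed requests: {decision}")
```

The gate keeps the discussion about a release factual: it is held back only when the measured data shows that the budget is exhausted, not because of a general wish for more testing.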



In the event that a serious system error occurs or a failure causes significant consequences, it is generally advisable to conduct an in-depth analysis of the causes as part of the follow-up. This is the only way to learn from such errors and ideally avoid similar events in the future.


One of the special features of Site Reliability Engineering is that such failure analysis takes place within a positive and collaborative framework. Instead of attributing fault to specific individuals or teams, SRE focuses on the problem and the root causes. The question for SRE is not "who is at fault for this error?" but instead "what circumstances led to this error occurring?"


This approach helps to identify underlying reasons and contributing factors that may have led to the error including insufficient information or poor operational processes. A core concept of SRE is that no individual is held accountable for an error. On the contrary, the aim is to work together so that the error can be avoided in the future by implementing appropriate measures in the present. In this way, the entire organisation can grow through each error and improve the reliability of systems in a sustainable manner.



Site Reliability Engineering is founded on a set of technical frameworks and clear principles. But what are the advantages of SRE compared to other IT operations models? Six of the most compelling arguments in favour of SRE are outlined below:

  • Improved reporting and monitoring: SRE creates greater transparency by continuously monitoring important parameters such as productivity, service status and error rate. Metrics are used to determine actionable elements (for example, average downtime), which are then specifically targeted for improvement by engineers.
  • Proactive troubleshooting: many IT organisations focus on deploying new features. However, ever-faster development and deployment carries risk. Each version has the potential to introduce new bugs and vulnerabilities. SRE actively mitigates such risks by searching for errors and problems and fixing them before they have an opportunity to impact the user.
  • Added value: a reliable IT system means fewer development resources are invested in troubleshooting issues. Consequently, development teams have more time to develop new features. SRE identifies potential problems before deployment of the solution.
  • Cultural shift: Site Reliability Engineering has helped to raise awareness of system integrity amongst IT organisations. The continual search for potential optimisation has a positive impact on all teams involved and promotes cross-functional collaboration. The shared sense of responsibility that SRE creates actively breaks down the once infamous silo mentality.
  • Improved automation: Site Reliability Engineers continually strive to automate workflows wherever possible. They also apply this approach to their own work. By implementing forward-thinking resource channels they are able to continuously optimise their work processes. This gradually reduces susceptibility to errors caused by the "human" element.
  • Satisfied customers: Unlike other IT operations models, SRE focuses on improving the customer experience; here, "customer" can refer to both the internal and the external customer or user of a given system. By using SLOs and SLIs, SRE is able to clearly define system reliability goals and improve customer satisfaction.



Yet another method? Site Reliability Engineering is more than a method. The approach enriches the existing modern IT culture and pragmatically bridges the gap between development and IT operations. Where other approaches are theoretical or provide a framework for action, SRE is a tangible, solution-oriented approach.


Thanks to the continuous monitoring and system performance analysis at the core of SRE, problems are identified earlier, optimisation is advanced and reliability is prioritised.


The important question remains how best to adapt the approach to an individual company and IT organisation. How the principles of Site Reliability Engineering are applied varies widely, from start-ups to tech giants such as Google and Microsoft. In terms of roll-out, it is best to focus on individual SRE elements first (for example, start by implementing a monitoring solution).


It must be noted that Site Reliability Engineering has the potential to significantly enhance IT organisations, as the approach ties them much more closely to the company's value creation.
