Site Reliability Engineering (book)

# SRE

The first book on [[Site Reliability Engineering]] by Google.

## The

> By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.
>
> To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, **Google places a 50% cap on the aggregate “ops” work for all SREs**—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are _automatic_, not just _automated_. In practice, scale and new features keep SREs on their toes.

## The error budget

> The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to **spend the error budget getting maximum feature velocity**. This change makes all the difference. An outage is no longer a “bad” thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
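
To make the idea of "spending" an error budget concrete, here is a minimal Python sketch. It assumes the budget is defined as one minus an availability target (SLO) and is measured in failed requests over some window. The class name `ErrorBudget`, its fields, and the release rule in `can_release` are hypothetical choices for this note, not something taken from the quoted passage.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Illustrative request-based error budget (all names are assumptions)."""

    slo_target: float    # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int  # requests served during the measurement window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - SLO) * traffic.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self, failed_requests: int) -> float:
        # Budget still unspent after the failures observed so far.
        return self.allowed_failures - failed_requests

    def can_release(self, failed_requests: int) -> bool:
        # Assumed policy: keep shipping features while budget remains.
        return self.remaining(failed_requests) > 0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000)
print(budget.can_release(failed_requests=400))    # budget left, keep shipping
print(budget.can_release(failed_requests=1_500))  # budget spent, slow down
```

With a 99.9% target and one million requests, the budget is roughly 1,000 failed requests. An outage that consumes 400 of them still leaves room for risky launches, which is the sense in which an outage becomes something both teams manage rather than fear.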