SRE is what happens when you ask a software engineer to design an operations team. It is an implementation of the [[DevOps]] [[Paradigm|paradigm]].[^1]

The term Site Reliability Engineering was coined by [[Benjamin Treynor]].

Before SRE, it was hard to find voices in the operations landscape. The only book I could find on the subject was a 2012 title called [Effective Monitoring and Alerting](

In 2016, [Google published their seminal work on SRE for free online]( It made the rounds on Hacker News and Reddit. It was a revelation. The book was called [Site Reliability Engineering]( and it laid the groundwork, finally giving us some philosophical underpinnings for the challenges of keeping systems running.

> Hope is not a strategy

Google on SRE:

> SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.

Later on I found [Practical Monitoring]( which was published in 2017. I attended the [[Monitorama]] conference in Amsterdam in 2018.

Things have certainly matured. These days there is a wealth of knowledge available.

## Links

- [Interview with a Site Reliability Engineer]( (HashiCorp employee).
- [SRE at Google](

## Books

See [[Going on a Safari#Operations and monitoring]].

[[Seeking SRE (book)]] is very similar to [[Tribe of Hackers Blue Team (book)]]. It's a book full of interviews.

## Conferences

- [SREcon]( - [[SREcon]]
- [Monitorama]( - [[Monitorama]]
- [SLOconf]( - [[SLOconf]]
- [SLOconf YouTube playlist](

[^1]: [LISA18 - SRE (and DevOps) at a Startup - YouTube](