[[🍀 Home]] ➤ [[🗺️ Maps of Content]]
---
- [Site Reliability Engineering at Google • Christof Leng • GOTO 2018](https://www.youtube.com/watch?v=d2wn_E1jxn4)
## Start
- [[Building a Site Reliability Engineering Team]]
## News
- [SRE WEEKLY – scalability, availability, incident response, automation](https://sreweekly.com/)
- [SRE subreddit](https://old.reddit.com/r/sre)
- [[Articles about Site Reliability Engineering]]
## Learning resources
- [[SREcon]], [[ObservabilityCON]], [[Monitorama]], [[o11yfest]], [[PromCon]], [[KubeCon]] and [[SLOconf]] (see [observability.events](http://observability.events/))
- [[Site Reliability Engineering (highlights)]]
- [Our favorite (top?) SRE talks](https://engineering.zenduty.com/blog/2020/07/02/our-favorite-sre-talks)
- [DevOps Vs. SRE: Competing Standards or Friends? (Cloud Next '19)](https://www.youtube.com/watch?v=0UyrVqBoCAU)
- [Life of an SRE at Google](https://www.youtube.com/watch?v=7Oe8mYPBZmw)
- [Incident Management at GitLab](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#corrective-actions) is very good
- Same with [GitLab.com / runbooks · GitLab](https://gitlab.com/gitlab-com/runbooks)
- [Source of the handbook](https://gitlab.com/gitlab-com/www-gitlab-com/-/blob/master/sites/handbook/source/handbook/engineering/infrastructure/incident-management/index.html.md)
The diagram from GitLab's Incident Management Handbook:
``` mermaid
graph TD
A(Incident is declared) --> |initial severity assigned - EOC and IMOC are assigned| B(Incident::Active)
B --> |Temporary mitigation is in place, or an alert silence is added| C(Incident::Mitigated)
B --> D
C --> D(Incident::Resolved)
D --> |severity is re-assessed| D
D -.-> |for review-requested incidents| E(Incident::Review-Completed)
```
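The transitions in the diagram can be sketched as a small state machine. This is only an illustration derived from the arrows above, not GitLab's actual tooling; `IncidentState`, `ALLOWED`, `can_transition`, and `advance` are hypothetical names.

```python
from enum import Enum


class IncidentState(Enum):
    # Labels are copied from the nodes of the diagram above.
    DECLARED = "Incident is declared"
    ACTIVE = "Incident::Active"
    MITIGATED = "Incident::Mitigated"
    RESOLVED = "Incident::Resolved"
    REVIEW_COMPLETED = "Incident::Review-Completed"


# One entry per arrow in the diagram (hypothetical encoding).
ALLOWED = {
    IncidentState.DECLARED: {IncidentState.ACTIVE},
    IncidentState.ACTIVE: {IncidentState.MITIGATED, IncidentState.RESOLVED},
    IncidentState.MITIGATED: {IncidentState.RESOLVED},
    # Resolved loops back to itself when severity is re-assessed.
    IncidentState.RESOLVED: {IncidentState.RESOLVED, IncidentState.REVIEW_COMPLETED},
    IncidentState.REVIEW_COMPLETED: set(),
}


def can_transition(current: IncidentState, target: IncidentState) -> bool:
    """Return True if the diagram has an arrow from current to target."""
    return target in ALLOWED[current]


def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Move to target, or raise if the diagram does not allow it."""
    if not can_transition(current, target):
        raise ValueError(f"{current.value} -> {target.value} is not in the diagram")
    return target
```

An incident can skip the mitigated state (`ACTIVE` to `RESOLVED`), but it cannot go from `MITIGATED` back to `ACTIVE` in this encoding, because the diagram has no such arrow.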
### Talks
- [Getting Started with SRE](https://www.youtube.com/watch?v=c-w_GYvi0eA)
## Concepts
- [[Service level indicator]]
- [Service Level Objectives (chapter)](https://sre.google/sre-book/service-level-objectives/)
- Based on _data_ (which [[Prometheus]] is good for)
- [[Service level objective]] relies on [[Service level indicator]]
- [Service Level Objectives (chapter)](https://sre.google/sre-book/service-level-objectives/)
- [Implementing SLOs (workbook chapter)](https://sre.google/workbook/implementing-slos/)
- [SLO Engineering Case Studies (workbook chapter)](https://sre.google/workbook/slo-engineering-case-studies/)
- [Implementing Service Level Objectives](https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/)
- [[OpenSLO]] and [[Sloth]]
- [[SLOconf]]
- [Nobl9 Service Level Objectives Platform: Reliability At Your Service](https://nobl9.com/)
- [[Pyrra]]
- [Promtools](https://promtools.dev/alerts/errors) based on [slo-libsonnet](https://github.com/metalmatze/slo-libsonnet)
- [[Error budget]]
- [[Burn rate]] of the error budget
- [Alerting on SLOs (chapter)](https://sre.google/workbook/alerting-on-slos/)
- [[Tracing]]
- Last priority
- [[Toil]]
- [Eliminating toil (chapter)](https://sre.google/sre-book/eliminating-toil/)
- [Eliminating toil (workbook chapter)](https://sre.google/workbook/eliminating-toil/)
- [[Monitoring]]
- [Monitoring (workbook chapter)](https://sre.google/workbook/monitoring/)
- [[The Four Golden Signals]]
- [[Dashboard]]
- [[Alert]]
- [[White-box monitoring]]
- [[Black-box monitoring]]
- [[Root cause]]
- [Practical Alerting from Time-Series Data (chapter)](https://sre.google/sre-book/practical-alerting/)
- [[Hierarchy of Production Needs]]
- [[On-call]]
- [Being on-call (chapter)](https://sre.google/sre-book/being-on-call/)
- [On-Call (workbook chapter)](https://sre.google/workbook/on-call/)
- [[Incident response]]
- [Incident response (workbook chapter)](https://sre.google/workbook/incident-response/)
- [Managing incidents (chapter)](https://sre.google/sre-book/managing-incidents/)
- [Emergency response (chapter)](https://sre.google/sre-book/emergency-response/)
- [Example Incident State Document (appendix)](https://sre.google/sre-book/incident-document/)
- Requires [[Troubleshooting]] skills
- [Effective troubleshooting (chapter)](https://sre.google/sre-book/effective-troubleshooting/)
- [[Incident response lifecycle]] from the [[Computer security]] world
- Kinda related to [[Timesketch]]
- [[Incident Management for Operations (highlights)]]
- [[Mean time to repair]] (MTTR)
- [[Post mortem]] after incidents have occurred
- [Postmortem Culture: Learning from Failure (chapter)](https://sre.google/sre-book/postmortem-culture/)
- [Postmortem Culture: Learning from Failure (workbook chapter)](https://sre.google/workbook/postmortem-culture/)
- [The Human Side of Postmortems](https://learning.oreilly.com/library/view/the-human-side/9781449369538/)
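
To make the relationship between [[Service level objective]], [[Error budget]], and [[Burn rate]] from the list above concrete, here is a minimal sketch of the arithmetic. It assumes a ratio-based SLO (for example, the fraction of successful requests); the function names are hypothetical and not taken from any of the linked tools.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.999 leaves about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is consumed relative to the allowed pace.

    error_ratio is the observed fraction of bad events (the SLI's
    complement) over some lookback window.
    """
    return error_ratio / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_hours: float) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_hours / rate
```

A burn rate of 1 spends the budget exactly at the end of the SLO window; a sustained burn rate of 2 spends it in half the window. Alerting on SLOs typically means paging when the burn rate over some lookback window crosses a chosen threshold.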
## Tools and services
- [incident.io](https://incident.io/) does what I always wanted in Slack.
- [[Prometheus]] for [[Metrics]] and [[Service level indicator]].
- [[OpenSLO]] with [[Sloth]]. [[Pyrra]].
- [[Grafana]] gives humans something to look at and helps them reason about a system ([[Dashboard]]).
- [[Jaeger]] for [[Tracing]].