Site Reliability Engineering MOC

[[🍀 Home]] ➤ [[🗺️ Maps of Content]]

---

- [Site Reliability Engineering at Google • Christof Leng • GOTO 2018](https://www.youtube.com/watch?v=d2wn_E1jxn4)

## Start

- [[Building a Site Reliability Engineering Team]]

## News

- [SRE WEEKLY – scalability, availability, incident response, automation](https://sreweekly.com/)
- [SRE subreddit](https://old.reddit.com/r/sre)
- [[Articles about Site Reliability Engineering]]

## Learning resources

- [[SREcon]], [[ObservabilityCON]], [[Monitorama]], [[o11yfest]], [[PromCon]], [[KubeCon]] and [[SLOconf]] (see [observability.events](http://observability.events/))
- [[Site Reliability Engineering (highlights)]]
- [Our favorite (top?) SRE talks](https://engineering.zenduty.com/blog/2020/07/02/our-favorite-sre-talks)
- [DevOps Vs. SRE: Competing Standards or Friends? (Cloud Next '19)](https://www.youtube.com/watch?v=0UyrVqBoCAU)
- [Life of an SRE at Google](https://www.youtube.com/watch?v=7Oe8mYPBZmw)
- [Incident Management at GitLab](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#corrective-actions) is very good
  - Same with [GitLab.com / runbooks · GitLab](https://gitlab.com/gitlab-com/runbooks)
  - [Source of the handbook](https://gitlab.com/gitlab-com/www-gitlab-com/-/blob/master/sites/handbook/source/handbook/engineering/infrastructure/incident-management/index.html.md)

The diagram from GitLab's Incident Management Handbook:

``` mermaid
graph TD
  A(Incident is declared) --> |initial severity assigned - EOC and IMOC are assigned| B(Incident::Active)
  B --> |Temporary mitigation is in place, or an alert silence is added| C(Incident::Mitigated)
  B --> D
  C --> D(Incident::Resolved)
  D --> |severity is re-assessed| D
  D -.-> |for review-requested incidents| E(Incident::Review-Completed)
```

### Talks

[Getting Started with SRE](https://www.youtube.com/watch?v=c-w_GYvi0eA).

## Concepts

- [[Service level indicator]]
  - [Service Level Objectives (chapter)](https://sre.google/sre-book/service-level-objectives/)
  - Based on _data_ (which [[Prometheus]] is good for)
- [[Service level objective]] relies on [[Service level indicator]]
  - [Service Level Objectives (chapter)](https://sre.google/sre-book/service-level-objectives/)
  - [Implementing SLOs (workbook chapter)](https://sre.google/workbook/implementing-slos/)
  - [SLO Engineering Case Studies (workbook chapter)](https://sre.google/workbook/slo-engineering-case-studies/)
  - [Implementing Service Level Objectives](https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/)
  - [[OpenSLO]] and [[Sloth]]
  - [[SLOconf]]
  - [Nobl9 Service Level Objectives Platform: Reliability At Your Service](https://nobl9.com/)
  - [[Pyrra]]
  - [Promtools](https://promtools.dev/alerts/errors) based on [slo-libsonnet](https://github.com/metalmatze/slo-libsonnet)
- [[Error budget]]
  - [[Burn rate]] of the error budget
  - [Alerting on SLOs (workbook chapter)](https://sre.google/workbook/alerting-on-slos/)
- [[Tracing]]
  - Last priority
- [[Toil]]
  - [Eliminating toil (chapter)](https://sre.google/sre-book/eliminating-toil/)
  - [Eliminating toil (workbook chapter)](https://sre.google/workbook/eliminating-toil/)
- [[Monitoring]]
  - [Monitoring (workbook chapter)](https://sre.google/workbook/monitoring/)
  - [[The Four Golden Signals]]
  - [[Dashboard]]
  - [[Alert]]
  - [[White-box monitoring]]
  - [[Black-box monitoring]]
  - [[Root cause]]
  - [Practical Alerting from Time-Series Data (chapter)](https://sre.google/sre-book/practical-alerting/)
- [[Hierarchy of Production Needs]]
- [[On-call]]
  - [Being on-call (chapter)](https://sre.google/sre-book/being-on-call/)
  - [On-Call (workbook chapter)](https://sre.google/workbook/on-call/)
- [[Incident response]]
  - [Incident response (workbook chapter)](https://sre.google/workbook/incident-response/)
  - [Managing incidents (chapter)](https://sre.google/sre-book/managing-incidents/)
  - [Emergency response (chapter)](https://sre.google/sre-book/emergency-response/)
  - [Example Incident State Document (appendix)](https://sre.google/sre-book/incident-document/)
  - Requires [[Troubleshooting]] skills
    - [Effective troubleshooting (chapter)](https://sre.google/sre-book/effective-troubleshooting/)
  - [[Incident response lifecycle]] from the [[Computer security]] world
    - Kinda related to [[Timesketch]]
  - [[Incident Management for Operations (highlights)]]
  - [[Mean time to repair]] (MTTR)
- [[Post mortem]] after incidents have occurred
  - [Postmortem Culture: Learning from Failure (chapter)](https://sre.google/sre-book/postmortem-culture/)
  - [Postmortem Culture: Learning from Failure (workbook chapter)](https://sre.google/workbook/postmortem-culture/)
  - [The Human Side of Postmortems](https://learning.oreilly.com/library/view/the-human-side/9781449369538/)

## Tools and services

- [incident.io](https://incident.io/) does what I always wanted in Slack.
- [[Prometheus]] for [[Metrics]] and [[Service level indicator]].
- [[OpenSLO]] with [[Sloth]].
- [[Pyrra]].
- [[Grafana]] for humans to have something to look at and improve thinking about a system ([[Dashboard]]).
- [[Jaeger]] for [[Tracing]].
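
To make the [[Error budget]] and [[Burn rate]] entries under Concepts a bit more concrete, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, and it assumes a simple request-based SLI (failed requests over total requests); tools such as [[Sloth]] or [[Pyrra]] express the same idea as Prometheus rules rather than in code like this.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. roughly 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A burn rate of 1.0 uses up the budget exactly over the whole SLO period;
    anything above 1.0 uses it up early.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float, period_hours: float) -> float:
    """Share of the total error budget spent during one window at a given burn rate."""
    return rate * window_hours / period_hours


if __name__ == "__main__":
    # Hypothetical numbers: 144 failures out of 10,000 requests against a 99.9% SLO.
    rate = burn_rate(failed=144, total=10_000, slo_target=0.999)
    # One hour at that rate, measured against a 30-day (720 hour) SLO period.
    spent = budget_consumed(rate, window_hours=1, period_hours=720)
    print(f"burn rate {rate:.1f}, budget spent in 1h: {spent:.1%}")
```

A burn rate of about 14.4 sustained for one hour spends about 2% of a 30-day budget, which is the kind of threshold the Alerting on SLOs workbook chapter uses for paging alerts.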