Step 1: turn off your pager. Step 2: disable the monitoring system. Or... you can run oncall using modern methodologies that constantly improve the reliability of your system.
I'm teaching a tutorial at Usenix LISA called "How To Not Get Paged: Managing Oncall to Reduce Outages".
I'm excited about this class because I'm going to explain a lot of the things I learned at Google about how to turn oncall from a PITA to a productive use of time that improves the reliability of the systems you run. Most of the material is from our new book, The Practice of Cloud System Administration, but the Q&A always leads me to say things I couldn't put in print.
Seating is limited. Register now!
How To Not Get Paged: Managing Oncall to Reduce Outages
Who should attend:
Anyone with an oncall responsibility (or their manager).
When/Where:
Tuesday, 11-Nov, 1:30pm-5pm at Usenix LISA
Description:
People think of "oncall" as responding to a pager that beeps because of an outage. In this class you will learn how to use oncall as a vehicle to improve system reliability so that you get paged less often.
Take back to work:
- How to monitor more accurately so you get paged less
- How to design an oncall schedule so that it is more fair and less stressful
- How to ensure preventive work and long-term solutions get done between oncall shifts
- How to conduct "Fire Drills" and "Game Day Exercises" to create antifragile systems
- How to write a good post-mortem document that communicates clearly and prevents future problems
Topics include:
- Why your monitoring strategy is broken and how to fix it
- Building a more fair oncall schedule
- Monitoring to detect outages vs. monitoring to improve reliability
- Alert review strategies
- Conducting "Fire Drills" and "Game Day Exercises"
- "Blameless Post-mortem documents"