Etiquette for the Burning Building
They look like manners. They are not.
What follows is not a framework. There is no certification at the end of it. These are the rules I wish someone had handed me in the first week of running major incidents - rules I have since watched colleagues learn one by one, at the cost of one Robert at a time. They look like manners. They are not.
On bridges and the people who join them
If you are an IC and you do not control the bridge, the bridge controls you. Set the cadence, name the speakers, call the stand-downs. Otherwise the loudest voice runs the response, and the loudest voice is rarely the right one.
The executive who joins the bridge to “help” is not helping. The executive who joins the bridge to “observe” is not observing. There is no observer mode. Every additional person on a bridge costs the IC roughly 8% of her remaining cognitive function, and she has already spent the other 92% on you.
The head of engineering who is irritating everyone by ensuring things are truly fixed and every base is covered is doing the work nobody else will. It is not pleasant. It is thorough.
If you join the bridge, say your name and your function within ten seconds. The IC is not a clairvoyant. She is tracking six responders, a Splunk dashboard, and a Senior Director who keeps unmuting to ask “where are we at.”
Scribe as you go. The first ten minutes of an incident are seven different teams logging on and asking the same three questions. If the answers are not already in the channel, you will be repeating them yourself, in real time, while also running the response.
Do not engage an individual directly. Always go through the on-call roster, lest you disturb Robert for the seventh time, a disturbance that will likely end in his departure from the company three weeks later.
PR and Legal get a bridge of their own. They serve a real purpose, and that purpose is not asking the engineers what they were thinking at 2:14am while they are still thinking it. Once the technical bridge becomes a discussion of who knew what and when, it is no longer a war room. It is a deposition.
The senior engineer who knows the answer but doesn’t speak up because he’s “not on this rotation” is not being humble. He is being expensive. Speak up or log off.
Never assume during an incident. There is no such thing as a stupid question. There are only stupid assumptions, made by people who did not ask one.
Fifteen minutes. If a paged engineer hasn’t responded in fifteen, escalate. The clock is not a moral instrument; every engineer has missed a page. Anna once waited twenty-eight minutes for Damien out of politeness - the customer noticed at minute forty-one. Damien had been mowing the lawn.
On the language of incidents
Do not type “should be resolved” in any channel. “Should” is an admission. “Is” is a commitment. Pick one and live with it.
Warm handovers only. Revenge is a dish best served cold; a handover is not.
An ETA is not an estimate. An ETA is a vow. Do not bring an ETA into a war room unless you intend to be married to it, in sickness and in 4am Slack pings, until rollback do you part.
“Quick question” is reserved for things that are both quick and questions. Almost nothing qualifies. Almost nothing.
Assumptions will ruin you. The quickest way to surface the right answer is to loudly proclaim - or scribe - the wrong one. The corrections arrive fast.
Every engineer eventually causes a major incident. It is the rite of passage. Affix no blame while the fix is happening. Everyone is human.
If a service falls over and nobody is told, does it make a sound? Yes - louder than the outage, and longer-lasting. Send the comms before the fix lands, when it lands, and on resolution. Silence is not modesty. It is a second incident, and you do not control it.
When in doubt, shotgun. Page every team that could plausibly own the fault and let them stand down as they clear themselves. The alternative is finding the right team at minute eighty-nine, having spent the first eighty-eight on a polite tour of the wrong on-call rosters.
On the politics of severity
Severity is a description, not a negotiation. After mitigation, a Sev 1 becomes a Sev 2 - the bleeding has stopped, and the work that remains needs hours rather than a war room. Before mitigation, talking it down because “we can manage it in business hours” is administrative violence performed with a calendar invite.
Pages have a half-life. Every Sev 1 that turns out to be a Sev 3 increases the response time on the next real Sev 1, and on the one after that.
Mitigated is not resolved. Mitigated means the bleeding has stopped. Resolved means there are no loose ends. Close at mitigation and you will reopen the same incident in two hours’ time.
Do not resolve the incident until the customer confirms it is resolved. Until then, you have only resolved the symptom you can see from where you are standing, which is rarely where the customer is standing.
A Sev 1 called at 4am is a parachute pull, not an escalation. Do not ask why it wasn’t called at 2am - that question is for the retro, and the retro will be brutal enough. Maria got asked it on the bridge once. Maria now works in product management.
On heroes and the cost of them
No incident process or toolset is ever good. Get the duct tape, grit your teeth, and run the response with what you have. The perfect tool is always two quarters away.
Heroes get singled out. Heroes get burnt out. Heroes leave. If you are watching one person fix the incident alone, you are not running a war room - you are running a hospice, and the patient is your retention rate.
The IC is also the scribe. Every off-topic message in the channel is a tax on her bandwidth and a hole in the timeline. The gaps in the PIR you’ll skim in two weeks are not Priya’s failure. They are yours.
Twelve hours. No engineer stays on a bridge longer than that. After twelve, they are not an engineer. They are a liability with a Slack handle and a degraded sense of what “safe to deploy” means at 3am. Owen made it to hour nineteen. He authorised the rollback that became INC-1843. Owen has not returned.
On what comes after
Incident metrics are never accurate at resolution. If you do not revisit them during the PIR, congratulations - your metrics are a fable, and the moral is whatever the dashboard says it is.
Blameless does not mean toothless. A PIR that cannot say the word “we” has nothing to say at all.
A PIR without action items, owners, and due dates is theatre. “We will learn from this one” is a New Year’s resolution - sincere in January, gone by February, repeated word-for-word at the next PIR.
The post-incident review is not a trial. The post-incident review is also not a group hug. It is what the Greeks would have called catharsis, if the Greeks had ever had to roll back a deployment at 11pm on a Friday.
Close a bridge without a debrief and you have built a boomerang. Confirm who owns what, write it down, then close. Real boomerangs return to the thrower; this one returns to whoever is on call next.
None of this is etiquette in the usual sense of the word. The rules exist because Maria now works in product management, Owen never came back from hour nineteen, and Robert left three weeks after the seventh disturbance. Politeness - the well-meaning, professional, didn’t-want-to-bother kind - is what put them there. The protocol is what’s left when you take it out. Print the list. There will be another Robert. The list cannot save them all, but it can save the one whose name you have not learned yet.





