The Room and the Patient
On incident commanders, operating theatres, and the discipline we have not built.
It was morning, my time. For the customer it was the back end of an afternoon at the end of several days that had not gone well. The incident had been running long enough that nobody on the bridge could remember which version of the timeline was current, only that it had been grinding along for a few days now without a fix and the room had stopped treating it as urgent.
There were reasons for this and the reasons were structural. A single-tenant issue does not look like an incident to a system whose metrics are calibrated to blast radius. The dashboards do not light up. The all-hands channels do not move. Only one tenant, the team had said at some point in the previous days. Not unkindly. Not maliciously. Accurately. In the language another industry would use it would have been only one patient, and an emergency room with one patient runs at a different tempo than an emergency room with forty. The room’s tempo adjusts to its own load. The room knew its load to be small.
We do not bring customers onto technical bridges. This is policy and there are reasons for the policy. The reasons are that customers, when present, ask questions that the engineering team cannot answer while still solving the problem, and that the presence of an angry account on the call corrodes the focus required to do the work. The policy is correct.
The team in the room was good. I want this on the record. The engineers were senior, the work was real, the problem was not trivial. They were debugging in the way that twenty years of practice had taught them to debug. They were communicating with each other in the channel in technical detail and at appropriate intervals. They were doing all of this at the cadence the room considered appropriate to the size of the room’s problem, which is to say they were not hurrying, because the room’s problem was not large. By any measure the room had its own metrics for, the room was working.
What was not working was anything the metrics did not see. The customer had not been spoken to in several hours. Not, at any rate, in the way a customer four days into an unresolved incident needs to be spoken to. The customer service representatives joining the bridge to ask for updates were being given updates that were technically accurate and operationally useless, because the engineers were oriented toward the fix and the comms were a thing the engineers did between debugging steps when they remembered to. The reps then took the technical updates back to the account, where they were translated and softened and stripped of the specifics that might have made them informative, and then relayed to a senior contact at the customer who had been on the phone, on and off, for multiple working days.
I joined the bridge after the head of customer service and support called me into it. They had picked up the customer’s call directly, several minutes earlier, after a senior contact at the account had bypassed every reasonable escalation path the company offered and gone directly to the senior-most person whose job title contained the word customer. By the time I joined the call they were already on it.
They were cracking skulls. Diplomatically - they were choosing their words carefully, carefully enough to stay short of the line that would have required a follow-up conversation with HR, but not so carefully that anyone on the bridge could pretend they did not understand what was being communicated. The questions they were asking were structured around customer experience and resolution timeline. These were not, technically, accusations. The engineers in the room understood them as accusations, because they were.
I have a policy of not bigfooting incident calls. The incident commander is supposed to have authority and I am supposed to leave it with them. The policy is one I still defend in most circumstances. It also meant, in the days leading up to this bridge, that I had not been watching it. There were other bridges. There were larger incidents. Somewhere in the structure that allocated my attention there was a working assumption that a bridge running this slowly did not need senior process attention, because if it had needed it the metrics would have said so. The metrics had not said so. Until somebody picked up a phone, neither had I.
The fix was finalized within a few days. The metrics will record it as a SEV-2 with an extended duration but a clean technical resolution. The bridge had ended in success.
There have been others. The names are different and the systems are different and the day of the week is different but the shape of the failure is not. A room calibrating its tempo to its own metrics. A patient outside the room. A comms chain that loses fidelity at each translation. An executive whose unannounced arrival is what produces movement. Perhaps four hundred of them, by now, across a career. I have stopped counting.
What was missing from those bridges has a name. We have not been using it.
The role is incident commander. The name is in widespread use. What is not in widespread use is the discipline the name is supposed to describe.
We use the title two ways. One is the senior engineer or engineering manager who is on-call when the page lands and is therefore, by default, the person on whom coordination falls. The other is a person whose entire role is to coordinate the incident - whose authority derives from the role rather than from seniority, and whose orientation is toward the patient rather than the room. The first version is a rotation. The second is a discipline. They are not the same thing.
The rotational incident commander has authority, but it is the wrong shape. A senior engineer running a bridge has technical authority - the room defers to them on how to fix the thing. An engineering manager running a bridge has team authority - their reports execute and their peers cooperate to whatever degree the org chart governs. Neither of these is the authority the role actually requires, which is coordinative - the standing to direct attention across functions whose hierarchies the IC does not sit inside, the position from which to tell customer success and product and the VP that the room will hold for ninety seconds while a decision is made about comms. Coordinative authority is bounded. It is not the authority to make the technical call or the team call or the political call. It is the authority to hold the coordination of the people who do.
It is also oriented.
My wife works in surgery. The temperature in the operating theatre is set by someone who is not in the room. The surgeon can have it changed, but they cannot change it themselves; they must communicate the need to whoever holds that role, who then makes the adjustment. The setting is not chosen for the comfort of the people present - it is chosen for the patient. The room is held cold for the patient even though the surgeons would prefer it warmer, because the room is not for the surgeons. The role that holds the temperature is the role oriented toward the patient and away from the room.
The rotational IC has neither the bounded coordinative authority nor the patient orientation. They are of the room. They have been pulled into the role from the room, and they will return to the room when the incident is over. Their measures of success are the room’s measures. Their relationships are the room’s relationships. The patient - the customer, the user - has no advocate in the room because no role in the room is structured to be one. So the room defaults to its own metrics. The metrics record the incident as resolved. The patient leaves, some weeks or months later, for reasons the system that knew about the incident will not track.
The objection to all of this is one I have held myself, in some version, for years. The policy of not bigfooting incident bridges contained the assumption that the model in place was structurally sound. I would not have written this essay without first having to admit that it was not.
The objection runs roughly: coordination is a leadership skill that any senior person can develop. The rotational model works because the people in it are senior enough to hold coordinative authority and patient orientation alongside their other contributions. The cases where it fails are cases where the wrong person was in the rotation, or the org has not trained well enough, or the IC was having a bad day. The fix is better people in the existing model, not a new discipline.
This has truth. The rotational model works in small organisations. It works in tightly knit teams whose incidents are bounded and whose customers are few enough to be visible to everyone in the room. It works when the volume is low enough that the people in the rotation can hold coordinative authority and outcome orientation alongside their other work without strain. In those contexts, the senior-engineer-on-rotation is the right answer.
The contexts where it fails are the contexts where it most needs to work.
The surgeon cannot adjust the temperature themselves. To change it they have to relay the request to someone whose role is to hold the temperature, who then makes the change and confirms it back. The architecture of the operating theatre enforces what the surgical discipline already requires: that orientation toward the operation and orientation toward the temperature are different orientations, held by different people, communicated across a deliberate boundary.
Software has built no equivalent architecture. We put the manager-as-IC and the senior-engineer-as-IC inside the room and ask them to hold both orientations at once, under pressure, for several hours or several days. They are competent. They are senior. They cannot hold both. One orientation loses. The patient drifts out of frame in small increments until the room is being run for the room.
The rotational model works until the room and the patient diverge. They always diverge. By the time they have, the room has built its metrics around its own comfort.
What this costs, when we get it wrong consistently, does not appear in any single retrospective.
A customer who churns six weeks after a bridge that the metrics said had gone fine. A customer service rep who had been trying to flag what nobody on the bridge would hear, who is reviewed at the end of the year on a metric that does not include having been right. An engineer who runs incidents the way their seniority equips them to run them, who burns out from a role nobody has named and that nobody is going to thank them for. An incident that resolves cleanly on the metrics and quietly poisons three account relationships because the room could not see what the room was being measured against.
None of this is what the postmortem says happened. The postmortem says the fix shipped, the TTR was acceptable despite the extended duration, the on-call rotation worked as designed. The room’s metrics record the room’s experience. The patient leaves quietly some weeks or months later for reasons the system that knew about the incident will not track.
What is missing has a name. We could give it the discipline to match.
The role would be coordinative rather than commanding. Bounded rather than ultimate. Oriented toward the patient by the structure of the role itself, not by the goodwill of whoever is awake when the page lands. It would be hired for, trained for, and protected from being collapsed back into the rotation. It would have authority over coordination and not over the work being coordinated. It would not be the senior engineer or the engineering manager, although either of them might do it well if they were trained for it and given it as their full role.
We have built the architecture for this in other industries. We have not built it in ours.
There is a bridge open somewhere right now. It is well-run. The metrics, when they are recorded, will say it ended in success.