The Mediocre Live Forever
A practitioner's note on building, shipping, and leaving an AI tool that wrote post-incident reviews.
Daniel showed it to us on a Wednesday or a Thursday, the kind of day that has no character of its own, and within ten minutes I had the kind of look on my face that you generally only see when receiving an online purchase or a negroni at the end of a long day. The tool - and it was barely a tool at that point, more like a sermon delivered in the voice of a parole-board chatbot - could take a Slack channel full of incident chatter, a half-filled-out template, and a few prompt instructions, and produce something that looked, at squinting distance, like a post-incident review.
It was 2023. Large language models had been a public concern for less than a year, and inside the kind of company that ran on Jira and SOC 2 controls they were still treated as something between a curiosity and a compliance problem. Most enterprises had not yet decided whether they were tools or trinkets. The doom-or-salvation discourse existed elsewhere - on Twitter, in McKinsey reports, in the speeches Sam Altman was giving to anyone with a podium - but it had not yet reached the rooms where things actually got built. Inside our team, we had just spent another quarter watching engineers grind through the post-incident review process like someone chewing through a wall with their teeth. Daniel was an SRE. He had been quietly experimenting with what would later be called prompt engineering and was at that point still called typing stuff into ChatGPT until it stopped being stupid. He brought us a working prompt, untouched by legal, with no production access, no governance, and no expectation that it would amount to anything beyond a parlour trick.
Reader, it amounted to something.
I want to be precise about what we saw, because the optimism of that exchange is the entire reason the rest of this essay exists, and I am wary of telling you the story in a way that lets either us or the institution off the hook later. What we saw was a tool that could do, in maybe forty-five seconds, the part of a PIR that engineers found most miserable: the narrative reconstruction. The what-happened-in-order-with-timestamps-and-human-verbs. The part where you have to go back through Slack and translate “lol we restarted the pod” into “at 14:07 UTC the on-call engineer initiated a pod restart in the affected service”. That part. The boring part. The part nobody wanted to do.
And the demo did it. Not well, not at the level we would come to consider acceptable a year later, but well enough that you could see the shape of a future in which engineers were freed from the worst portion of an already unloved chore. Which, if you have ever been an engineer asked to write a PIR for an incident you do not really remember in the middle of a sprint you are already losing, you understand to be roughly the same scale of liberation as the invention of the dishwasher.
We took it to engineering leadership the following week. They blessed it without much theatre, which should have been a clue, and we joined the legal and privacy queue.
The legal and privacy queue is its own genre of waiting. You file the paperwork, you describe the use case, you propose a control surface, you receive twenty-three follow-up questions written by someone who has never used the product and has no plans to, and then you wait. You wait through a quarter-end. You wait through a reorg. You wait through the kind of compliance review that arrives via a ticket update from a person you have never met, to whom you are asked to explain, in non-technical terms, what a token is.
We were lucky in that our use case was relatively clean. We were not feeding the model customer data; we were feeding it internal incident chatter, which existed in a different and slightly more forgiving compliance category. We were not asking the model to make decisions; we were asking it to re-narrate, which is a thing computers had been doing in some form for decades, under names that were less marketable. So the queue was not as long as it could have been. It was, however, long enough.
So we waited. And while we waited, we built.
The prompt grew. What started as Daniel’s half-page proof-of-concept had turned into a multi-section instrument with explicit guidance for each part of the PIR template - summary, timeline, impact, root cause, contributing factors, action items. We learned the way you learn everything in prompt engineering: by watching the model fail and adjusting. The model wanted to proclaim customer names; we told it not to proclaim customer names. The model wanted to attribute blame; we told it not to attribute blame. The model wanted to use the word streamline in every other sentence; we told it the word streamline was now retired.
The thing we got right - the thing I want to underline now, before I tell you what we got wrong - was that we never asked the model to know anything. We asked it to re-arrange. The Slack logs went in. The template structure went in. The model’s job was to take a pile of inputs and produce something shaped like a PIR. Hallucination is a function of how much the model has to invent; we left it almost nothing to invent. The principle of this exercise, in case you are building one of these and would like to skip a class of failure: the model’s job is to synthesise the material you give it, not to source the material itself. If you find yourself asking the model to know things, you are asking it to lie eventually, and it will oblige.
Legal blessed it. We rolled it out. The first PIRs landed in review. The reviewers found them surprising - not because they were good, exactly, but because they existed at all, in roughly the right shape, having taken the human author perhaps twenty per cent of the time the human author would normally have spent. We had successfully automated forty-five minutes out of an unloved chore. The dishwasher metaphor from the demo, it turned out, had been roughly correct.
For a few months we felt like we had built something. I will not pretend the feeling lasted.
The first thing we noticed was the waffle. The prompt asked for concise output. The prompt asked for concise output repeatedly, in increasingly direct language, with examples of what concise looked like. The model agreed. The model agreed enthusiastically. The model then wrote summaries that took five sentences to do the work of one, in a register that one of our engineers eventually described as verbal diarrhea. It talked around itself. It explained things by length rather than brevity. The hallucination problem had been solved. The talking-too-much problem had been not just unsolved but, in some quiet way none of us yet understood, actively introduced by our own instructions.
We made a list of fixes. We did not, immediately, get to make them.
Immediately turned out to be a longer word than we’d planned.
The list of fixes went into a Jira ticket. The Jira ticket went into a backlog. The backlog went into a quarterly planning session, where it competed against twenty other things, most of which involved heads of engineering going ‘hmm’ in a concerned manner, and lost. The quarterly planning session was followed by another quarterly planning session, in which the backlog was reviewed again, and the list was, by general consensus and without any specific person being responsible, deprioritised. Nobody was opposed to fixing it. Nobody was opposed to almost anything. There was simply more shouting elsewhere.
This is a thing that happens to tools that kind of work. It is, in fact, the most reliable thing that happens to tools that kind of work. The fully-broken get fixed because the breakage is intolerable. The fully-functional get celebrated because the functioning is visible. The mediocre live forever, because the cost of revisiting them is greater than the cost of putting up with them, and because nobody who could authorise the revisit ever has to read the output themselves.
We were not idle in those months. There was always another incident, always another process to revise. We launched other things. We retired others. The team grew, shrank, grew again. We upgraded the underlying model when GPT-4o came out, because that was the kind of small win you could ship in an afternoon; we did not revisit the prompt, because that would have required time we did not have. The PIR machine kept running quietly in the background, getting copy-pasted into reviews, getting approved, getting closed. We told ourselves we would get back to it. We did not.
Here is a principle I wish I had understood eighteen months earlier than I did: prompt engineering is craft, and craft requires maintenance windows scheduled into the calendar by people who are willing to defend them. You can ship a prompt. You can be proud of the prompt. The prompt will get worse without your noticing, not because the model is changing - though the model is also changing - but because the world the prompt is describing is changing, and the prompt is not. A prompt is a snapshot of your understanding at the moment you wrote it. Without scheduled revision, it ages the way photographs age: slowly, then all at once.
Eighteen months is a long time in software. Eighteen months is also, it turns out, exactly the length of time required for a tool that kind of works to become a tool that people are quietly furious about, without anybody being quite ready to say so. The fury was there. The fury was building. It was just waiting for someone to write it down.
The person who finally wrote it down was an engineer. I will not name them. They were having a bad PIR.
Specifically, they were having a bad PIR inside a piece of internal tooling we had built on top of Jira DC, which had been cobbled together in the way that internal tooling generally is at companies large enough to have opinions about internal tooling and small enough to defer the building of it until later. The tool worked. The tool worked the way most internal tooling works, which is to say: it worked if you held it correctly. If you did not hold it correctly - if, for instance, you attempted to save an incomplete set of information and the tool decided something was missing - the tool would, in a moment of administrative malice that nobody had specifically designed but nobody had specifically prevented, wipe everything you had entered. Not save it incorrectly. Not warn you. Wipe it.
This engineer had had this happen to them. More than once, I believe. They had then turned to the AI-generated draft to finish the job, and the AI-generated draft had given them the kind of meandering, schoolboy-essay output we had been ignoring for eighteen months because, as previously established, nobody who could authorise the revisit ever had to read the output themselves.
The engineer took to Confluence.
The post compared our team - meaning specifically the team I was on, the team responsible for the PIR machine and the tooling around it - to Satan and to Hitler. The post was specific. The post named the data-wipe, the AI waffle, the time it had cost. The post did not propose solutions. The post was furious in the way that internal Confluence posts are furious, which is to say: with no structural restraint and the potential charge of physical assault on a keyboard.
It was, in many ways, the best thing that happened to the project.
Within a week, we had the authorisation we had been asking for in measured tones for eighteen months. The reason was simple, and it is the principle this section is here to land: organisations fund repair, not maintenance. Maintenance is invisible until it stops, at which point it is not maintenance, it is repair. The fury was the budget. The post was the business case. The engineer who wrote it had, without intending to, performed the single most useful act of project sponsorship the project had ever received.
We did not thank them. I sometimes wonder if we should have.
The list of fixes came out of the drawer. Some of the fixes had aged better than others - the world had moved on, the models had moved on, the assumptions had moved on. Some of them needed to be thrown out entirely. We were not going to be patching v1. We were going to be building v2. Better. Cleaner. As it should have been, quarters ago.
The person who wanted us to use a different model worked in the AI team, which is a thing companies of a certain size have. I do not begrudge the AI team. Their job is to be the responsible adult in the room while a hundred product teams attempt to do something irreversible with someone else’s compute budget. I have been the responsible adult in rooms before, and I know what it does to a person. The cheaper model on offer was a flash model, and the fans of the flash model in this case had a spreadsheet, and on the spreadsheet there were two columns, and one of the columns was cheaper than the other. The fans of the flash model in this case had not read the prompt. The fans of the flash model in this case had not read any output the prompt had produced.
So we made the case. We made it politely, in writing, with examples; and then we made it again, less politely, in person, with more examples; and then we made it a third time, in language flat enough to survive translation through three layers of management. We compared outputs. We compared the quality of the comparisons, which is a meta-step you should always include, because the cheap-model fans will otherwise quietly judge the comparison on cost rather than on whether the comparison was fair. This took weeks. It felt like longer.
The principle that buys you back the weeks, if you have to make this case yourself: the right model for the job is not always the cheapest, and the people gatekeeping cost are not always the people who will read the output. Cost gating is a perfectly reasonable function of a perfectly reasonable team that, in the absence of skin in the game, will optimise for the metric they actually have. Your job, if you are the team who has to live with the output, is to make sure your metric is visible too.
We got Claude. The model was, on every measure we cared about, better. Claude wrote sentences a human would have written. Claude knew when to stop a sentence. Claude understood, in a way that earlier models had not quite understood, that the absence of a word was sometimes the right call. The waffle disappeared.
Around the new model, we rebuilt the prompt. The monolith became a series of sections, one per PIR component, each with its own tone guidance and its own source-grounding rules. Each section returned its output with citations to the Slack timestamps that had produced it, so the author could check the work in seconds rather than minutes. We were aware, painfully so, that the timelines for some incidents - the incidents where the team had been on a Zoom call rather than typing - were thin. Things had happened. Things had been resolved. The middle was missing.
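To make the citation idea concrete, here is a minimal sketch in Python of how a section's output might be checked against the Slack log it claims to draw on. Everything specific in it is an assumption for illustration: the `[HH:MM]` citation format, the one-claim-per-line layout, the shape of the log lines, and the function names are hypothetical, not a record of the real tooling.

```python
import re

# Assumed citation format: each claim ends with the Slack timestamp it
# drew on, e.g. "[14:07]". This format is hypothetical.
CITATION = re.compile(r"\[(\d{2}:\d{2})\]")

# Assumed log format: each Slack line starts with "HH:MM ".
LOG_PREFIX = re.compile(r"(\d{2}:\d{2})\s")


def slack_timestamps(log_lines):
    """Collect the HH:MM prefix of every line in the source log."""
    stamps = set()
    for line in log_lines:
        match = LOG_PREFIX.match(line)
        if match:
            stamps.add(match.group(1))
    return stamps


def unsupported_claims(draft_section, log_lines):
    """Return draft lines that cite nothing, or cite a time not in the log."""
    known = slack_timestamps(log_lines)
    flagged = []
    for claim in draft_section.splitlines():
        claim = claim.strip()
        if not claim:
            continue  # skip blank lines between claims
        cited = CITATION.findall(claim)
        if not cited or any(stamp not in known for stamp in cited):
            flagged.append(claim)
    return flagged


if __name__ == "__main__":
    log = [
        "14:05 alice: checkout latency is spiking",
        "14:07 bob: lol we restarted the pod",
    ]
    draft = (
        "Checkout latency rose sharply. [14:05]\n"
        "The on-call engineer restarted the pod. [14:07]\n"
        "The root cause was a memory leak. [14:30]\n"
        "Customers were unaffected."
    )
    for claim in unsupported_claims(draft, log):
        print("check this:", claim)
```

Anything the function returns is a line the author should look at first: either the model cited nothing, or it cited a moment the log does not contain. It cannot tell you whether a correctly cited claim is a fair reading of the message; that part stays with the human.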
So we added Zoom transcription via Loom. This is the part of the story where I have to tell you about one of our developers, who I will call Wilson, because we need a name and Wilson is not theirs. Wilson, in a moment of operational efficiency that we should have anticipated and did not, decided that the simplest way to ensure every incident Zoom call had a transcript was to put their own Zoom account into every incident bridge. Wilson’s account would join the call silently, record it, transcribe it, and deliver the transcript to the prompt. Wilson’s account joined a number of calls before anybody noticed.
When people noticed, the response was not measured. The response was: who is recording us, why, and what will be done with the recordings. The response was immediate, unanimous, and slightly biblical. We had built, without intending to, a small in-house panopticon, and the workforce was responding the way workforces respond when they find one in their meetings. We did the work of explaining, in clinical detail, what was recorded and what was not, where the transcripts lived and who could read them, how long they were retained and how they were destroyed. We did this because we had to.
The last change we made, and the change that mattered most, was the framing. The v1 prompt had produced output that read like a PIR. The v2 prompt produced output that read like a draft of a PIR. The difference is not subtle. The v1 output had headers and bullets and a structure that invited the reviewer to read it as finished. The v2 output had headers and bullets and a banner at the top, in language we had spent the better part of a fortnight getting right, that explained: this is a guide for the author to work from. It is not the PIR. The PIR is the document the author is responsible for writing, with this as a starting point. We had not changed what the model produced. We had changed what we called it.
The reframing helped. It did not help as much as we had hoped. There is a thing humans do when given a document that is almost what they need, which I have been trying to find a polite name for.
The polite name I have been trying to find, and the one I will settle for in the absence of better, is convenience. Humans, when given a document that is almost what they need, will treat it as if it were exactly what they need. Humans will do this even when the document is labelled in friendly capital letters DRAFT NOT FINAL. Humans will do this even when the document is preceded by a banner that the team writing the document spent two weeks getting right. Humans, in the end, do not read banners. They read the document.
The links had been arriving thick and fast in my DMs long before v2 emerged - long before we had a name for the dynamic they evidenced, long before we had built anything to address it. Engineers I had not spoken to in months would forward me a link with the kind of brief, exhausted message that engineers send when they have run out of patience but have not yet run out of decorum: thought you’d want to see this. The thing I would ‘want’ to see, in every case, was an approved PIR that had been written by the model and submitted by a human who had not, by any reasonable interpretation of the word, written it. The PIRs were good enough to pass review. They were not good enough to be useful. They had been used anyway.
The reviewers, meanwhile, had developed a parallel adaptation. The reviewer’s job is to read the submitted PIR and decide whether it is adequate. The reviewer’s job, in practice, is to clear PIRs from the queue before the SLO expires, because the queue grows faster than the reviewer’s available reading time. The reviewer who is behind on the queue will scan, not read. PIRs written by the model scan beautifully - they were designed to. More often, and worse, the reviewer will not even scan, and will simply tick.
The author and the reviewer had, between them, found a workflow in which neither party was doing the human work the PIR was supposed to capture.
We built two things, and we built them quickly, because the longer this dynamic ran the more institutional memory we were quietly losing about every incident the system had processed.
The first was a copy-check. The submission flow ran a text similarity comparison between the AI-generated draft and the submitted PIR. If the similarity was too high - if the human had, in essence, copied the model’s output and called it their work - the submission was flagged, and a polite note explained that the PIR appeared to have been submitted without modification, and would the author like to take another pass. The note was polite. The note was also firm. The note had the effect of making the next twenty minutes of the author’s day moderately worse than the previous twenty had been, which was the point.
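A copy-check of this kind can be sketched in a few lines of Python using the standard library's `difflib.SequenceMatcher`. This is an illustration under stated assumptions, not the check as it was actually built: the 0.9 threshold, the whitespace and case normalisation, and the function names are all hypothetical, and a production version might well use a different similarity measure.

```python
from difflib import SequenceMatcher

# Illustrative threshold only; the right value would need tuning on real PIRs.
SIMILARITY_THRESHOLD = 0.9


def normalise(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())


def similarity(draft, submission):
    """Return a ratio between 0.0 and 1.0; 1.0 means the texts are identical."""
    matcher = SequenceMatcher(
        None, normalise(draft), normalise(submission), autojunk=False
    )
    return matcher.ratio()


def looks_unmodified(draft, submission, threshold=SIMILARITY_THRESHOLD):
    """True when the submission is close enough to the draft to flag."""
    return similarity(draft, submission) >= threshold


if __name__ == "__main__":
    draft = (
        "At 14:07 UTC the on-call engineer initiated a pod restart "
        "in the affected service."
    )
    pasted = "  at 14:07 UTC the on-call engineer initiated a pod restart\nin the affected service.  "
    if looks_unmodified(draft, pasted):
        print("This PIR appears to have been submitted without modification.")
```

Setting `autojunk=False` switches off a heuristic in `SequenceMatcher` that treats very common characters as noise in long inputs, which is helpful for code diffs and less so for prose. Character-level comparison is slow on very long documents; comparing section by section is one way to keep it quick.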
The second was a scoring system. We took every PIR from the previous two years that had been written for a sev one incident - the most-scrutinised, most-rewritten, most-stakeholder-edited PIRs we had - and we distilled them into a rubric. What did the strong ones do that the weak ones did not? What sections were always present, always specific, always actionable? What kinds of sentences did the gold-standard PIRs avoid? The rubric became an automated audit. Every submitted PIR would be scored against it. The score was visible to the author. The score was visible to the reviewer. The score did not block submission; it simply existed, on the page, in numbers, in places where numbers had not been before.
The rubric did two things. For the author, it provided immediate feedback on where the PIR was thin - your timeline lacks specificity, your contributing factors are not distinguished from your root cause, your action items have no owners - at the moment they could still fix it. For the reviewer, it provided immediate triage on where to focus their attention - this PIR scores well on timeline, poorly on action items, you can probably skip ahead and concentrate on the second half. The first audience used the rubric to write better. The second audience used the rubric to read more efficiently. Both, crucially, were doing more of the human work and less of the rubber-stamping.
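As a rough sketch of how a rubric like this might be automated, the Python below scores a PIR against three checks that mirror the feedback quoted above. The checks themselves, the dictionary layout of the PIR, and every name in the code are assumptions made for illustration; the real rubric was distilled from two years of sev one PIRs and is not reproduced here.

```python
import re


def timeline_is_specific(pir):
    """Assumed check: at least two timeline entries start with an HH:MM time."""
    entries = pir.get("timeline", [])
    stamped = sum(1 for entry in entries if re.match(r"\d{2}:\d{2}\b", entry))
    return stamped >= 2


def factors_distinct_from_root_cause(pir):
    """Assumed check: contributing factors exist and do not restate the root cause."""
    root = pir.get("root_cause", "").strip().lower()
    factors = [f.strip().lower() for f in pir.get("contributing_factors", [])]
    return bool(root) and bool(factors) and root not in factors


def action_items_have_owners(pir):
    """Assumed check: there is at least one action item and each has an owner."""
    items = pir.get("action_items", [])
    return bool(items) and all(item.get("owner") for item in items)


# Hypothetical rubric: section name -> (check, feedback shown when it fails).
RUBRIC = {
    "timeline": (timeline_is_specific, "your timeline lacks specificity"),
    "contributing_factors": (
        factors_distinct_from_root_cause,
        "your contributing factors are not distinguished from your root cause",
    ),
    "action_items": (action_items_have_owners, "your action items have no owners"),
}


def score(pir):
    """Return (passed, total, feedback); feedback maps section -> message or None."""
    feedback = {}
    for section, (check, message) in RUBRIC.items():
        feedback[section] = None if check(pir) else message
    passed = sum(1 for message in feedback.values() if message is None)
    return passed, len(RUBRIC), feedback


if __name__ == "__main__":
    example = {
        "timeline": ["14:05 latency alert fired", "14:07 pod restarted"],
        "root_cause": "connection pool exhausted",
        "contributing_factors": ["no alert on pool saturation"],
        "action_items": [{"text": "update runbook", "owner": ""}],
    }
    passed, total, feedback = score(example)
    print(f"{passed}/{total}", feedback)
```

The function reports and never raises, which matches the design choice described above: the score sits on the page for author and reviewer alike, and does not block submission.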
Here is the principle, and I want to be careful with it because it is the one that mattered most and the one I most wish I had understood at the start: the predictable failure mode of automation is that humans will use the automation to skip the work the automation is supposed to assist with, not replace. If your tool makes a job faster, your tool will be used to skip the job. If your tool makes a job easier, the job will be skipped. If your tool produces an output that is almost the deliverable, the output will become the deliverable. This is not a failure of the humans. This is not a failure of the tool. This is the predictable interaction between a labour-saving device and a labouring human, and you need to design against it from day one. Not on day three hundred and sixty, after the fury has built and the engineer has taken to Confluence.
The copy-check has, I hope, shipped by now, as has the rubric. PIR quality, by every measure we had, should begin to recover. We had built, finally, a tool that knew how to coexist with the humans using it. We had built it after eighteen months of drift, one furious Confluence post, a flash-model fight, an accidental panopticon, and a framing rewrite that had nearly worked.
Reader, I'll never know for certain. I was made redundant three weeks later.

> The mediocre live forever, because the cost of revisiting them is greater than the cost of putting up with them.
So much this.
> The reframing helped. It did not help as much as we had hoped.
Agreed. Draft _was_ final in many cases.
> If your tool makes a job faster, your tool will be used to skip the job. If your tool makes a job easier, the job will be skipped. If your tool produces an output that is almost the deliverable, the output will become the deliverable.
The tool made the PIR document faster to build. Before the tool, creating the document forced the author to digest the problem. But that was only half the battle. The second and most important half was to understand what happened, identify missing metrics or alerts, identify weaknesses in the system, and develop actions to improve both the affected area and the broader case. Because the digestion was outsourced to a tool, the second half became much harder to get right.