The AI Runbook pattern
In my previous post on Headless Agents in CI/CD, I talked about two patterns for using AI in pipelines — a fully agentic workflow and an agent-accelerated workflow. In this post, I will explore a specific application of this paradigm to operational runbooks — what I call the AI Runbook pattern.
Background
One of my first jobs was in Ops where I was responsible for responding to alerts, triaging incidents, and keeping the system running. We had detailed runbooks — documents authored by SMEs — that outlined the steps to take when certain alerts fired. They were our go-to resource for resolving these alerts. They also had clear instructions to page an SME to take further action if the runbook steps didn’t resolve the issue. Over time, I transitioned into the SME role and it was now my turn to be on call and get paged by someone following my runbooks.
A few key things stood out to me in that experience:
- Runbooks were a mix of commands to execute and branches to follow based on their outputs. As an Ops Engineer, I focused on the commands and their outputs, often failing to capture the reasoning behind what I did next. As an SME, I found that the reasoning was often more important than the commands themselves. The command outputs changed depending on the time of day, the state of the system, and other factors. The runbook steps didn’t always capture that nuance.
- The runbooks were only as good as the last time they were updated. If the system changed and the runbook wasn’t updated, it could lead to confusion and delays in resolution.
- A significant set of alerts would be resolved with “run the runbook again.” In most cases, the alerts fired due to a release or events that the Ops teams weren’t in the loop for. While runbooks instructed the Ops team to review another system, the interpretation was left to the human and their experience within the company.
The AI Runbook pattern opens new possibilities. Done right, it could provide an Ops Engineer persona to an SME or an SME persona to an Ops Engineer. From the Ops side, the agent does the investigative work — reasoning over multiple systems, capturing diagnostics, surfacing a recommendation — so an engineer working multiple alerts at once can stay on top of all of them. From the SME side, they get a consistent report that captures the nuance of the situation and the reasoning behind it, clear enough to use as the basis for updating the runbook for next time.
What makes it different from standard automations?
Automation is appropriate when the process is known, programmable, and safe to apply without review. Consider a Kubernetes pod crashlooping due to an OOM condition. If your response is always to increase the memory limit and redeploy, you can automate that. The decision tree is short, the inputs are measurable, and the failure mode of getting it wrong is understood.
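To make the "short decision tree" concrete, here is a minimal sketch of that kind of fixed response. The `CrashLoopAlert` payload, the 1.5x growth factor, and the 4096 MiB ceiling are all assumptions for illustration, not values from a real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashLoopAlert:
    """Hypothetical alert payload; field names are illustrative."""
    workload: str
    memory_limit_mib: int


def next_memory_limit(alert: CrashLoopAlert, factor: float = 1.5,
                      ceiling_mib: int = 4096) -> Optional[int]:
    """Fixed response: raise the limit by `factor`, capped at `ceiling_mib`.

    Returns None once the cap is reached, meaning "stop and page a human".
    """
    if alert.memory_limit_mib >= ceiling_mib:
        return None
    return min(int(alert.memory_limit_mib * factor), ceiling_mib)


alert = CrashLoopAlert(workload="checkout", memory_limit_mib=512)
print(next_memory_limit(alert))  # 768
```

The rule only looks at the alert and the current limit. It never asks why the pod is restarting, which is fine as long as the cause is always the same.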
Now consider the same alert firing for a different reason. The new release has a misconfigured liveness probe — a new engineer copied an old config without realising the environment had different requirements. The automation runs, increases memory limits, and redeploys. Nothing improves. The alert fires again. The automation runs again. The on-call engineer wakes up to a loop, not a resolution.
This is where the AI Runbook pattern is different. The agent doesn’t execute a fixed response. Instead, it reads the deployment history, inspects the probe configuration, correlates the timing of the alert with the recent release, and surfaces the misconfiguration. The output isn’t an action. It’s a diagnosis: what happened, why it probably happened, and what a human should do next. The engineer reviews it, confirms it, and acts, working from a one-pager instead of a ticket with a long comment thread.
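Two of those investigative steps, correlating the alert with a recent release and inspecting what changed in the probe, have a deterministic core that can be sketched in a few lines. The `Release` record, the 30-minute window, and the sample data are assumptions made up for this example. The agent’s contribution is deciding that this is the evidence worth looking at and explaining what it means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Release:
    """Hypothetical deployment-history record with simplified probe settings."""
    version: str
    deployed_at: datetime
    liveness_probe: Dict[str, object]


def release_before_alert(releases: List[Release], alert_at: datetime,
                         window: timedelta = timedelta(minutes=30)) -> Optional[Release]:
    """Return the latest release deployed within `window` before the alert, if any."""
    candidates = [r for r in releases
                  if timedelta(0) <= alert_at - r.deployed_at <= window]
    return max(candidates, key=lambda r: r.deployed_at, default=None)


def probe_changes(previous: dict, current: dict) -> dict:
    """Map each changed probe setting to its (old, new) pair."""
    keys = sorted(previous.keys() | current.keys())
    return {k: (previous.get(k), current.get(k))
            for k in keys if previous.get(k) != current.get(k)}


# Made-up sample data for illustration only.
releases = [
    Release("1.4.0", datetime(2024, 5, 1, 9, 0),
            {"path": "/healthz", "initialDelaySeconds": 30}),
    Release("1.5.0", datetime(2024, 5, 1, 14, 0),
            {"path": "/health", "initialDelaySeconds": 5}),
]
suspect = release_before_alert(releases, datetime(2024, 5, 1, 14, 10))
print(suspect.version)  # 1.5.0
print(probe_changes(releases[0].liveness_probe, suspect.liveness_probe))
# {'initialDelaySeconds': (30, 5), 'path': ('/healthz', '/health')}
```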
The distinction is this: full automation is for situations where you know the answer before the alert fires. The AI Runbook pattern is for situations where finding the answer is the work.
How I see it working in practice
The execution flow is straightforward. An alert fires and triggers a pipeline step. The agent receives two things as context: the runbook the Ops team follows and the alert data. It then gathers additional context autonomously by querying APIs, reading logs, and correlating events across systems. Platforms like Octopus Deploy can go further, dropping the agent directly onto the deployment target to pull live state if needed. The agent produces a markdown report: a diagnosis, the reasoning behind it, and a recommended course of action.
sequenceDiagram
participant P as Alert / pipeline
participant A as AI agent
participant E as External systems
participant H as Human
P->>A: Alert + runbook + alert data
A->>E: Query logs, APIs, live state
E-->>A: Logs, metrics, config, state
Note over A: Correlate and reason
A->>H: Diagnosis + recommendation report
alt Human acts directly
H->>H: Takes action
else Human approves follow-up run
H-->>A: Approve follow-up execution
A->>E: Act on target system
end
A-->>P: Runbook improvement suggestions
That report is the handoff point. The human reviews it and decides what happens next — either they act on the recommendation directly, or they approve the agent to proceed with a follow-up execution. Either way, the human stays in the loop for the consequential decision.
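The handoff can be sketched as a small report type plus an explicit approval gate. The `Report` fields mirror the three parts of the report described above. The class names, the `Decision` enum, and `may_agent_act` are hypothetical names for this sketch, not part of any particular platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    ACT_DIRECTLY = "human acts directly"
    APPROVE_FOLLOW_UP = "human approves follow-up run"


@dataclass
class Report:
    alert: str
    diagnosis: str
    reasoning: List[str]
    recommendation: str

    def to_markdown(self) -> str:
        """Render the report as the markdown one-pager the human reviews."""
        lines = [f"# {self.alert}", "", "## Diagnosis", self.diagnosis,
                 "", "## Reasoning"]
        lines += [f"- {step}" for step in self.reasoning]
        lines += ["", "## Recommendation", self.recommendation]
        return "\n".join(lines)


def may_agent_act(decision: Optional[Decision]) -> bool:
    """The agent touches the target system only after an explicit approval."""
    return decision is Decision.APPROVE_FOLLOW_UP


report = Report(
    alert="PodCrashLooping: checkout",
    diagnosis="Liveness probe misconfigured in the latest release.",
    reasoning=["Restarts began minutes after the release.",
               "Probe path differs from the previous release."],
    recommendation="Restore the previous probe path and redeploy.",
)
print(report.to_markdown())
print(may_agent_act(None))  # False: no decision yet, so no action
```

Defaulting to "no action" when no decision has been recorded keeps the consequential step with the human, even if the approval step is skipped or fails.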
Over time, the pattern compounds. After enough executions, another agent can review the accumulated incidents and start recommending improvements to the runbook itself: surfacing gaps, suggesting new branches, and proposing clearer instructions based on cases the original authors never anticipated. The runbook gets better with use rather than decaying between updates.
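One simple input to that review is a tally of diagnosed causes that no runbook branch covers. The sketch below assumes each past report has been reduced to a `root_cause` label and that the runbook’s branches can be listed as labels too. Both are simplifying assumptions, and the threshold of two is arbitrary.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def uncovered_causes(incidents: List[Dict[str, str]], runbook_branches: Set[str],
                     min_count: int = 2) -> List[Tuple[str, int]]:
    """Return recurring root causes that the runbook has no branch for."""
    counts = Counter(i["root_cause"] for i in incidents
                     if i["root_cause"] not in runbook_branches)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Made-up incident history for illustration only.
incidents = [
    {"root_cause": "liveness probe misconfigured"},
    {"root_cause": "oom"},
    {"root_cause": "liveness probe misconfigured"},
    {"root_cause": "expired certificate"},
]
print(uncovered_causes(incidents, runbook_branches={"oom"}))
# [('liveness probe misconfigured', 2)]
```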
This maps directly to the two patterns from the previous post. You start fully agentic: the agent reads broadly, surfaces what it finds, and a human decides. As confidence builds in the agent’s reasoning, you enter the agent-accelerated phase — the most reliable checks get hardcoded as structured pre-flight steps, and the agent focuses on the reasoning layer that genuinely can’t be scripted. The goal, as always, is to understand the problem well enough to know exactly where you need the agent — and where you don’t.
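As a rough sketch of the agent-accelerated phase, the scripted checks can run first and hand their findings to the agent as structured context. This is one possible reading of "pre-flight steps". The check names, alert fields, and the `stub_agent` stand-in are hypothetical, and a real agent call is out of scope here.

```python
from typing import Callable, Dict, Optional

# A pre-flight check inspects the alert and returns a finding, or None.
Check = Callable[[dict], Optional[str]]


def oom_killed(alert: dict) -> Optional[str]:
    if alert.get("termination_reason") == "OOMKilled":
        return "container was OOM-killed"
    return None


def recent_release(alert: dict) -> Optional[str]:
    minutes = alert.get("minutes_since_release")
    if minutes is not None and minutes <= 30:
        return f"release deployed {minutes} minutes before the alert"
    return None


def run_preflight(alert: dict, checks: Dict[str, Check]) -> Dict[str, str]:
    """Run every scripted check and keep only the ones that found something."""
    findings = {}
    for name, check in checks.items():
        result = check(alert)
        if result is not None:
            findings[name] = result
    return findings


def triage(alert: dict, checks: Dict[str, Check],
           agent: Callable[[dict, Dict[str, str]], str]) -> str:
    """Scripted checks first; the agent reasons over the alert plus findings."""
    return agent(alert, run_preflight(alert, checks))


def stub_agent(alert: dict, findings: Dict[str, str]) -> str:
    # Stand-in for a real model call.
    return f"{len(findings)} pre-flight finding(s) passed to the agent"


checks = {"oom": oom_killed, "release": recent_release}
alert = {"termination_reason": "Error", "minutes_since_release": 10}
print(triage(alert, checks, stub_agent))
# 1 pre-flight finding(s) passed to the agent
```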