Runbook Template
Copy this when writing a new runbook. Keep every runbook <2 pages. If longer, split.
# <Runbook title — one line, matches alert name if applicable>
**Trigger:** <exact alert name / symptom / regex>
**Severity:** page | ticket | info
**Last tested:** YYYY-MM-DD (and who)
**Owner:** <team / person>
## Symptom
One paragraph. What the on-call sees. Include exact error strings.
## Impact
Who is affected? Business impact? (e.g. "All counselors cannot log in — admin dashboard dark")
## Diagnose (copy-paste commands, in order)
```bash
# 1. First check
curl -fsS http://...
# 2. Next
docker compose logs --tail=200 zitadel | grep ERROR
```
## Decide
- If `<condition A>` → go to Remediation A
- If `<condition B>` → go to Remediation B
- Else → escalate, see §Escalation
## Remediate A:
Expected duration:
## Remediate B:
...
## Verify
## Escalation
- Slack: #zitadel-oncall
- Page: via PagerDuty
- Upstream: open issue at github.com/zitadel/zitadel with logs
## Post-incident
- Log in `docs/ops/zitadel/incident-log.md`
- If hit 2+ times in a quarter: add to known-issues and fix root cause
## References
- Related runbooks: `07-troubleshooting.md#<anchor>`
- Zitadel docs:
- Past incidents:
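The four header fields at the top of the template (Trigger, Severity, Last tested, Owner) are easy to forget when copying. A minimal sketch of a check for them follows. `check_runbook_header` is a hypothetical helper name, not part of any existing tooling, and it assumes each field sits on its own line in exactly the bold form the template uses.

```shell
#!/usr/bin/env bash
# Sketch: verify that a runbook carries the template's four header fields.
# check_runbook_header is a hypothetical helper; adapt it to your repo layout.

check_runbook_header() {
  local file="$1" failed=0 field

  # Each field must appear at the start of a line as **Field:**
  for field in Trigger Severity "Last tested" Owner; do
    if ! grep -q "^\*\*${field}:\*\*" "$file"; then
      echo "missing field: ${field}"
      failed=1
    fi
  done

  # Severity must be exactly one of the three values the template allows.
  if ! grep -Eq '^\*\*Severity:\*\* (page|ticket|info)$' "$file"; then
    echo "severity must be page, ticket, or info"
    failed=1
  fi

  # Last tested must begin with a YYYY-MM-DD date (format check only,
  # it does not verify that the date is a real calendar day).
  if ! grep -Eq '^\*\*Last tested:\*\* [0-9]{4}-[0-9]{2}-[0-9]{2}' "$file"; then
    echo "last tested must start with YYYY-MM-DD"
    failed=1
  fi

  return "$failed"
}
```

Source the script and call `check_runbook_header <file>` on a runbook. It prints one line per problem and returns non-zero if any field is missing or malformed, so it could run in CI or a pre-commit hook.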
Writing rules
- Imperative voice only. "Run X." not "You might want to run X."
- Exact commands, no placeholders. `docker compose logs zitadel`, not `docker logs <container>`.
- Expected output for every command. "Returns `200 ok`", "Prints 1 row".
- Time estimates on every remediation step.
- Test the runbook by having someone who didn't write it follow it verbatim. Update where they get stuck.
- One runbook per alert. Link from the Prometheus `annotations.runbook` field.
Filing conventions
- File name: `<NN>-<kebab-topic>.md`, numeric prefix preserves reading order.
- Anchor headings match alert names verbatim for quick grep.
- Keep a running `incident-log.md`, `drill-log.md`, `cve-log.md`, `break-glass-log.md`, `role-change-log.md`, `mfa-reset-log.md` alongside. The audit trail is cheap and saves hours during incidents.
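The file-name convention can be enforced mechanically. The sketch below assumes `NN` means exactly two digits (as in `07-troubleshooting.md`) and that the topic is lowercase letters and digits separated by single hyphens. `is_runbook_name` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: validate a runbook file name against <NN>-<kebab-topic>.md.
# Assumption: NN is exactly two digits; the topic is lowercase kebab-case.

is_runbook_name() {
  # Force the C locale so [a-z] matches only ASCII lowercase letters.
  local LC_ALL=C
  local name re
  name="$(basename "$1")"   # ignore any leading directory path
  re='^[0-9]{2}-[a-z0-9]+(-[a-z0-9]+)*\.md$'
  [[ "$name" =~ $re ]]
}
```

For example, `is_runbook_name 07-troubleshooting.md` succeeds and `is_runbook_name 7-troubleshooting.md` fails. The log files listed above carry no numeric prefix, so a check like this would need to skip them.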