Runbook Template

Copy this when writing a new runbook. Keep every runbook under 2 pages. If it runs longer, split it.

# <Runbook title — one line, matches alert name if applicable>

**Trigger:** <exact alert name / symptom / regex>
**Severity:** page | ticket | info
**Last tested:** YYYY-MM-DD (and who)
**Owner:** <team / person>

## Symptom
One paragraph. What the on-call sees. Include exact error strings.

## Impact
Who is affected? Business impact? (e.g. "All counselors cannot log in — admin dashboard dark")

## Diagnose (copy-paste commands, in order)
```bash
# 1. First check
curl -fsS http://...

# 2. Next
docker compose logs --tail=200 zitadel | grep ERROR
```

## Decide
- If <condition A> → go to Remediation A
- If <condition B> → go to Remediation B
- Else → escalate, see §Escalation

## Remediate A
```bash
<exact commands>
```
Expected duration: <time>. Expected outcome: <result>.

## Remediate B
...

## Verify
```bash
<smoke test commands>
```

## Escalation
- Slack: #zitadel-oncall
- Page: via PagerDuty
- Upstream: open issue at github.com/zitadel/zitadel with logs

## Post-incident
- Log in docs/ops/zitadel/incident-log.md
- If hit 2+ times in a quarter: add to known-issues and fix root cause

## References
- Related runbooks: 07-troubleshooting.md#<anchor>
- Zitadel docs: <link>
- Past incidents: <link>

Writing rules

  • Imperative voice only. "Run X." not "You might want to run X."
  • Exact commands, no placeholders. docker compose logs zitadel, not docker logs <container>.
  • Expected output for every command. "Returns 200 ok", "Prints 1 row".
  • Time estimates on every remediation step.
  • Test the runbook by having someone who didn't write it follow it verbatim. Update where they get stuck.
  • One runbook per alert. Link from the Prometheus annotations.runbook field.
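
Some of these rules can be checked mechanically before review. The following is a minimal sketch, not an established tool: it assumes the four header fields are written exactly as in the template above (for example `**Owner:**`), and the function name `missing_fields` is hypothetical.

```bash
# Sketch: list template header fields that are missing from a runbook file.
# Assumption: fields appear exactly as in the template, e.g. "**Owner:**".
missing_fields() {
  rc=0
  for field in 'Trigger' 'Severity' 'Last tested' 'Owner'; do
    # -F treats the pattern as a literal string, so "**" is not a regex.
    if ! grep -qF -- "**${field}:**" "$1"; then
      echo "missing: ${field}"
      rc=1
    fi
  done
  # Non-zero exit status when at least one field is absent.
  return "$rc"
}
```

Calling `missing_fields path/to/runbook.md` prints one line per absent field and returns a non-zero status, so it can gate a pre-commit hook or CI step. It only checks that the fields exist, not that their values are meaningful.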

Filing conventions

  • File name: <NN>-<kebab-topic>.md. The numeric prefix preserves reading order.
  • Anchor headings match alert names verbatim for quick grep.
  • Keep running logs alongside the runbooks: incident-log.md, drill-log.md, cve-log.md, break-glass-log.md, role-change-log.md, mfa-reset-log.md. The audit trail is cheap and saves hours during incidents.
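
The file-name convention is easy to enforce with a small check. The sketch below assumes that `NN` means exactly two digits (as in 07-troubleshooting.md) and that the topic is lowercase kebab-case. Both are readings of the convention rather than stated rules, and the function name `is_runbook_name` is hypothetical.

```bash
# Sketch: succeed if a path's file name looks like <NN>-<kebab-topic>.md.
# Assumptions: "NN" is exactly two digits; the topic is lowercase kebab-case.
is_runbook_name() {
  # Strip any directory part, then match the whole file name.
  printf '%s\n' "${1##*/}" |
    grep -Eq '^[0-9]{2}-[a-z0-9]+(-[a-z0-9]+)*\.md$'
}

# Usage (prints names that do not follow the convention):
#   for f in *.md; do is_runbook_name "$f" || echo "rename: $f"; done
```

The log files listed above (incident-log.md and the others) carry no numeric prefix, so exclude them when running the loop over a whole directory.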