Upgrade Playbook

Zitadel is stateless; all state lives in Postgres, with the event store table (eventstore.events2 on current releases) as the source of truth. An upgrade has two phases: setup (runs migrations, idempotent) and start (serves traffic). Typical downtime: 30s–10min, depending on projection rebuilds.

Pre-upgrade checklist (15 min, do NOT skip)

Text Only

[ ] Read release notes for EVERY version between current and target.
    https://github.com/zitadel/zitadel/releases
[ ] Read technical advisories: https://zitadel.com/docs/support/technical_advisory
[ ] Current version recorded in 00-index.md
[ ] DB backup taken within last hour  (scripts/zitadel-backup.sh)
[ ] Master key backup verified readable  (age -d < /mnt/nas/.../masterkey.age | wc -c → 32)
[ ] Disk free > 20% on DB volume (df -h /var/lib/postgresql)
[ ] Staging (if exists) upgraded 24h+ ago, still healthy
[ ] Maintenance window announced to users (Slack + status page)
[ ] On-call has this runbook open
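
The mechanical items in the checklist above (disk headroom, backup age, master key length) can be scripted so they are harder to skip. This is a minimal sketch, assuming Bash with standard df, find, and wc. The function names are hypothetical and are not part of scripts/zitadel-backup.sh.

```shell
#!/usr/bin/env bash
# Sketch only: helper names below are hypothetical, not existing tooling.

# Print the free-space percentage (integer) of the filesystem holding PATH.
disk_free_pct() {
  local used
  # df -P prints POSIX columns; column 5 is the used percentage, e.g. "37%".
  used=$(df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  [ -n "$used" ] || return 1
  echo $(( 100 - used ))
}

# Succeed if FILE exists and was modified less than MAX_AGE_MIN minutes ago.
backup_is_fresh() {
  local file=$1 max_age_min=${2:-60}
  [ -f "$file" ] && [ -n "$(find "$file" -mmin "-$max_age_min")" ]
}

# Succeed if stdin is exactly 32 bytes (the master key length the checklist expects).
is_32_bytes() {
  [ "$(wc -c)" -eq 32 ]
}

# Example usage (paths are taken from the checklist; the backup path is a placeholder):
#   [ "$(disk_free_pct /var/lib/postgresql)" -gt 20 ] || echo "FAIL: disk free <= 20%"
#   backup_is_fresh "$BACKUP_FILE" 60                  || echo "FAIL: backup older than 1h"
#   age -d < "$MASTERKEY_AGE_FILE" | is_32_bytes       || echo "FAIL: master key not 32 bytes"
```

The checks only report; the decision to proceed stays with whoever is running the upgrade.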

Upgrade procedure (docker compose)

Bash

cd /srv/zitadel

# 1. Snapshot
./scripts/zitadel-backup.sh           # creates /var/backups/zitadel/pre-upgrade-$(date +%FT%H%M).sql.gz
docker compose ps > /tmp/before.txt

# 2. Pin the new version in compose, never use :latest
# image: ghcr.io/zitadel/zitadel:v4.X.Y   <-- edit to target tag
$EDITOR docker-compose.yml

# 3. Pull before stopping
docker compose pull zitadel

# 4. Stop runtime, keep DB up
docker compose stop zitadel

# 5. Run setup (migrations). Watch for errors.
docker compose run --rm zitadel setup \
    --masterkeyFile /run/secrets/masterkey \
    --tlsMode external
# Expected: "projection.xxx migrated" lines, exit 0. Duration: seconds to minutes.

# 6. Start runtime
docker compose up -d zitadel

# 7. Tail logs for 2 min
docker compose logs -f --tail=200 zitadel
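
Steps 3 to 6 can be chained so that the runtime is only started when setup exits 0. The following is a sketch, not a replacement for watching the output by hand. upgrade_zitadel is a hypothetical helper, and the setup flags are copied unchanged from step 5.

```shell
#!/usr/bin/env bash
# Sketch only: upgrade_zitadel is a hypothetical wrapper around steps 3-6.
# Assumes the image tag in docker-compose.yml has already been edited (step 2)
# and the backup from step 1 exists.
upgrade_zitadel() {
  docker compose pull zitadel || return 1
  docker compose stop zitadel || return 1
  if ! docker compose run --rm zitadel setup \
        --masterkeyFile /run/secrets/masterkey \
        --tlsMode external; then
    # Leave the runtime stopped so nothing serves traffic on a half-migrated DB.
    echo "setup failed: runtime left stopped, go to the Rollback section" >&2
    return 1
  fi
  docker compose up -d zitadel
}

# Example usage:
#   cd /srv/zitadel && upgrade_zitadel && docker compose logs -f --tail=200 zitadel
```

Stopping at a failed setup keeps the decision explicit: either fix forward and re-run setup (it is idempotent), or roll back.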

Smoke tests (do ALL, in order)

Bash

# Liveness + readiness
curl -fsS http://127.0.0.1:8080/debug/healthz    # expect 200 "ok"
curl -fsS http://127.0.0.1:8080/debug/ready      # expect 200

# OIDC discovery returns the expected issuer
curl -fsS https://auth.huph.val.id/.well-known/openid-configuration | jq .issuer
# expect "https://auth.huph.val.id"

# JWKS endpoint serves at least one signing key
curl -fsS https://auth.huph.val.id/oauth/v2/keys | jq '.keys | length'  # expect >= 1

# Admin console loads
curl -fsS -o /dev/null -w '%{http_code}\n' https://auth.huph.val.id/ui/console  # 200 or 302

# Human smoke (2 min): login as break-glass admin, open Users, open one user, logout.
# Click a "Login with Zitadel" button on admin.huph.val.id → complete flow.

# Verify HUPH API token validation still works
curl -fsS https://admin.huph.val.id/api/v1/dashboard/counselor/me -H "Cookie: $TEST_SESSION"

If any smoke test fails → rollback.
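
Because downtime depends on projection rebuilds, the readiness endpoint may not answer 200 immediately after start. Polling it before running the remaining smoke tests avoids treating a slow start as a failure. This is a minimal sketch. wait_ready and its defaults (60 attempts, 5 seconds apart) are assumptions, not Zitadel tooling.

```shell
#!/usr/bin/env bash
# Sketch only: wait_ready is a hypothetical helper.
# Polls URL until curl gets a non-error HTTP status, or gives up after TRIES attempts.
wait_ready() {
  local url=$1 tries=${2:-60} delay=${3:-5} i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP >= 400; -s keeps failed attempts quiet.
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $tries attempts: $url" >&2
  return 1
}

# Example usage: wait up to 10 minutes, then continue with the smoke tests.
#   wait_ready http://127.0.0.1:8080/debug/ready 120 5 || echo "start rollback"
```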

Rollback

Bash

# 1. Stop runtime
docker compose stop zitadel

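# 2. Restore DB to pre-upgrade snapshot (newest one only; a bare glob would
#    concatenate every snapshot in the directory into a single restore)
SNAP=$(ls -1t /var/backups/zitadel/pre-upgrade-*.sql.gz | head -n 1)
echo "Restoring from: $SNAP"
docker exec -i zitadel-postgres psql -U postgres -c 'DROP DATABASE zitadel;'
docker exec -i zitadel-postgres psql -U postgres -c 'CREATE DATABASE zitadel OWNER zitadel;'
gunzip -c "$SNAP" | \
    docker exec -i zitadel-postgres psql -v ON_ERROR_STOP=1 -U zitadel -d zitadel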

# 3. Revert image tag in docker-compose.yml to previous version
$EDITOR docker-compose.yml

# 4. Boot previous version (no setup needed; DB already at old schema)
docker compose up -d zitadel

# 5. Re-run smoke tests. Open incident report.

Why restore the DB instead of just reverting the image? The setup phase has already migrated the event store schema. Running the old binary against the new schema can crash or corrupt data. Always pair an image revert with a DB restore.
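
Step 2 of the rollback drops the database before the restore begins, so it is worth confirming that the snapshot is present and readable first. The following is a sketch under the assumption that snapshots follow the pre-upgrade-*.sql.gz naming used above. latest_snapshot and verify_snapshot are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical.

# Print the newest pre-upgrade snapshot in DIR (by modification time).
latest_snapshot() {
  ls -1t "$1"/pre-upgrade-*.sql.gz 2>/dev/null | head -n 1
}

# Succeed only if the snapshot is a non-empty file that passes gzip's integrity test.
verify_snapshot() {
  local snap=$1
  [ -s "$snap" ] || { echo "missing or empty: $snap" >&2; return 1; }
  gzip -t "$snap" 2>/dev/null || { echo "corrupt gzip: $snap" >&2; return 1; }
}

# Example usage, before running DROP DATABASE:
#   snap=$(latest_snapshot /var/backups/zitadel)
#   verify_snapshot "$snap" || echo "do NOT drop the database"
```

gzip -t only proves the archive is intact. It does not prove the SQL inside restores cleanly, so a periodic test restore on staging is still worthwhile.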

Major version upgrades (v4 → v5)

v5 ships Login V2 as a separate component, removes V1 Actions, and may change OIDC API shapes. Extra care:

Text Only

[ ] Staging upgrade 1-2 weeks before prod
[ ] Verify all OIDC clients (admin, counselor app) still work
[ ] If using V1 Actions: rewrite before upgrade, NOT during
[ ] Re-test token validation in apps/api (JWKS endpoint, issuer string)
[ ] Plan for Login V2 routing: extra container, extra health check
[ ] Allocate 2-hour window minimum, not 30 min
[ ] Have migration downgrade path reviewed with Zitadel community/support

Staging environment

For 50 users: yes, stand up a minimal staging environment. Cost: one extra docker-compose stack on a dev VM, with its own DB and its own masterkey. It pays for itself the first time a setup migration fails there instead of in prod (which will happen). Run upgrades there 24-48h before prod.

Minimal compose: the same file as prod with a different ExternalDomain, 1GB RAM for Zitadel, and Postgres running in the same compose stack. Total cost: ~2GB RAM on any dev box.