Backup & Disaster Recovery
RPO 24h, RTO 2h. At 50 users / <1GB DB this is easy if the discipline holds.
What to back up
- Postgres zitadel database: all state (events, projections, config).
- /etc/zitadel/masterkey: without this, the DB is unreadable.
- /srv/zitadel/docker-compose.yml + any YAML config: for reconstruction.
- This ops directory: the runbooks themselves (already in git).
The masterkey must travel on a separate schedule and separate destination from the DB (see 01-master-key.md §2).
Backup script (cron, hourly pg_dump, daily offsite)
/srv/zitadel/scripts/zitadel-backup.sh:
Bash
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%FT%H%M)
DST=/var/backups/zitadel
mkdir -p "$DST"
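# Dump the zitadel database in custom format (restorable with pg_restore)
docker exec zitadel-postgres pg_dump \
  -U zitadel \
  -d zitadel \
  --format=custom \
  --no-owner \
  --no-privileges \
  --compress=9 \
  > "$DST/zitadel-$TS.dump"
# Integrity: pg_restore --list exits non-zero if the archive header or table of
# contents is unreadable. It does not read every data block, so a dump truncated
# later in the file can still pass; the quarterly restore drill is the real test.
pg_restore --list "$DST/zitadel-$TS.dump" > /dev/null
# Symlink latest
ln -sfn "zitadel-$TS.dump" "$DST/latest.dump"
# Prune: delete untagged dumps older than 7 days.
# Weekly (-W) and monthly (-M) copies, created by the cron entries below, are
# skipped here, so the 4 weekly + 12 monthly limits need separate pruning.
find "$DST" -name 'zitadel-*.dump' -mtime +7 -not -name '*-W*' -not -name '*-M*' -delete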
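
A backup job that fails silently is only noticed at restore time, so a freshness check on latest.dump can be useful. The following is a minimal sketch, not part of the original script: the function name and the 2h threshold in the usage line are assumptions, and it relies on GNU stat.

```shell
# Sketch: succeed only if the file was modified within the last max_age seconds.
dump_is_fresh() {
  local file=$1 max_age=$2 now mtime
  [ -e "$file" ] || return 1
  now=$(date +%s)
  # GNU stat; -L follows the latest.dump symlink to the real dump file
  mtime=$(stat -L -c %Y "$file")
  [ $(( now - mtime )) -le "$max_age" ]
}

# Usage (threshold is an assumption): warn when the newest dump is over 2h old.
# dump_is_fresh /var/backups/zitadel/latest.dump 7200 || echo "zitadel backup is stale" >&2
```

Run from cron or a monitoring agent, this turns a missed hourly dump into an alert well inside the 24h RPO.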
Cron (/etc/cron.d/zitadel-backup):
Text Only
# Hourly local dump
15 * * * * root /srv/zitadel/scripts/zitadel-backup.sh >> /var/log/zitadel-backup.log 2>&1
# Nightly push to offsite (DB only; key goes separately).
# latest.dump is a symlink, so --copy-links tells rclone to follow it.
30 2 * * * root rclone copy --copy-links /var/backups/zitadel/latest.dump offsite:zitadel-db/$(date +\%F)/ --log-file=/var/log/zitadel-rclone.log
# Weekly tag (Sundays)
45 2 * * 0 root cp /var/backups/zitadel/latest.dump /var/backups/zitadel/zitadel-W$(date +\%Y\%U).dump
# Monthly tag (1st of month)
50 2 1 * * root cp /var/backups/zitadel/latest.dump /var/backups/zitadel/zitadel-M$(date +\%Y\%m).dump
# Masterkey re-verification (reads offsite copy, confirms 32 bytes)
0 3 * * 1 root /srv/zitadel/scripts/verify-masterkey-backup.sh
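
The contents of verify-masterkey-backup.sh are not shown in this runbook. As a rough sketch of the "confirms 32 bytes" step only, a length check on standard input could look like the following; the function name and the rclone remote in the usage line are hypothetical.

```shell
# Sketch: succeed only if standard input is exactly 32 bytes long.
# Reading from a pipe avoids writing key material to a temp file.
stdin_is_32_bytes() {
  local size
  size=$(wc -c)
  [ "$size" -eq 32 ]
}

# Usage (hypothetical remote and path for the offsite key copy):
# rclone cat keysafe:zitadel/masterkey | stdin_is_32_bytes || echo "masterkey backup failed length check" >&2
```

The check is strict: a copy saved with a trailing newline is 33 bytes and fails.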
Retention
- Hourly: 24 copies (24h, local only).
- Daily: 7 (local + offsite).
- Weekly: 4 (local + offsite).
- Monthly: 12 (offsite only, archived quarterly).
Total offsite footprint at HUPH scale: <5GB for a full year.
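
The find command in the backup script only removes untagged dumps, so the weekly and monthly counts above are not enforced by it. One way to hold those counts locally is a keep-newest-N prune. This is a sketch under assumptions: the function name is made up, it needs GNU find, and it assumes file names without newlines.

```shell
# Sketch: in directory dir, keep the newest `keep` files matching the glob
# `pattern` (by modification time) and delete the rest.
prune_keep_newest() {
  local dir=$1 pattern=$2 keep=$3
  find "$dir" -maxdepth 1 -type f -name "$pattern" -printf '%T@ %p\n' \
    | sort -rn \
    | tail -n +"$(( keep + 1 ))" \
    | cut -d' ' -f2- \
    | while IFS= read -r f; do rm -f -- "$f"; done
}

# Usage, matching the retention counts above:
# prune_keep_newest /var/backups/zitadel 'zitadel-W*.dump' 4
# prune_keep_newest /var/backups/zitadel 'zitadel-M*.dump' 12
```

Counting files rather than using -mtime keeps the last good copies even if tagging stops for a while.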
Restore drill (quarterly, paired with master-key drill)
Bash
# On a scratch host / VM:
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=x postgres:16
# Wait until Postgres accepts TCP connections (the init-time server listens on the socket only)
until docker exec pg-restore-test pg_isready -h 127.0.0.1 -U postgres; do sleep 1; done
docker exec pg-restore-test createdb -U postgres zitadel
docker exec -i pg-restore-test \
  pg_restore -U postgres -d zitadel --no-owner \
  < /var/backups/zitadel/latest.dump
# Count rows in eventstore.events: should be close to prod
docker exec pg-restore-test psql -U postgres -d zitadel \
  -c 'select count(*) from eventstore.events;'
# Boot scratch Zitadel against it + scratch masterkey copy, hit /debug/ready.
# Log result in drill-log.md.
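
"Close to prod" in the row-count step can be made a pass/fail check. The sketch below compares two counts with integer arithmetic; the function name and the 5% tolerance in the usage lines are assumptions, not documented thresholds.

```shell
# Sketch: succeed if the restored count is within pct percent of the prod count.
counts_close() {
  local prod=$1 restored=$2 pct=$3 diff
  diff=$(( prod - restored ))
  [ "$diff" -ge 0 ] || diff=$(( 0 - diff ))
  # Integer form of diff/prod <= pct/100
  [ $(( diff * 100 )) -le $(( prod * pct )) ]
}

# Usage (psql -At prints the bare number; 5% is an assumed tolerance):
# prod=$(docker exec zitadel-postgres psql -U zitadel -d zitadel -At -c 'select count(*) from eventstore.events;')
# test_count=$(docker exec pg-restore-test psql -U postgres -d zitadel -At -c 'select count(*) from eventstore.events;')
# counts_close "$prod" "$test_count" 5 && echo "row count OK"
```

Some gap is expected, since prod keeps writing events after the dump was taken.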
Disaster scenarios
| Scenario | Recovery | RTO |
|---|---|---|
| Host dead (disk/VM loss) | New host → docker compose → restore latest dump + mount masterkey copy → DNS swap | 1-2h |
| DB corrupted (disk error) | Stop zitadel → pg_restore from last good dump → restart → smoke test | 30min |
| Master key lost, DB intact | Unrecoverable. Accept data loss. Stand up fresh instance, re-provision users/apps. | 4-8h |
| Master key lost + DB lost | Stand up fresh, hand-recreate users. See break-glass. | 1 day |
| Ransomware on backup destination | Restore from the weekly/monthly copies held offsite under separate credentials. Rotate all creds. | 4h |
Break-glass admin
Purpose: get back in when OIDC itself is broken or no admin can log in.
- Create a dedicated local user break-glass@huph.val.id in Zitadel with IAM_OWNER role.
- Password: 32 random chars, printed, sealed in tamper-evident envelope, stored in physical ops safe.
- Do NOT give this account MFA via the same Zitadel instance (circular dependency). Use a TOTP seed printed on a second paper slip in the same envelope — or accept password-only and rely on envelope seal as the second factor.
- Rotate every 180 days or immediately after use. Log rotation in docs/ops/zitadel/break-glass-log.md.
- Usage is a reportable event: post in #zitadel-ops within 1 hour.
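
For the 32-random-character password, one possible way to generate it from the kernel's random source is sketched below. The function name is made up, and the alphanumeric-only alphabet is an assumption chosen so the printed copy is easy to type back in.

```shell
# Sketch: print one 32-character alphanumeric password drawn from /dev/urandom.
gen_break_glass_password() {
  local out=""
  # Each pass filters 64 random bytes down to [A-Za-z0-9]; loop until 32 chars are collected
  while [ "${#out}" -lt 32 ]; do
    out="$out$(head -c 64 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')"
  done
  printf '%s\n' "$out" | cut -c1-32
}

# Usage: print once for the envelope; avoid redirecting it to a file.
# gen_break_glass_password
```

At 62 possible characters per position, 32 characters give roughly 190 bits of entropy.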