Backup & Disaster Recovery

Targets: RPO 24h, RTO 2h. At 50 users and a <1GB DB, both are easy to meet if the backup discipline holds.

What to back up

  1. Postgres zitadel database — all state (events, projections, config).
  2. /etc/zitadel/masterkey — without this, the DB is unreadable.
  3. /srv/zitadel/docker-compose.yml + any YAML config — for reconstruction.
  4. This ops directory — runbooks themselves (already in git).

The masterkey must be backed up on a separate schedule and to a separate destination from the DB (see 01-master-key.md §2).

Backup script (cron, hourly pg_dump, daily offsite)

/srv/zitadel/scripts/zitadel-backup.sh:

Bash
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%FT%H%M)
DST=/var/backups/zitadel
mkdir -p "$DST"

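# Dump to a temp name so a failed run never leaves a torn file under the final name
TMP="$DST/.zitadel-$TS.dump.partial"
trap 'rm -f "$TMP"' EXIT
docker exec zitadel-postgres pg_dump \
    -U zitadel \
    -d zitadel \
    --format=custom \
    --no-owner \
    --no-privileges \
    --compress=9 \
  > "$TMP"

# Integrity: pg_restore --list exits non-zero on a torn dump.
# Run it inside the container so the host needs no Postgres client tools.
docker exec -i zitadel-postgres pg_restore --list < "$TMP" > /dev/null
mv "$TMP" "$DST/zitadel-$TS.dump"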

# Symlink latest
ln -sfn "zitadel-$TS.dump" "$DST/latest.dump"

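# Prune: roughly 7 days of untagged dumps + 4 weekly + 12 monthly tagged copies
find "$DST" -name 'zitadel-*.dump' -mtime +7 -not -name '*-W*' -not -name '*-M*' -delete
find "$DST" -name 'zitadel-W*.dump' -mtime +28  -delete
find "$DST" -name 'zitadel-M*.dump' -mtime +365 -delete
# (weekly/monthly tagged copies are created by the cron entries below)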

Cron (/etc/cron.d/zitadel-backup):

Text Only
# Hourly local dump
15 * * * *  root  /srv/zitadel/scripts/zitadel-backup.sh >> /var/log/zitadel-backup.log 2>&1

# Nightly push to offsite (DB only; the key goes separately)
# latest.dump is a symlink, so --copy-links is needed for rclone to follow it
30 2 * * *  root  rclone copy --copy-links /var/backups/zitadel/latest.dump offsite:zitadel-db/$(date +\%F)/ --log-file=/var/log/zitadel-rclone.log

# Weekly tag (Sundays)
45 2 * * 0  root  cp /var/backups/zitadel/latest.dump /var/backups/zitadel/zitadel-W$(date +\%Y\%U).dump

# Monthly tag (1st of month)
50 2 1 * *  root  cp /var/backups/zitadel/latest.dump /var/backups/zitadel/zitadel-M$(date +\%Y\%m).dump

# Masterkey re-verification (reads offsite copy, confirms 32 bytes)
0 3 * * 1   root  /srv/zitadel/scripts/verify-masterkey-backup.sh
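
verify-masterkey-backup.sh is referenced by the cron entry above but not shown here. The following is a minimal sketch of what such a check might look like, not the actual script: it streams the offsite copy through a byte count and fails unless the result is exactly 32 bytes. `KEY_REMOTE` is a hypothetical variable holding the rclone path of the key copy (the real destination is defined in 01-master-key.md), and the function names are illustrative.

```bash
#!/usr/bin/env bash
set -eu

# Succeeds only if the data on stdin is exactly 32 bytes long.
check_key_length() {
  n=$(wc -c | tr -d ' ')
  [ "$n" -eq 32 ]
}

# Stream the offsite copy through the check without writing it to disk.
# KEY_REMOTE is a hypothetical rclone path, e.g. "<remote>:<path>/masterkey".
verify_offsite_key() {
  rclone cat "$KEY_REMOTE" | check_key_length
}

# Only contact the remote when KEY_REMOTE is set in the environment.
if [ -n "${KEY_REMOTE:-}" ]; then
  verify_offsite_key || { echo "masterkey backup check FAILED" >&2; exit 1; }
  echo "masterkey backup OK (32 bytes)"
fi
```

A failed download produces zero bytes on the pipe, so it fails the length check as well. The check is strict: a copy saved with a trailing newline is 33 bytes and will be rejected.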

Retention

  • Hourly: 24 copies (24h, local only).
  • Daily: 7 (local + offsite).
  • Weekly: 4 (local + offsite).
  • Monthly: 12 (offsite only, archived quarterly).

Total offsite footprint at HUPH scale: <5GB for a full year.
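
The footprint claim can be sanity-checked with a little arithmetic. This sketch assumes the offsite destination is pruned to exactly the daily, weekly and monthly counts listed above, and works out how large the average dump may be before the 5GB figure is exceeded.

```bash
#!/usr/bin/env bash
set -eu

# Offsite copies held at any one time under the retention list above
daily=7
weekly=4
monthly=12
copies=$(( daily + weekly + monthly ))

# 5 GB budget expressed in MB, divided across all retained copies
budget_mb=$(( 5 * 1024 ))
max_avg_mb=$(( budget_mb / copies ))

echo "$copies copies; average dump must stay under ${max_avg_mb} MB"
```

If `ls -lh /var/backups/zitadel/latest.dump` shows a size approaching that bound, the footprint estimate needs revisiting.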

Restore drill (quarterly, paired with master-key drill)

Bash
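# On a scratch host / VM (match the image tag to prod's Postgres major version):
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=x postgres:16
# Wait until the server accepts TCP connections; createdb fails if it runs too early
until docker exec pg-restore-test pg_isready -h 127.0.0.1 -U postgres -q; do sleep 1; done
docker exec pg-restore-test createdb -U postgres zitadel
docker exec -i pg-restore-test \
    pg_restore -U postgres -d zitadel --no-owner < /var/backups/zitadel/latest.dump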

# Count rows in eventstore.events — should be close to prod
docker exec pg-restore-test psql -U postgres -d zitadel \
    -c 'select count(*) from eventstore.events;'

# Boot scratch Zitadel against it + scratch masterkey copy, hit /debug/ready.
# Log result in drill-log.md.
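
"Close to prod" in the row-count step can be made into a pass/fail check. The helper below is a sketch only: the name `counts_close` and the 5% default tolerance are assumptions, not values from this runbook. Both counts can be captured with `psql -At -c 'select count(*) from eventstore.events;'` against prod and against the scratch container.

```bash
#!/usr/bin/env bash
set -eu

# counts_close PROD SCRATCH [PCT]
# Succeeds when SCRATCH is within PCT percent of PROD (integer math).
# PCT defaults to 5, which is an illustrative tolerance, not a documented one.
counts_close() {
  prod=$1
  scratch=$2
  pct=${3:-5}
  diff=$(( prod - scratch ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # diff / prod <= pct / 100, rearranged to avoid fractions
  [ $(( diff * 100 )) -le $(( prod * pct )) ]
}
```

Because the dump can be up to an hour older than prod, the scratch count is normally slightly lower. A large gap suggests a truncated or stale dump and belongs in drill-log.md.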

Disaster scenarios

| Scenario | Recovery | RTO |
|---|---|---|
| Host dead (disk/VM loss) | New host → docker compose → restore latest dump + mount masterkey copy → DNS swap | 1-2h |
| DB corrupted (disk error) | Stop zitadel → pg_restore from last good dump → restart → smoke | 30min |
| Master key lost, DB intact | Unrecoverable. Accept data loss. Stand up fresh instance, re-provision users/apps. | 4-8h |
| Master key lost + DB lost | Stand up fresh, hand-recreate users. See break-glass. | 1 day |
| Ransomware on backup destination | Restore from weekly/monthly on separate credential offsite. Rotate all creds. | 4h |
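
The "DB corrupted" row can be sketched as a script. This is an illustration under several assumptions, not a tested procedure: the compose service is assumed to be named `zitadel`, `READY_URL` is a hypothetical local address for the readiness endpoint, and `DUMP` should point at the last dump known to be good, which may not be latest.dump. It defaults to dry-run mode and only prints the commands.

```bash
#!/usr/bin/env bash
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them
COMPOSE=/srv/zitadel/docker-compose.yml
DUMP="${DUMP:-/var/backups/zitadel/latest.dump}"
READY_URL="${READY_URL:-http://localhost:8080/debug/ready}"   # hypothetical address

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_db() {
  # Service name "zitadel" is an assumption about docker-compose.yml
  run docker compose -f "$COMPOSE" stop zitadel
  # --clean --if-exists drops existing objects before recreating them
  run sh -c "docker exec -i zitadel-postgres pg_restore -U zitadel -d zitadel --clean --if-exists --no-owner < '$DUMP'"
  run docker compose -f "$COMPOSE" start zitadel
  # Smoke test: readiness endpoint must answer with a 2xx status
  run curl -fsS "$READY_URL"
}

restore_db
```

Running it once with the default `DRY_RUN=1` and reading the four printed commands before setting `DRY_RUN=0` keeps a tired operator from restoring the wrong dump.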

Break-glass admin

Purpose: get back in when OIDC itself is broken or no admin can log in.

  • Create a dedicated local user break-glass@huph.val.id in Zitadel with IAM_OWNER role.
  • Password: 32 random chars, printed, sealed in tamper-evident envelope, stored in physical ops safe.
  • Do NOT give this account MFA via the same Zitadel instance (circular dependency). Use a TOTP seed printed on a second paper slip in the same envelope — or accept password-only and rely on envelope seal as the second factor.
  • Rotate every 180 days or immediately after use. Log rotation in docs/ops/zitadel/break-glass-log.md.
  • Usage is a reportable event: post in #zitadel-ops within 1 hour.
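
One way to produce the 32-character password is sketched below. It is a sketch under assumptions: the alphanumeric character set is a choice made here, not a requirement stated in this runbook, and `gen_password` is an illustrative name. It reads from /dev/urandom and keeps only letters and digits so the printed slip is unambiguous to type.

```bash
#!/usr/bin/env bash
set -eu

# gen_password [LEN]: print LEN random alphanumeric characters (default 32).
gen_password() {
  len=${1:-32}
  pw=""
  # Each pass filters 64 random bytes down to letters and digits;
  # repeat until enough characters have been collected.
  while [ "${#pw}" -lt "$len" ]; do
    pw="$pw$(head -c 64 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')"
  done
  printf '%s\n' "$pw" | cut -c "1-$len"
}

gen_password 32
```

Send the output straight to the printer or a terminal that is cleared afterwards, rather than to a file or a chat window.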