Zitadel Ops Playbook (HUPH, 50 users)
On-call at 3AM? Jump to:
- `01-master-key.md`: key storage, backup, recovery, rotation
- `02-upgrade.md`: pre-flight checklist, upgrade, rollback, v4→v5
- `03-cve-response.md`: advisories, triage SLA, patch flow
- `04-backup-dr.md`: cron, retention, restore drill, DR scenarios
- `05-monitoring.md`: Prometheus scrape, alerts, log retention
- `06-user-provisioning.md`: add/deactivate/role-change, audit
- `07-troubleshooting.md`: top 10 symptoms + diagnosis flows
- `08-runbook-template.md`: template for new runbooks
Targets: <2 hrs/month steady-state, RPO 24h, RTO 2h.
Deployment facts (updated 2026-04-20 after Phase 0 install)
HOST : dx03 (103.30.246.131)
ZITADEL VER : v4.13.1 # update each upgrade
DB : postgres 16, container `huph-zitadel-postgres`
volume `huph_zitadel_pg_data` (isolated from huph-postgres)
COMPOSE FILE : /opt/huph/docker-compose.zitadel.yml (merged into main
docker-compose.yml via `include` directive)
MASTER KEY : /etc/huph/zitadel-masterkey.secret (root:root 0644 —
chmod 644 REQUIRED because Compose non-Swarm mode ignores
secret `mode` field; dir /etc/huph is 0750 so only root
can reach it anyway)
DB PASSWORD : /etc/huph/zitadel-db-password.secret (same ownership)
COMPOSE ENV : /etc/huph/zitadel.env (ZITADEL_MANAGEMENT_PAT, project
IDs, client credentials — chmod 600)
BACKUP DEST : /var/backups/zitadel/{daily,weekly,monthly}
+ /root/zitadel-backup-initial (first-boot copies)
offsite rclone TODO
BACKUP CRON : /etc/cron.d/zitadel-backup (03:00 WIB daily)
EXTERNAL URL : https://auth.huph.val.id (TLS via certbot, exp 2026-07-19)
NGINX VHOST : /etc/nginx/sites-enabled/auth.huph.val.id
METRICS : http://127.0.0.1:8080/debug/metrics
HEALTH : http://127.0.0.1:8080/debug/healthz
(no curl/wget inside distroless — container healthcheck
disabled; rely on Prometheus)
PROJECT : huph (id: 369367908372975628) with 6 roles:
admin, counselor:{cass,cbt,chs,cist,cne}
OIDC APP : huph-admin (client_id: 369367908775694348)
SERVICE USER : huph-mgmt (auto-bootstrapped IAM_OWNER + PAT via
FIRSTINSTANCE_PATPATH)
ACTION TARGET : huph-pre-token-claim (id: 369380257242812428)
→ http://huph-api:3101/actions/pre-token
BREAK-GLASS : org `huph-breakglass` with breakglass-1@val.id +
breakglass-2@val.id (password-only, MFA-forced).
Credentials: 1Password HUPH Ops + sealed envelope.
INITIAL ADMIN : fariz@val.id (IAM_OWNER on uph org)
ADMIN APP : admin.huph.val.id → Next.js 14 + Auth.js v5 + Zitadel
provider. AUTH_PROVIDER=zitadel in /etc/huph/admin.env.
API AUTH MODE : both (accepts Zitadel Bearer OR legacy HMAC). Plan:
flip to `zitadel` after soak.
FEATURE FLAGS : LOGINV2_REQUIRED=false (use classic Angular /ui/login;
Login UI v2 needs separate Next.js container not deployed)
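
The file modes recorded above can drift after a manual edit or a restore from backup. The following is a minimal sketch of a drift check, assuming a Linux host with GNU `stat` (with a BSD fallback). The function name `check_mode` is hypothetical and not part of this install.

```shell
#!/usr/bin/env bash
# check_mode is a hypothetical helper, not something shipped with this install.
# It compares a path's octal permission bits against the expected value.
check_mode() {
  local path="$1" expected="$2" actual
  # GNU stat first (Linux hosts); fall back to BSD syntax; empty if both fail.
  actual="$(stat -c '%a' "$path" 2>/dev/null || stat -f '%Lp' "$path" 2>/dev/null || true)"
  if [ -z "$actual" ]; then
    echo "MISSING $path" >&2
    return 2
  fi
  if [ "$actual" != "$expected" ]; then
    echo "DRIFT   $path is $actual, expected $expected" >&2
    return 1
  fi
  echo "OK      $path ($actual)"
}
```

Run as root, since `/etc/huph` is 0750. Calls matching the facts above would look like `check_mode /etc/huph/zitadel-masterkey.secret 644`, `check_mode /etc/huph/zitadel.env 600`, and `check_mode /etc/huph 750`.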
Gotchas discovered during install (read before debugging 3am issues)
1. DB password: Zitadel v4 doesn't honor the `_FILE` env var suffix for
   `ZITADEL_DATABASE_POSTGRES_USER_PASSWORD`. Pass it as a plain env var via
   shell interpolation: compose `${ZITADEL_DB_PASSWORD}` + shell
   `export ZITADEL_DB_PASSWORD=$(cat .../secret.secret)`. Postgres itself
   still uses `POSTGRES_PASSWORD_FILE` normally.
   ⚠ DANGER at recreate time: this is not just an install-day
   concern. ANY `docker compose up -d zitadel` that rebuilds the
   container (even for a cosmetic compose edit like a healthcheck
   change) needs `ZITADEL_DB_PASSWORD` in the current shell, OR
   Zitadel comes back up with an empty password and crash-loops on
   `FATAL: password authentication failed for user "zitadel"`. Real
   incident 2026-04-20 12:15 UTC: 85s SSO outage. Safe recipe:

       sudo bash -c '
         export ZITADEL_DB_PASSWORD=$(cat /etc/huph/zitadel-db-password.secret)
         cd /opt/huph
         docker compose -p huph-zitadel-migration \
           -f docker-compose.zitadel.yml up -d zitadel
       '

   Project name MUST be `huph-zitadel-migration` (not the dir-default
   `huph`): the volume `huph_zitadel_pg_data` was bound to that
   project on install; a wrong `-p` triggers a container-name conflict
   and silently does nothing to the running service.
2. Docker Compose ignores `mode: 0444` on secrets outside Swarm. Use
chmod 644 on the host file and rely on directory permissions for protection.
3. Bootstrap password required — set ZITADEL_FIRSTINSTANCE_ORG_HUMAN_PASSWORD
on first boot or you can't log in without SMTP configured.
4. Auto-PAT via ZITADEL_FIRSTINSTANCE_PATPATH avoids the chicken-and-egg
of "need PAT to create PAT". Requires tmpfs-like mount to expose to host
(our install uses /var/lib/huph-zitadel-bootstrap).
5. Login v2 disabled in this install — LOGINV2_REQUIRED=false +
blank OIDC_DEFAULTLOGINURLV2/V2 AT FIRST BOOT. Switching later needs
a fresh volume (feature projection is immutable after init).
6. Action v2 API at /v2beta/actions/* (not /resources/v3alpha/).
Execution condition uses function.name=preaccesstoken. Service-level
conditions fire too broadly. Duration fields need "5s" not "500ms"
(proto Duration parser strict).
7. Action signing key — Zitadel returns its own signingKey on target
create; use that as ZITADEL_ACTION_SECRET. Our submitted value is
discarded.
8. Script API endpoints differ from docs:
- Create org: /admin/v1/orgs/_setup (not /admin/v1/orgs); payload
uses human top-level, not user.human.
- Add member to org: /management/v1/orgs/me/members with
x-zitadel-orgid header (not /management/v1/orgs/{id}/members).
- Project roles use roleKey field (not key).
- AddHumanUserRequest.hashedPassword is top-level oneof (not nested
inside password).
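
The DB password gotcha in the list above bites whenever someone recreates the container from a shell that lacks `ZITADEL_DB_PASSWORD`. One way to make that harder to do by accident is a wrapper that refuses to call Compose unless the secret was actually loaded. This is a sketch only: `load_db_password` and `recreate_zitadel` are hypothetical names, not part of this install. The paths and project name are the ones from the safe recipe.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; nothing here ships with the install.

# Read the DB password from the secret file and export it.
# Fails loudly instead of exporting an empty value.
load_db_password() {
  local secret_file="${1:-/etc/huph/zitadel-db-password.secret}"
  local value
  if [ ! -r "$secret_file" ]; then
    echo "cannot read $secret_file (run as root?)" >&2
    return 1
  fi
  value="$(cat "$secret_file")"
  if [ -z "$value" ]; then
    echo "$secret_file is empty; refusing to continue" >&2
    return 1
  fi
  export ZITADEL_DB_PASSWORD="$value"
}

# Recreate the zitadel service only after the password is in the environment.
# Runs in a subshell so the caller's working directory is not changed.
recreate_zitadel() {
  load_db_password "$@" || return 1
  (
    cd /opt/huph || exit 1
    docker compose -p huph-zitadel-migration \
      -f docker-compose.zitadel.yml up -d zitadel
  )
}
```

If saved to a file (any path; none is assumed here), it would be sourced in a root shell and followed by `recreate_zitadel`. Keeping the load and the Compose call in one function means the password cannot be forgotten between the two steps.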
Env var injection: admin.env needs ZITADEL_MANAGEMENT_PAT
The admin Next.js process (huph-admin.service) uses the Zitadel Management
API for user CRUD + MFA probe + reset-password email. This requires
ZITADEL_MANAGEMENT_PAT in the process environment.
Source of truth: /etc/huph/zitadel.env (chmod 600, root-only).
Mirror: /etc/huph/admin.env has a copy (chmod 640). Admin service
loads only admin.env — we deliberately do NOT point its
EnvironmentFile= at zitadel.env because that file also contains
ZITADEL_DB_PASSWORD and ZITADEL_CLIENT_SECRET that the admin Node
process does not need.
Rotating the PAT:
# 1. Rotate in Zitadel portal → Service Users → huph-mgmt → Personal Tokens
# 2. Update /etc/huph/zitadel.env with the new token
# 3. Mirror into /etc/huph/admin.env (same variable name)
# 4. Restart the admin service
sudo systemctl restart huph-admin
# 5. Verify: journalctl -u huph-admin --since "1 min ago" | grep 'MFA probe'
# should be empty (no 'ZITADEL_MANAGEMENT_PAT not configured' errors)
First discovered 2026-04-20 16:55 UTC, when the admin process started logging
[users.GET] MFA probe failed ... ZITADEL_MANAGEMENT_PAT not configured
on every /settings/users load. Before that the failure was silent: the MFA
column just showed "—" (unknown) for all users.
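
Step 3 of the rotation (mirroring the token into admin.env) is the one most likely to be skipped or mistyped. The sketch below copies only the `ZITADEL_MANAGEMENT_PAT` line across, so the other secrets in zitadel.env never reach the admin process. It assumes both files use plain `KEY=value` lines with no `export` prefix or quoting. `mirror_pat` is a hypothetical name, not an existing script.

```shell
#!/usr/bin/env bash
# mirror_pat is a hypothetical helper, not part of this install.
# It copies the ZITADEL_MANAGEMENT_PAT line from the source-of-truth env file
# into the admin env file and leaves every other variable there untouched.
mirror_pat() {
  local src="${1:-/etc/huph/zitadel.env}"
  local dst="${2:-/etc/huph/admin.env}"
  local line tmp

  [ -r "$src" ] && [ -w "$dst" ] || { echo "cannot access $src or $dst" >&2; return 1; }

  # Use the last assignment in case the source has stale duplicates.
  line="$(grep -E '^ZITADEL_MANAGEMENT_PAT=' "$src" | tail -n 1 || true)"
  if [ -z "$line" ] || [ "$line" = "ZITADEL_MANAGEMENT_PAT=" ]; then
    echo "ZITADEL_MANAGEMENT_PAT missing or empty in $src" >&2
    return 1
  fi

  tmp="$(mktemp)" || return 1
  # Drop any old PAT line from the destination, then append the current one.
  grep -v -E '^ZITADEL_MANAGEMENT_PAT=' "$dst" > "$tmp" || true
  printf '%s\n' "$line" >> "$tmp"
  # Overwrite in place (not mv) so the owner and mode of $dst are preserved.
  cat "$tmp" > "$dst"
  rm -f "$tmp"
}
```

Called with no arguments it uses the two paths named above. The service still needs the restart from step 4 afterwards, since the process reads its environment only at start.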
Ops scripts
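Ops scripts that can be run ad hoc or from cron (all require
/etc/huph/zitadel.env to be sourced first). Only the verify script is a pure
read-only check; the reconcile and policy scripts write changes unless run
with --dry-run, so read the Purpose column before scheduling anything: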
| Script | Purpose | Exit codes |
|---|---|---|
| `scripts/zitadel/verify-action-registration.ts` | Confirms the `huph-pre-token-claim` target + `preaccesstoken` execution still point at `http://huph-api:3101/actions/pre-token`. Catches drift from manual UI edits or target re-creation (which rotates `signingKey` and silently breaks HMAC). | 0 pass, 1 drift, 2 env missing |
| `scripts/zitadel/reconcile-users.ts` | Nightly Zitadel → `admin_users` shadow reconcile. `--dry-run` available. | 0 pass |
| `scripts/zitadel/apply-login-ux.ts` | Sets login-UX policy tuned for small non-tech team: OIDC token lifetimes 12h, uph-org `passwordCheckLifetime` 30d, `mfaInitSkipLifetime` ~10y (no nag), password complexity length-only (min 10, no character-class rules; NIST 800-63B aligned). Supersedes `apply-token-policy.ts`. `--dry-run` available; idempotent. | 0 pass, 1 error, 2 env missing |
| `scripts/zitadel/enforce-mfa-policy.ts` | Currently NOT applied (rolled back 2026-04-20 13:36 UTC). Enables OTP + U2F on instance second-factor allowlist, Passkey on multi-factor allowlist, and sets `forceMfa: true` + `mfaInitSkipLifetime: 0s` on the uph org. Re-enable when email-OTP (`SECOND_FACTOR_TYPE_OTP_EMAIL`) is preferred factor + SMTP delivery verified. `--dry-run` available; idempotent. | 0 pass, 1 error, 2 env missing |
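
When these scripts run from cron, the bare exit code is all that reaches the log. The sketch below turns the codes from the table into readable labels. `describe_exit` is a hypothetical helper. It encodes only the codes documented above and reports anything else as undocumented rather than guessing.

```shell
#!/usr/bin/env bash
# describe_exit is a hypothetical helper, not part of the repo.
# Arguments: script name without path or .ts suffix, then the exit code.
describe_exit() {
  local script="$1" code="$2"
  case "$script:$code" in
    *:0)
      echo "pass" ;;
    verify-action-registration:1)
      echo "drift" ;;
    apply-login-ux:1|enforce-mfa-policy:1)
      echo "error" ;;
    verify-action-registration:2|apply-login-ux:2|enforce-mfa-policy:2)
      echo "env missing" ;;
    *)
      # Includes any non-zero exit from reconcile-users, which documents only 0.
      echo "undocumented exit code $code" ;;
  esac
}
```

After invoking a script by whatever means, `describe_exit verify-action-registration "$?"` gives a label suitable for a log line or an alert subject.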
Run: