
Zitadel Ops Playbook (HUPH, 50 users)

On-call at 3AM? Jump to:

  • 01-master-key.md — key storage, backup, recovery, rotation
  • 02-upgrade.md — pre-flight checklist, upgrade, rollback, v4→v5
  • 03-cve-response.md — advisories, triage SLA, patch flow
  • 04-backup-dr.md — cron, retention, restore drill, DR scenarios
  • 05-monitoring.md — Prometheus scrape, alerts, log retention
  • 06-user-provisioning.md — add/deactivate/role-change, audit
  • 07-troubleshooting.md — top 10 symptoms + diagnosis flows
  • 08-runbook-template.md — template for new runbooks

Targets: <2 hrs/month steady-state, RPO 24h, RTO 2h.

Deployment facts (updated 2026-04-20 after Phase 0 install)

HOST          : dx03 (103.30.246.131)
ZITADEL VER   : v4.13.1                 # update each upgrade
DB            : postgres 16, container `huph-zitadel-postgres`
                volume `huph_zitadel_pg_data` (isolated from huph-postgres)
COMPOSE FILE  : /opt/huph/docker-compose.zitadel.yml (merged into main
                docker-compose.yml via `include` directive)
MASTER KEY    : /etc/huph/zitadel-masterkey.secret (root:root 0644 —
                chmod 644 REQUIRED because Compose non-Swarm mode ignores
                secret `mode` field; dir /etc/huph is 0750 so only root
                can reach it anyway)
DB PASSWORD   : /etc/huph/zitadel-db-password.secret (same ownership)
COMPOSE ENV   : /etc/huph/zitadel.env (ZITADEL_MANAGEMENT_PAT, project
                IDs, client credentials — chmod 600)
BACKUP DEST   : /var/backups/zitadel/{daily,weekly,monthly}
                + /root/zitadel-backup-initial (first-boot copies)
                offsite rclone TODO
BACKUP CRON   : /etc/cron.d/zitadel-backup (03:00 WIB daily)
EXTERNAL URL  : https://auth.huph.val.id (TLS via certbot, exp 2026-07-19)
NGINX VHOST   : /etc/nginx/sites-enabled/auth.huph.val.id
METRICS       : http://127.0.0.1:8080/debug/metrics
HEALTH        : http://127.0.0.1:8080/debug/healthz
                (no curl/wget inside distroless — container healthcheck
                disabled; rely on Prometheus)
PROJECT       : huph (id: 369367908372975628) with 6 roles:
                admin, counselor:{cass,cbt,chs,cist,cne}
OIDC APP      : huph-admin (client_id: 369367908775694348)
SERVICE USER  : huph-mgmt (auto-bootstrapped IAM_OWNER + PAT via
                FIRSTINSTANCE_PATPATH)
ACTION TARGET : huph-pre-token-claim (id: 369380257242812428)
                → http://huph-api:3101/actions/pre-token
BREAK-GLASS   : org `huph-breakglass` with breakglass-1@val.id +
                breakglass-2@val.id (password-only, MFA-forced).
                Credentials: 1Password HUPH Ops + sealed envelope.
INITIAL ADMIN : fariz@val.id (IAM_OWNER on uph org)
ADMIN APP     : admin.huph.val.id → Next.js 14 + Auth.js v5 + Zitadel
                provider. AUTH_PROVIDER=zitadel in /etc/huph/admin.env.
API AUTH MODE : both (accepts Zitadel Bearer OR legacy HMAC). Plan:
                flip to `zitadel` after soak.
FEATURE FLAGS : LOGINV2_REQUIRED=false (use classic Angular /ui/login;
                Login UI v2 needs separate Next.js container not deployed)
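
The file modes listed above are easy to lose during a restore or a manual edit. A minimal sketch of a check, assuming GNU stat; check_mode is a hypothetical helper name, not something shipped in the HUPH repo:

```bash
# Sketch: compare a path's octal mode against the value documented above.
# check_mode is a hypothetical helper (assumption, not part of the repo).
check_mode() {
  local path=$1 expected=$2 actual
  # GNU stat: %a prints the permission bits in octal, e.g. 644.
  actual=$(stat -c '%a' "$path") || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK   $path ($actual)"
  else
    echo "DIFF $path (have $actual, want $expected)"
    return 1
  fi
}

# Usage on the host (as root), with the modes from the facts block:
#   check_mode /etc/huph 750
#   check_mode /etc/huph/zitadel-masterkey.secret 644
#   check_mode /etc/huph/zitadel.env 600
```

The function returns 0 on a match, 1 on a mismatch and 2 when the path cannot be read, so it can be chained in a cron job if that turns out to be useful.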

Gotchas discovered during install (read before debugging 3am issues)

  1. DB password: Zitadel v4 doesn't honor the _FILE env var suffix for ZITADEL_DATABASE_POSTGRES_USER_PASSWORD. Pass it as a plain env var via shell interpolation: reference ${ZITADEL_DB_PASSWORD} in the compose file and run export ZITADEL_DB_PASSWORD=$(cat .../secret.secret) in the shell first. Postgres itself still uses POSTGRES_PASSWORD_FILE normally.

⚠ DANGER at recreate time: this is not just an install-day concern. ANY docker compose up -d zitadel that recreates the container (even for a cosmetic compose edit like a healthcheck change) needs ZITADEL_DB_PASSWORD exported in the current shell. Otherwise Zitadel comes back up with an empty password and crash-loops on FATAL: password authentication failed for user "zitadel". Real incident 2026-04-20 12:15 UTC: 85s SSO outage. Safe recipe:

Bash
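sudo bash -c '
  set -eu
  # Assign first, then export: "export VAR=$(...)" would hide a failed cat
  # and bring Zitadel up with an empty password.
  ZITADEL_DB_PASSWORD=$(cat /etc/huph/zitadel-db-password.secret)
  export ZITADEL_DB_PASSWORD
  cd /opt/huph
  # -p must be huph-zitadel-migration, the project name used at install.
  docker compose -p huph-zitadel-migration \
                 -f docker-compose.zitadel.yml up -d zitadel
'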

Project name MUST be huph-zitadel-migration (not the dir-default huph): the volume huph_zitadel_pg_data was bound to that project on install; a wrong -p triggers a container-name conflict and silently does nothing to the running service.

  2. Docker Compose mode: 0444 on secrets is ignored outside Swarm. Use chmod 644 on the host file; rely on dir perms for protection.

  3. Bootstrap password required: set ZITADEL_FIRSTINSTANCE_ORG_HUMAN_PASSWORD on first boot or you can't log in without SMTP configured.

  4. Auto-PAT via ZITADEL_FIRSTINSTANCE_PATPATH avoids the chicken-and-egg of "need PAT to create PAT". Requires a tmpfs-like mount to expose it to the host (our install uses /var/lib/huph-zitadel-bootstrap).

  5. Login v2 disabled in this install: LOGINV2_REQUIRED=false + blank OIDC_DEFAULTLOGINURLV2/V2 AT FIRST BOOT. Switching later needs a fresh volume (feature projection is immutable after init).

  6. Action v2 API lives at /v2beta/actions/* (not /resources/v3alpha/). Execution condition uses function.name=preaccesstoken. Service-level conditions fire too broadly. Duration fields need "5s" not "500ms" (proto Duration parser is strict).

  7. Action signing key: Zitadel returns its own signingKey on target create; use that as ZITADEL_ACTION_SECRET. Our submitted value is discarded.

  8. Script API endpoints differ from docs:
     - Create org: /admin/v1/orgs/_setup (not /admin/v1/orgs); payload uses human top-level, not user.human.
     - Add member to org: /management/v1/orgs/me/members with x-zitadel-orgid header (not /management/v1/orgs/{id}/members).
     - Project roles use the roleKey field (not key).
     - AddHumanUserRequest.hashedPassword is a top-level oneof (not nested inside password).
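
The recreate-time danger in gotcha 1 can be guarded with a small pre-flight check. This is a sketch only: require_db_password is a hypothetical helper name, and the paths in the usage comment are the ones from the deployment facts.

```bash
# Sketch: refuse to continue unless the DB password can be read and is non-empty.
# require_db_password is a hypothetical helper (assumption, not part of the repo).
require_db_password() {
  local secret_file=$1 pw
  [ -r "$secret_file" ] || { echo "cannot read $secret_file" >&2; return 1; }
  pw=$(cat "$secret_file")
  [ -n "$pw" ] || { echo "$secret_file is empty" >&2; return 1; }
  # Only export once the value is known to be non-empty.
  export ZITADEL_DB_PASSWORD=$pw
}

# Usage (as root), same project name and compose file as the safe recipe:
#   require_db_password /etc/huph/zitadel-db-password.secret &&
#     cd /opt/huph &&
#     docker compose -p huph-zitadel-migration \
#                    -f docker-compose.zitadel.yml up -d zitadel
```

Chaining with && means docker compose never runs when the secret is missing or empty, which is the failure mode behind the 85s outage.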

Env var injection: admin.env needs ZITADEL_MANAGEMENT_PAT

The admin Next.js process (huph-admin.service) uses the Zitadel Management API for user CRUD + MFA probe + reset-password email. This requires ZITADEL_MANAGEMENT_PAT in the process environment.

Source of truth: /etc/huph/zitadel.env (chmod 600, root-only). Mirror: /etc/huph/admin.env has a copy (chmod 640). Admin service loads only admin.env — we deliberately do NOT point its EnvironmentFile= at zitadel.env because that file also contains ZITADEL_DB_PASSWORD and ZITADEL_CLIENT_SECRET that the admin Node process does not need.
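
Because the PAT lives in two files, the mirror can drift from the source of truth. A minimal sketch of a drift check that never prints the token, assuming both files hold plain unquoted KEY=value lines; env_value and pat_in_sync are hypothetical helper names:

```bash
# Sketch: check that the mirror still carries the same PAT as the source file.
# env_value and pat_in_sync are hypothetical helpers (assumption).
env_value() {
  # Print the value of KEY from a KEY=value env file (last assignment wins).
  local file=$1 key=$2
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

pat_in_sync() {
  local source_file=$1 mirror_file=$2 a b
  a=$(env_value "$source_file" ZITADEL_MANAGEMENT_PAT)
  b=$(env_value "$mirror_file" ZITADEL_MANAGEMENT_PAT)
  # An empty source value counts as a failure, not as "in sync".
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (as root):
#   if pat_in_sync /etc/huph/zitadel.env /etc/huph/admin.env; then
#     echo "PAT in sync"
#   else
#     echo "PAT DRIFT"
#   fi
```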

Rotating the PAT:

Bash
# 1. Rotate in Zitadel portal → Service Users → huph-mgmt → Personal Tokens
# 2. Update /etc/huph/zitadel.env with the new token
# 3. Mirror into /etc/huph/admin.env (same variable name)
# 4. Restart the admin service
sudo systemctl restart huph-admin
# 5. Verify: journalctl -u huph-admin --since "1 min ago" | grep 'MFA probe'
#    should be empty (no 'ZITADEL_MANAGEMENT_PAT not configured' errors)
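
Step 3 is the one most likely to be fat-fingered, so it may be worth scripting. The following is a sketch under the assumption that both files use plain KEY=value lines; mirror_env_var is a hypothetical helper, not an existing script:

```bash
# Sketch of step 3: copy one variable from the source env file into the mirror.
# mirror_env_var is a hypothetical helper (assumption, not part of the repo).
mirror_env_var() {
  local key=$1 src=$2 dst=$3 line tmp
  line=$(grep "^${key}=" "$src" | tail -n 1) || true
  [ -n "$line" ] || { echo "$key not found in $src" >&2; return 1; }
  tmp=$(mktemp) || return 1   # mktemp creates the file with mode 600
  # Keep every other line of the mirror, then append the fresh assignment.
  grep -v "^${key}=" "$dst" > "$tmp" || true
  printf '%s\n' "$line" >> "$tmp"
  # Overwrite in place so the mirror keeps its owner, group and mode.
  cat "$tmp" > "$dst"
  rm -f "$tmp"
}

# Usage (as root), followed by the restart from step 4:
#   mirror_env_var ZITADEL_MANAGEMENT_PAT /etc/huph/zitadel.env /etc/huph/admin.env
```

Writing through the existing file rather than moving a new one over it is a deliberate choice here: admin.env keeps its 640 mode and group, so the admin service can still read it after the update.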

First discovered 2026-04-20 16:55 UTC, when the admin process started logging [users.GET] MFA probe failed ... ZITADEL_MANAGEMENT_PAT not configured on every /settings/users load. Before that the failure was silent: the MFA column just showed "—" (unknown) for all users.

Ops scripts

Read-only checks safe to run ad-hoc or on a cron (all require /etc/huph/zitadel.env to be sourced first):

| Script | Purpose | Exit codes |
|---|---|---|
| scripts/zitadel/verify-action-registration.ts | Confirms huph-pre-token-claim target + preaccesstoken execution still point at http://huph-api:3101/actions/pre-token. Catches drift from manual UI edits or target re-creation (which rotates signingKey and silently breaks HMAC). | 0 pass, 1 drift, 2 env missing |
| scripts/zitadel/reconcile-users.ts | Nightly Zitadel → admin_users shadow reconcile. --dry-run available. | 0 pass |
| scripts/zitadel/apply-login-ux.ts | Sets login-UX policy tuned for small non-tech team: OIDC token lifetimes 12h, uph-org passwordCheckLifetime 30d, mfaInitSkipLifetime ~10y (no nag), password complexity length-only (min 10, no character-class rules; NIST 800-63B aligned). Supersedes apply-token-policy.ts. --dry-run available; idempotent. | 0 pass, 1 error, 2 env missing |
| scripts/zitadel/enforce-mfa-policy.ts | Currently NOT applied (rolled back 2026-04-20 13:36 UTC). Enables OTP + U2F on instance second-factor allowlist, Passkey on multi-factor allowlist, and sets forceMfa: true + mfaInitSkipLifetime: 0s on the uph org. Re-enable when email-OTP (SECOND_FACTOR_TYPE_OTP_EMAIL) is preferred factor + SMTP delivery verified. --dry-run available; idempotent. | 0 pass, 1 error, 2 env missing |

Run:

Bash
set -a && source /etc/huph/zitadel.env && set +a
npx tsx scripts/zitadel/verify-action-registration.ts
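
When the check runs from cron, the exit code is all that survives. A small sketch that maps the codes documented for verify-action-registration.ts to a one-line status; describe_exit is a hypothetical helper name:

```bash
# Sketch: translate the documented exit codes of verify-action-registration.ts
# into a short status string. describe_exit is a hypothetical helper (assumption).
describe_exit() {
  case $1 in
    0) echo "pass" ;;
    1) echo "drift" ;;
    2) echo "env missing" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Usage, after sourcing /etc/huph/zitadel.env as shown above:
#   rc=0
#   npx tsx scripts/zitadel/verify-action-registration.ts || rc=$?
#   echo "verify-action-registration: $(describe_exit "$rc")"
```

Capturing the code with || rc=$? keeps the wrapper alive under set -e, so a drift result still gets reported instead of aborting the job.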