Troubleshooting Guide
On-call at 3AM. Read top to bottom.
Top 10 symptoms
| # | Symptom | First check | Section |
|---|---|---|---|
| 1 | "Login is broken" / loop | /debug/ready + browser network tab | §Login broken |
| 2 | One user can't login, others fine | Account state + MFA | §User blocked |
| 3 | All admin sessions suddenly invalid | Signing key rotation + issuer change | §Sessions |
| 4 | HUPH API returns 401 everywhere | JWKS + clock + issuer | §API 401 |
| 5 | Zitadel container restart loop | Logs: masterkey / DB conn / setup needed | §CrashLoop |
| 6 | Setup / migration failing during upgrade | Setup logs + DB version | §MigrationFail |
| 7 | 502/504 from reverse proxy | Zitadel up? upstream config? | §Proxy |
| 8 | Slow logins (>5s) | DB latency + password hash CPU | §Slow |
| 9 | "Invalid state" OIDC errors | Cookie domain + clock skew | §OIDC state |
| 10 | MFA device lost by user | Reset flow | §MFA reset |
§Login broken ("all users cannot log in")
# 1. Is Zitadel alive?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/healthz # 200?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/ready # 200?
# 2. From outside?
curl -sS https://auth.huph.val.id/.well-known/openid-configuration | jq .issuer
# 3. Container logs
docker compose logs --tail=300 zitadel | grep -iE 'error|fatal|panic'
# 4. DB reachable?
docker exec zitadel-postgres pg_isready -U zitadel -d zitadel
# 5. Recent change? git log /srv/zitadel
git -C /srv/zitadel log --oneline -20
Common fixes:
- /debug/ready 503 after upgrade → setup didn't finish → re-run docker compose run --rm zitadel setup.
- OIDC discovery 404 → ExternalDomain mismatch in config. Fix config, re-run setup.
- TLS cert expired → check openssl s_client -connect auth.huph.val.id:443 </dev/null | openssl x509 -noout -dates.
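The first two probes and the fixes above can be folded into one decision helper. This is only a sketch: `triage_login` is a hypothetical name, and the messages paraphrase this section rather than anything Zitadel prints.

```sh
# Map the status codes of /debug/healthz and /debug/ready to a next step.
# triage_login is a hypothetical helper, not part of Zitadel.
triage_login() {
  healthz="$1"   # HTTP status from /debug/healthz (curl prints 000 if unreachable)
  ready="$2"     # HTTP status from /debug/ready
  if [ "$healthz" != "200" ]; then
    echo "zitadel down: check container logs and DB reachability"
  elif [ "$ready" = "503" ]; then
    echo "not ready: setup may not have finished; re-run setup"
  elif [ "$ready" != "200" ]; then
    echo "unexpected ready status $ready: check container logs"
  else
    echo "zitadel healthy: check discovery issuer, TLS cert, recent changes"
  fi
}

# Usage on the Zitadel host:
#   h=$(curl -sS -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/healthz)
#   r=$(curl -sS -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/ready)
#   triage_login "$h" "$r"
```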
§User blocked
# In Zitadel console: Users → search → state.
# Common: "Inactive", "Initial" (never completed init email), "Locked" (too many bad PW).
- Locked (account lockout policy): unlock in console or POST /management/v1/users/{id}/_unlock.
- Initial: resend invite email.
- Inactive: someone deactivated — check events:
SELECT * FROM eventstore.events WHERE aggregate_id='<user_id>' ORDER BY created_date DESC LIMIT 20;
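For the Locked case, the unlock call can be made with curl. A minimal sketch, assuming a bearer token with user-management rights: `ZITADEL_URL` and `ZITADEL_TOKEN` are hypothetical variable names, and the empty JSON body is an assumption about what the endpoint accepts.

```sh
# Assumptions: ZITADEL_URL and ZITADEL_TOKEN are hypothetical environment
# variables; the token must belong to an identity allowed to manage users.
ZITADEL_URL="${ZITADEL_URL:-https://auth.huph.val.id}"

# Build the unlock endpoint for a user ID (path taken from this section).
unlock_url() {
  printf '%s/management/v1/users/%s/_unlock\n' "$ZITADEL_URL" "$1"
}

# Call the endpoint; -f makes curl exit non-zero on an HTTP error status.
unlock_user() {
  curl -fsS -X POST \
    -H "Authorization: Bearer $ZITADEL_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{}' \
    "$(unlock_url "$1")"
}

# Usage:
#   ZITADEL_TOKEN=<token> unlock_user <user_id>
```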
§Sessions — all admin sessions invalid
Usually one of:
- Signing key rotated. Check console → Projects → Keys. If rotation just happened, JWTs minted before are now invalid. Users must re-login. This is by design; communicate via Slack.
- Issuer URL changed. Old tokens have iss: https://auth.huph.val.id, apps expect iss: https://auth.huph.val.id/. The trailing slash matters because the issuer is compared as an exact string. Check apps/api JWT verify config.
- NEXTAUTH_SECRET mismatch between admin + api. See CLAUDE.md note. Fix env and restart both.
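The issuer case is easy to miss by eye. A small sketch that tells an exact match apart from a trailing-slash-only difference; `compare_issuer` is a hypothetical helper, and both values would be pasted in from a decoded token and the app config.

```sh
# Compare the issuer a token carries with the issuer the app expects.
# Validation needs an exact string match, so a trailing slash alone fails it.
compare_issuer() {
  token_iss="$1"
  expected_iss="$2"
  if [ "$token_iss" = "$expected_iss" ]; then
    echo "match"
  elif [ "${token_iss%/}" = "${expected_iss%/}" ]; then
    # Equal once a single trailing slash is stripped from each side.
    echo "trailing-slash mismatch"
  else
    echo "different issuer"
  fi
}

# Usage:
#   compare_issuer "https://auth.huph.val.id" "https://auth.huph.val.id/"
```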
§API 401 — token validation failing
# 1. Fetch JWKS
curl -sS https://auth.huph.val.id/oauth/v2/keys | jq
# 2. Decode a problem token (in apps/api logs)
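# JWT segments are base64url without padding: map the alphabet back first.
# base64 may warn about the missing padding, hence 2>/dev/null.
echo '<jwt>' | cut -d. -f2 | tr '_-' '/+' | base64 -d 2>/dev/null | jq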
# 3. Check:
# - iss matches expected issuer
# - aud contains the expected client_id
# - exp > now() (clock skew? check `timedatectl`)
# - kid in header exists in JWKS
Fixes:
- JWKS 404 / wrong path → verify OIDC discovery URL path.
- Clock skew > 60s → timedatectl set-ntp true on API host.
- kid not found → JWKS cache stale in apps/api. Restart API or shorten cache TTL.
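Checking iss, aud, exp and kid means decoding both the header and the claims. JWT segments are base64url-encoded with the padding stripped, so a plain `base64 -d` can fail or warn. A minimal sketch, assuming GNU coreutils `base64`; the function names are hypothetical.

```sh
# Decode one base64url segment: map the URL-safe alphabet back to standard
# base64 and restore the padding that JWTs strip.
b64url_decode() {
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="$s==" ;;
    3) s="$s=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

# Segment 1 is the header (holds kid); segment 2 holds the claims.
jwt_header() { b64url_decode "$(printf '%s' "$1" | cut -d. -f1)"; }
jwt_claims() { b64url_decode "$(printf '%s' "$1" | cut -d. -f2)"; }

# Usage (jq is optional, for pretty-printing only):
#   jwt_header '<jwt>' | jq .kid
#   jwt_claims '<jwt>' | jq '{iss, aud, exp}'
```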
§CrashLoop
| Log line | Meaning / Fix |
|---|---|
| masterkey must be 32 bytes | Key file wrong length. wc -c /etc/zitadel/masterkey |
| cannot decrypt ... cipher: message authentication failed | WRONG master key for this DB. Stop. Restore correct key. |
| database "zitadel" does not exist | Init phase skipped. Run setup. |
| migration X failed: ... | See §MigrationFail |
| bind: address already in use | Port 8080 held by other process. |
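The table can be applied mechanically to a log tail. This is a sketch: `classify_crash` is a hypothetical helper, and the patterns are just the substrings quoted in the table, so real log lines may be worded differently.

```sh
# Match one log line against the known crash signatures from the table.
classify_crash() {
  case "$1" in
    *"masterkey must be 32 bytes"*)
      echo "key file wrong length: wc -c /etc/zitadel/masterkey" ;;
    *"message authentication failed"*)
      echo "wrong master key for this DB: stop and restore the correct key" ;;
    *'database "zitadel" does not exist'*)
      echo "init phase skipped: run setup" ;;
    *migration*failed*)
      echo "migration failed: see MigrationFail" ;;
    *"address already in use"*)
      echo "port 8080 held by another process" ;;
    *)
      echo "unknown: read the full log" ;;
  esac
}

# Usage: classify each recent log line and count the distinct verdicts.
#   docker compose logs --tail=300 zitadel |
#     while IFS= read -r line; do classify_crash "$line"; done | sort | uniq -c
```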
§MigrationFail
# Capture the failed migration name from logs.
# DO NOT re-run setup blindly if it left a partial migration.
# Option A (safe): restore DB from pre-upgrade backup, pin previous image, boot.
# See 02-upgrade.md Rollback.
# Option B: if the error is transient (lock timeout), retry:
docker compose run --rm zitadel setup --masterkeyFile /run/secrets/masterkey
# Option C: open a Zitadel GitHub issue with logs before attempting schema surgery.
Never hand-edit eventstore.events.
§Proxy (nginx 502/504)
# Upstream healthy?
curl -sS http://127.0.0.1:8080/debug/healthz
# nginx config matches?
grep -RIn 'proxy_pass' /etc/nginx/sites-enabled/auth*
# gRPC support enabled? Zitadel serves some endpoints over gRPC (HTTP/2).
# Confirm: `proxy_http_version 1.1;` plus correct headers for the HTTP paths;
# gRPC paths typically go through `grpc_pass` rather than `proxy_pass`.
§Slow logins
# DB query latency
docker exec zitadel-postgres psql -U zitadel -d zitadel -c \
"select query, mean_exec_time from pg_stat_statements order by mean_exec_time desc limit 10;"
# CPU saturation (password hashing is bcrypt/argon2, CPU-bound)
docker stats zitadel --no-stream
Zitadel recommends 4+ CPU cores for password-hash spikes. At 50 users this only matters during enrolment storms.
§OIDC state errors
Almost always:
- Cookie domain mismatch (auth.huph.val.id vs admin.huph.val.id — needs shared parent domain or correct sameSite config).
- Clock skew > 30s between user browser and server (rare, unfixable server-side).
- Session store flushed (Valkey restart) — users re-login.
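For the cookie-domain case, a quick check is whether the two hosts are siblings under one parent, since a cookie scoped to that parent is sent to both. A sketch with hypothetical helper names; it only strips the left-most label and does not consult the public suffix list.

```sh
# Strip the left-most label from a host name.
parent_domain() {
  printf '%s\n' "${1#*.}"
}

# Report whether two hosts are siblings under the same parent domain.
same_parent() {
  if [ "$(parent_domain "$1")" = "$(parent_domain "$2")" ]; then
    echo "shared parent: .$(parent_domain "$1")"
  else
    echo "no shared parent"
  fi
}

# Usage:
#   same_parent auth.huph.val.id admin.huph.val.id
```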
§MFA reset
User lost phone / YubiKey:
- Verify identity out of band (video call + employee ID). MFA bypass is the #1 social-engineering target.
- Zitadel console → User → Multifactor → Remove.
- User re-enrolls at next login.
- Log in docs/ops/zitadel/mfa-reset-log.md with date, admin, verification method, user.
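The log entry can be generated so the four fields are never skipped. A sketch only: `mfa_log_entry` is a hypothetical helper, and the pipe-separated row layout is an assumption, not an established format for mfa-reset-log.md.

```sh
# Format one log row from the four fields this section requires.
# $1=date (YYYY-MM-DD)  $2=admin  $3=verification method  $4=user
mfa_log_entry() {
  printf '| %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4"
}

# Usage (appends to the log file named in this section):
#   mfa_log_entry "$(date -u +%F)" <admin> "video call + employee ID" <user> \
#     >> docs/ops/zitadel/mfa-reset-log.md
```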