Troubleshooting Guide

On-call at 3AM. Read top to bottom.

Top 10 symptoms

| #  | Symptom | First check | Section |
|----|---------|-------------|---------|
| 1  | "Login is broken" / loop | /debug/ready + browser network tab | §Login broken |
| 2  | One user can't login, others fine | Account state + MFA | §User blocked |
| 3  | All admin sessions suddenly invalid | Signing key rotation + issuer change | §Sessions |
| 4  | HUPH API returns 401 everywhere | JWKS + clock + issuer | §API 401 |
| 5  | Zitadel container restart loop | Logs: masterkey / DB conn / setup needed | §CrashLoop |
| 6  | Setup / migration failing during upgrade | Setup logs + DB version | §MigrationFail |
| 7  | 502/504 from reverse proxy | Zitadel up? upstream config? | §Proxy |
| 8  | Slow logins (>5s) | DB latency + password hash CPU | §Slow |
| 9  | "Invalid state" OIDC errors | Cookie domain + clock skew | §OIDC state |
| 10 | MFA device lost by user | Reset flow | §MFA reset |

§Login broken ("all users cannot log in")

Bash
# 1. Is Zitadel alive?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/healthz   # 200?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/ready     # 200?

# 2. From outside?
curl -sS https://auth.huph.val.id/.well-known/openid-configuration | jq .issuer

# 3. Container logs
docker compose logs --tail=300 zitadel | grep -iE 'error|fatal|panic'

# 4. DB reachable?
docker exec zitadel-postgres pg_isready -U zitadel -d zitadel

# 5. Recent change? git log /srv/zitadel
git -C /srv/zitadel log --oneline -20

Common fixes:

  • /debug/ready 503 after upgrade → setup didn't finish → re-run docker compose run --rm zitadel setup.
  • OIDC discovery 404 → ExternalDomain mismatch in config. Fix config, re-run setup.
  • TLS cert expired → check openssl s_client -connect auth.huph.val.id:443 </dev/null | openssl x509 -noout -dates.
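The first two probes can be folded into one decision, which helps when reading status codes half asleep. This is a minimal sketch, not part of Zitadel: `ready_hint` is a hypothetical helper, and the section it points to for each outcome is an assumption drawn from this guide's own layout.

```bash
# Map the two probe status codes to the next place to look.
# Inputs are the strings curl prints for %{http_code};
# "000" means curl got no HTTP response at all.
ready_hint() {
  local healthz="$1"
  local ready="$2"
  if [ "$healthz" != "200" ]; then
    echo "not alive: check container logs (§CrashLoop)"
  elif [ "$ready" != "200" ]; then
    echo "alive but not ready: check DB and whether setup finished (§MigrationFail)"
  else
    echo "local probes OK: check discovery URL, proxy and TLS from outside (§Proxy)"
  fi
}

# Usage on the Zitadel host:
#   h=$(curl -sS -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/healthz)
#   r=$(curl -sS -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/ready)
#   ready_hint "$h" "$r"
```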

§User blocked

Bash
# In Zitadel console: Users → search → state.
# Common: "Inactive", "Initial" (never completed init email), "Locked" (too many bad PW).
  • Locked (account lockout policy): unlock in console or POST /management/v1/users/{id}/_unlock.
  • Initial: resend invite email.
  • Inactive: someone deactivated — check events: SELECT * FROM eventstore.events WHERE aggregate_id='<user_id>' ORDER BY created_date DESC LIMIT 20;

§Sessions — all admin sessions invalid

Usually one of:

  • Signing key rotated. Check console → Projects → Keys. If rotation just happened, JWTs minted before it are now invalid and users must log in again. This is by design; communicate via Slack.
  • Issuer URL changed. Old tokens have iss: https://auth.huph.val.id, apps expect iss: https://auth.huph.val.id/. The trailing slash matters. Check the apps/api JWT verify config.
  • NEXTAUTH_SECRET mismatch between admin + api. See CLAUDE.md note. Fix env and restart both.
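To tell the trailing-slash case apart from a genuinely different issuer, compare the two strings byte for byte. A minimal sketch: `issuer_check` is a hypothetical helper, and it assumes the verifier in apps/api does an exact string comparison on iss.

```bash
# Compare a token's iss claim with the issuer an app expects.
# The first test is exact, so a trailing slash on one side only is a mismatch;
# the second test strips one trailing slash from each side to name that case.
issuer_check() {
  local token_iss="$1"
  local expected_iss="$2"
  if [ "$token_iss" = "$expected_iss" ]; then
    echo "match"
  elif [ "${token_iss%/}" = "${expected_iss%/}" ]; then
    echo "trailing-slash mismatch"
  else
    echo "different issuer"
  fi
}

# Usage:
#   issuer_check "https://auth.huph.val.id" "https://auth.huph.val.id/"
```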

§API 401 — token validation failing

Bash
# 1. Fetch JWKS
curl -sS https://auth.huph.val.id/oauth/v2/keys | jq

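# 2. Decode a problem token (in apps/api logs). JWT segments are base64url without
#    padding, so map the alphabet and re-pad first. Use -f1 for the header (kid).
seg=$(echo '<jwt>' | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
echo "$seg" | base64 -d | jq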

# 3. Check:
# - iss matches expected issuer
# - aud contains the expected client_id
# - exp > now()  (clock skew? check `timedatectl`)
# - kid in header exists in JWKS

Fixes:

  • JWKS 404 / wrong path → verify OIDC discovery URL path.
  • Clock skew > 60s → timedatectl set-ntp true on API host.
  • kid not found → JWKS cache stale in apps/api. Restart API or shorten cache TTL.
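To put a number on clock skew, compare the API host's clock with the Date header the auth server returns. A rough sketch only: `clock_skew` is a hypothetical helper, it assumes GNU date (for `date -d`), and the result includes network latency, so treat small values as noise.

```bash
# Print the signed difference in seconds between an HTTP Date header value
# and a local epoch timestamp (positive: the remote clock is ahead).
# Assumes GNU date, which can parse RFC 1123 dates via -d.
clock_skew() {
  local remote_date="$1"
  local local_epoch="$2"
  local remote_epoch
  remote_epoch=$(date -u -d "$remote_date" +%s) || return 1
  echo $(( remote_epoch - local_epoch ))
}

# Usage on the API host:
#   d=$(curl -sSI https://auth.huph.val.id/.well-known/openid-configuration \
#         | tr -d '\r' | sed -n 's/^[Dd]ate: //p')
#   clock_skew "$d" "$(date +%s)"
```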

§CrashLoop

Bash
docker compose logs --tail=200 zitadel

| Log line | Meaning / Fix |
|----------|---------------|
| masterkey must be 32 bytes | Key file wrong length. wc -c /etc/zitadel/masterkey |
| cannot decrypt ... cipher: message authentication failed | WRONG master key for this DB. Stop. Restore correct key. |
| database "zitadel" does not exist | Init phase skipped. Run setup. |
| migration X failed: ... | See §MigrationFail |
| bind: address already in use | Port 8080 held by other process. |
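For the "must be 32 bytes" case, a stray trailing newline (for example from writing the file with echo) is one plausible cause, and wc -c alone shows 33 without saying why. A minimal sketch: `masterkey_check` is a hypothetical helper and assumes the key is stored as raw characters in the file.

```bash
# Report whether a key file is exactly 32 bytes, and flag the case
# where it is 33 bytes because the last byte is a newline.
masterkey_check() {
  local file="$1"
  local size
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -eq 32 ]; then
    echo "ok"
  elif [ "$size" -eq 33 ] && [ "$(tail -c 1 "$file" | od -An -tx1 | tr -d ' \n')" = "0a" ]; then
    echo "33 bytes, last byte is a newline"
  else
    echo "wrong length: $size bytes"
  fi
}

# Usage:
#   masterkey_check /etc/zitadel/masterkey
```

The check only reads the file. If the key on disk is the wrong one for this DB, fixing its length will not help; see the "cannot decrypt" row above.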

§MigrationFail

Bash
# Capture the failed migration name from logs.
# DO NOT re-run setup blindly if it left a partial migration.

# Option A (safe): restore DB from pre-upgrade backup, pin previous image, boot.
# See 02-upgrade.md Rollback.

# Option B: if the error is transient (lock timeout), retry:
docker compose run --rm zitadel setup --masterkeyFile /run/secrets/masterkey

# Option C: open a Zitadel GitHub issue with logs before attempting schema surgery.

Never hand-edit eventstore.events.

§Proxy (nginx 502/504)

Bash
# Upstream healthy?
curl -sS http://127.0.0.1:8080/debug/healthz

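# nginx config matches? (gRPC upstreams use grpc_pass, so match both directives.)
grep -RInE 'proxy_pass|grpc_pass' /etc/nginx/sites-enabled/auth*

# gRPC support enabled? Zitadel needs HTTP/2 upstream for some endpoints.
# Note: `proxy_http_version 1.1;` only sets HTTP/1.1 towards the upstream; in nginx,
# HTTP/2 to the upstream for gRPC comes from `grpc_pass`. Confirm that plus correct headers.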

§Slow logins

Bash
# DB query latency (requires the pg_stat_statements extension)
docker exec zitadel-postgres psql -U zitadel -d zitadel -c \
    "select query, mean_exec_time from pg_stat_statements order by mean_exec_time desc limit 10;"

# CPU saturation (password hashing is bcrypt/argon2, CPU-bound)
docker stats zitadel --no-stream

Zitadel recommends 4+ CPU cores for password-hash spikes. At 50 users this only matters during enrolment storms.

§OIDC state errors

Almost always:

  • Cookie domain mismatch (auth.huph.val.id vs admin.huph.val.id; needs a shared parent domain or correct sameSite config).
  • Clock skew > 30s between user browser and server (rare, unfixable server-side).
  • Session store flushed (Valkey restart): users must log in again.
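The cookie-domain case can be checked on paper before touching config: a cookie with a given Domain attribute is only sent to hosts that equal that domain or end in a dot plus that domain. A minimal sketch of that rule (the domain-match from RFC 6265, hostnames only): `cookie_domain_covers` is a hypothetical helper and says nothing about which cookie the apps actually set.

```bash
# Return 0 if a cookie with Domain=$1 would be sent to host $2.
# A leading dot in the Domain attribute is ignored; matching is case-insensitive.
cookie_domain_covers() {
  local domain
  local host
  domain=$(printf '%s' "${1#.}" | tr 'A-Z' 'a-z')
  host=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
  case "$host" in
    "$domain"|*".$domain") return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   cookie_domain_covers huph.val.id admin.huph.val.id      && echo covered
#   cookie_domain_covers auth.huph.val.id admin.huph.val.id || echo "not covered"
```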

§MFA reset

User lost phone / YubiKey:

  1. Verify identity out of band (video call + employee ID). MFA bypass is the #1 social-engineering target.
  2. Zitadel console → User → Multifactor → Remove.
  3. User re-enrolls at next login.
  4. Record the reset in docs/ops/zitadel/mfa-reset-log.md with date, admin, verification method, and user.