Troubleshooting Guide

On-call at 3AM. Read top to bottom.

Top 10 symptoms

#	Symptom	First check	Section
1	"Login is broken" / loop	`/debug/ready` + browser network tab	§Login broken
2	One user can't log in, others fine	Account state + MFA	§User blocked
3	All admin sessions suddenly invalid	Signing key rotation + issuer change	§Sessions
4	HUPH API returns 401 everywhere	JWKS + clock + issuer	§API 401
5	Zitadel container restart loop	Logs: masterkey / DB conn / setup needed	§CrashLoop
6	Setup / migration failing during upgrade	Setup logs + DB version	§MigrationFail
7	502/504 from reverse proxy	Zitadel up? upstream config?	§Proxy
8	Slow logins (>5s)	DB latency + password hash CPU	§Slow
9	"Invalid state" OIDC errors	Cookie domain + clock skew	§OIDC state
10	MFA device lost by user	Reset flow	§MFA reset

§Login broken ("all users cannot log in")

Bash

# 1. Is Zitadel alive?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/healthz   # 200?
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/debug/ready     # 200?

# 2. From outside?
curl -sS https://auth.huph.val.id/.well-known/openid-configuration | jq .issuer

# 3. Container logs
docker compose logs --tail=300 zitadel | grep -iE 'error|fatal|panic'

# 4. DB reachable?
docker exec zitadel-postgres pg_isready -U zitadel -d zitadel

# 5. Recent change? git log /srv/zitadel
git -C /srv/zitadel log --oneline -20

Common fixes:
- /debug/ready returns 503 after upgrade → setup didn't finish → re-run docker compose run --rm zitadel setup.
- OIDC discovery returns 404 → ExternalDomain mismatch in config. Fix config, re-run setup.
- TLS cert expired → check with openssl s_client -connect auth.huph.val.id:443 </dev/null | openssl x509 -noout -dates.
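
The local probes in step 1 reduce to a small decision. A minimal sketch, assuming a hypothetical helper named `triage` (not part of Zitadel) that takes the two status codes as arguments and points at the sections of this guide:

```shell
# triage HEALTHZ_CODE READY_CODE
# Hypothetical helper: maps the two local probe codes to the next step.
triage() {
    healthz=$1
    ready=$2
    if [ "$healthz" != "200" ]; then
        echo "zitadel not serving (healthz=$healthz): check container logs, see §CrashLoop"
    elif [ "$ready" != "200" ]; then
        echo "zitadel up but not ready (ready=$ready): setup may not have finished, check DB and setup logs"
    else
        echo "zitadel healthy locally: check discovery from outside, TLS cert, and §Proxy"
    fi
}

# Offline demonstration with literal codes.
triage 200 503
triage 000 000
triage 200 200
```

On the host, pass the live codes instead, for example triage "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/healthz)" "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/debug/ready)". curl reports 000 when it cannot connect at all.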

§User blocked

Bash

# In Zitadel console: Users → search → state.
# Common: "Inactive", "Initial" (never completed init email), "Locked" (too many bad PW).

- Locked (account lockout policy): unlock in console or POST /management/v1/users/{id}/_unlock.
- Initial: resend invite email.
- Inactive: someone deactivated the account. Check events: SELECT * FROM eventstore.events WHERE aggregate_id='<user_id>' ORDER BY created_date DESC LIMIT 20;

§Sessions — all admin sessions invalid

Usually one of:
- Signing key rotated. Check console → Projects → Keys. If rotation just happened, JWTs minted before it are now invalid and users must log in again. This is by design; communicate via Slack.
- Issuer URL changed. Old tokens have iss: https://auth.huph.val.id, apps expect iss: https://auth.huph.val.id/. The trailing slash matters. Check the apps/api JWT verify config.
- NEXTAUTH_SECRET mismatch between admin + api. See CLAUDE.md note. Fix env and restart both.
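
Issuer comparison is an exact string match, so a trailing-slash difference is easy to miss by eye. A minimal sketch, assuming a hypothetical helper `compare_issuer` that takes the issuer the app expects and the issuer Zitadel actually advertises:

```shell
# compare_issuer EXPECTED ACTUAL
# Hypothetical helper: prints "match", "trailing-slash-only", or "different".
compare_issuer() {
    expected=$1
    actual=$2
    if [ "$expected" = "$actual" ]; then
        echo "match"
    elif [ "${expected%/}" = "${actual%/}" ]; then
        # Identical once a single trailing slash is ignored.
        echo "trailing-slash-only"
    else
        echo "different"
    fi
}

# Offline demonstration using the two issuer strings from this section.
compare_issuer "https://auth.huph.val.id/" "https://auth.huph.val.id"
```

In practice the second argument would come from the discovery document (the `.issuer` field fetched in §Login broken step 2) and the first from the apps/api verify config.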

§API 401 — token validation failing

Bash

# 1. Fetch JWKS
curl -sS https://auth.huph.val.id/oauth/v2/keys | jq

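# 2. Decode a problem token's payload (token from apps/api logs).
# JWT segments are unpadded base64url: map the alphabet and restore padding first.
p=$(echo '<jwt>' | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#p} % 4 )) -ne 0 ]; do p="$p="; done
echo "$p" | base64 -d | jq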

# 3. Check:
# - iss matches expected issuer
# - aud contains the expected client_id
# - exp > now()  (clock skew? check `timedatectl`)
# - kid in header exists in JWKS

Fixes:
- JWKS 404 / wrong path → take the path from jwks_uri in the OIDC discovery document.
- Clock skew > 60s → timedatectl set-ntp true on API host.
- kid not found → JWKS cache stale in apps/api. Restart API or shorten cache TTL.
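
The header and exp checks from step 3 can be scripted. A minimal sketch, assuming hypothetical helpers `b64url_decode`, `jwt_part`, and `jwt_expired`. It uses only base64, cut, tr, and sed, and the sed-based exp extraction is a rough stand-in for jq that assumes compact JSON:

```shell
# b64url_decode: read one unpadded base64url segment on stdin, write decoded bytes.
b64url_decode() {
    seg=$(tr '_-' '/+')
    # Restore the '=' padding that JWT encoding strips.
    while [ $(( ${#seg} % 4 )) -ne 0 ]; do
        seg="${seg}="
    done
    printf '%s' "$seg" | base64 -d
}

# jwt_part TOKEN N: decode segment N (1 = header, 2 = payload).
jwt_part() {
    printf '%s' "$1" | cut -d. -f"$2" | b64url_decode
}

# jwt_expired TOKEN [NOW]: exit 0 when the exp claim is not in the future.
# A token without an exp claim is reported as not expired.
jwt_expired() {
    now=${2:-$(date +%s)}
    exp=$(jwt_part "$1" 2 | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ -n "$exp" ] && [ "$exp" -le "$now" ]
}

# Offline demonstration with a made-up, unsigned token (not a real credential).
enc() { printf '%s' "$1" | base64 | tr -d '=\n' | tr '/+' '_-'; }
token="$(enc '{"alg":"RS256","kid":"demo-key"}').$(enc '{"iss":"https://auth.huph.val.id","exp":1700000000}').sig"
jwt_part "$token" 1; echo
jwt_part "$token" 2; echo
if jwt_expired "$token"; then echo "token expired"; fi
```

Compare the kid printed from segment 1 against the kid values in the JWKS from step 1. This only inspects claims; it does not verify the signature.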

§CrashLoop

Bash

docker compose logs --tail=200 zitadel

Log line	Meaning / Fix
`masterkey must be 32 bytes`	Key file wrong length. `wc -c /etc/zitadel/masterkey`
`cannot decrypt ... cipher: message authentication failed`	WRONG master key for this DB. Stop. Restore correct key.
`database "zitadel" does not exist`	Init phase skipped. Run `init`, then `setup`.
`migration X failed: ...`	See §MigrationFail
`bind: address already in use`	Port 8080 held by other process.
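
A common cause of the 32-byte error is a trailing newline, which makes the file 33 bytes. A minimal sketch, assuming a hypothetical helper `check_masterkey`; the key in the demonstration is a dummy value:

```shell
# check_masterkey FILE: succeed only if FILE is exactly 32 bytes.
check_masterkey() {
    size=$(wc -c < "$1" | tr -d ' ')
    if [ "$size" -eq 32 ]; then
        echo "ok: 32 bytes"
    elif [ "$size" -eq 33 ] && [ -z "$(tail -c 1 "$1")" ]; then
        # Command substitution strips a newline, so an empty result means the last byte is '\n'.
        echo "bad: 33 bytes, trailing newline (write the key with printf '%s')"
        return 1
    else
        echo "bad: $size bytes, expected 32"
        return 1
    fi
}

# Offline demonstration against a temp file with a dummy key.
tmp=$(mktemp)
printf '%s' "0123456789abcdef0123456789abcdef" > "$tmp"   # 32 bytes, no newline
check_masterkey "$tmp"
echo "0123456789abcdef0123456789abcdef" > "$tmp"          # echo appends a newline
check_masterkey "$tmp" || true
rm -f "$tmp"
```

Point it at the real key file on the host. The length check says nothing about whether the key matches the database; that is the `message authentication failed` case.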

§MigrationFail

Bash

# Capture the failed migration name from logs.
# DO NOT re-run setup blindly if it left a partial migration.

# Option A (safe): restore DB from pre-upgrade backup, pin previous image, boot.
# See 02-upgrade.md Rollback.

# Option B: if the error is transient (lock timeout), retry:
docker compose run --rm zitadel setup --masterkeyFile /run/secrets/masterkey

# Option C: open a Zitadel GitHub issue with logs before attempting schema surgery.

Never hand-edit eventstore.events.

§Proxy (nginx 502/504)

Bash

# Upstream healthy?
curl -sS http://127.0.0.1:8080/debug/healthz

# nginx config matches?
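grep -RInE 'proxy_pass|grpc_pass' /etc/nginx/sites-enabled/auth*

# gRPC support enabled? Zitadel needs an HTTP/2 upstream for some endpoints.
# `proxy_pass` with `proxy_http_version 1.1;` covers REST and gRPC-Web only.
# For native gRPC confirm `grpc_pass grpc://127.0.0.1:8080;` plus `grpc_set_header Host $host;`.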

§Slow logins

Bash

# DB query latency
docker exec zitadel-postgres psql -U zitadel -d zitadel -c \
    "select query, mean_exec_time from pg_stat_statements order by mean_exec_time desc limit 10;"

# CPU saturation (password hashing is bcrypt/argon2, CPU-bound)
docker stats zitadel --no-stream

Zitadel recommends 4+ CPU cores for password-hash spikes. At 50 users this only matters during enrolment storms.

§OIDC state errors

Almost always:
- Cookie domain mismatch (auth.huph.val.id vs admin.huph.val.id): needs a shared parent domain or correct sameSite config.
- Clock skew > 30s between user browser and server (rare, unfixable server-side).
- Session store flushed (Valkey restart): users must log in again.
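
To see which parent domain two hosts could share a cookie on, compare their labels from the right. A minimal sketch, assuming a hypothetical helper `common_parent`. It does not consult the public suffix list, so a result that is only a TLD or registry suffix is not usable as a cookie domain:

```shell
# common_parent HOST_A HOST_B: print the longest domain suffix both hosts share.
common_parent() {
    a=$1
    b=$2
    suffix=""
    while [ -n "$a" ] && [ -n "$b" ]; do
        la=${a##*.}    # rightmost remaining label of a
        lb=${b##*.}    # rightmost remaining label of b
        [ "$la" = "$lb" ] || break
        suffix="$la${suffix:+.$suffix}"
        # Drop the label just compared; stop when no labels remain.
        case "$a" in *.*) a=${a%.*} ;; *) a="" ;; esac
        case "$b" in *.*) b=${b%.*} ;; *) b="" ;; esac
    done
    echo "$suffix"
}

# Offline demonstration with the two hosts from this section.
common_parent auth.huph.val.id admin.huph.val.id
```

If the cookie domain configured in the app is narrower than the printed suffix, the state cookie set on one host will not be sent to the other.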

§MFA reset

User lost phone / YubiKey:

1. Verify identity out of band (video call + employee ID). MFA bypass is the #1 social-engineering target.
2. Zitadel console → User → Multifactor → Remove.
3. User re-enrolls at next login.
4. Record the reset in docs/ops/zitadel/mfa-reset-log.md with date, admin, verification method, user.
