Monitoring & Alerting
Zitadel exposes Prometheus-format metrics at /debug/metrics (produced by its OpenTelemetry instrumentation, on by default). HUPH already runs Phoenix (traces) and Langfuse (LLM), but neither scrapes Prometheus endpoints. Add a small Prometheus + Alertmanager stack alongside.
Scrape config
Add to docker-compose.monitoring.yml (new, small):
YAML
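services:
  prometheus:
    image: prom/prometheus:v2.54.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Rules directory referenced by rule_files in prometheus.yml
      # (assumes the rules file lives at ./prometheus/rules/zitadel.yml)
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prom-data:/prometheus
    # On Linux, host.docker.internal only resolves if mapped explicitly
    extra_hosts: ["host.docker.internal:host-gateway"]
    ports: ["127.0.0.1:9090:9090"]
  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["127.0.0.1:9093:9093"]
  node-exporter:
    image: prom/node-exporter:v1.8.2
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - "--path.rootfs=/host"
      # Textfile collector reads nothing unless a directory is set;
      # needed for the masterkey_mode metric written by cron on the host
      - "--collector.textfile.directory=/host/var/lib/node_exporter/textfile"
volumes:
  prom-data: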
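The postgres scrape job in prometheus.yml expects a postgres_exporter on port 9187, which the compose file does not define. A minimal sketch of such a service follows. The image tag, the connection string, the credentials, and the choice of host networking (mirroring node-exporter) are all assumptions to adapt, not part of the existing setup.

```yaml
services:
  postgres-exporter:
    # Assumed image/tag; pin to whatever version you have vetted
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    # Host networking so the exporter listens on the host's :9187,
    # matching the host.docker.internal:9187 scrape target
    network_mode: host
    environment:
      # Hypothetical read-only monitoring user; assumes Postgres is
      # reachable on localhost:5432 from the host
      DATA_SOURCE_NAME: "postgresql://monitor:CHANGE_ME@localhost:5432/zitadel?sslmode=disable"
```

With host networking the exporter listens on all host interfaces by default, so restrict port 9187 at the firewall if the host is reachable from outside.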
prometheus.yml:
YAML
global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files: [ "/etc/prometheus/rules/zitadel.yml" ]
alerting:
  alertmanagers:
    - static_configs: [ { targets: ["alertmanager:9093"] } ]
scrape_configs:
  - job_name: zitadel
    metrics_path: /debug/metrics
    static_configs:
      - targets: ["host.docker.internal:8080"]
  - job_name: node
    static_configs:
      - targets: ["host.docker.internal:9100"]
  - job_name: postgres
    static_configs:
      - targets: ["host.docker.internal:9187"] # postgres_exporter
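The ZitadelReadinessFailing rule in the next section queries probe_success{job="zitadel-ready"}, a metric that comes from blackbox_exporter rather than from any of the three jobs above. A sketch of the extra scrape job, using the usual blackbox relabeling pattern, is below. The exporter address blackbox:9115 and the readiness path /debug/ready are assumptions; check both against your deployment.

```yaml
scrape_configs:
  - job_name: zitadel-ready
    metrics_path: /probe
    params:
      module: [http_2xx]   # module from blackbox_exporter's default config
    static_configs:
      # Assumed readiness URL; this is the URL being probed, not scraped
      - targets: ["http://host.docker.internal:8080/debug/ready"]
    relabel_configs:
      # Pass the probed URL to blackbox_exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter (hypothetical service name)
      - target_label: __address__
        replacement: blackbox:9115
```

The blackbox container needs the same host.docker.internal mapping as Prometheus if it is to reach Zitadel on the host.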
Alert rules (prometheus/rules/zitadel.yml)
Actionable only. No low-signal noise.
YAML
groups:
  - name: zitadel-availability
    rules:
      - alert: ZitadelDown
        expr: up{job="zitadel"} == 0
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Zitadel scrape failing for 2m"
          runbook: "docs/ops/zitadel/07-troubleshooting.md#login-broken"
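      - alert: ZitadelReadinessFailing
        # probe_success comes from blackbox_exporter; this needs a probe job
        # named zitadel-ready in addition to the three scrape jobs in prometheus.yml.
        expr: probe_success{job="zitadel-ready"} == 0
        for: 3m
        labels: { severity: page }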
      - alert: ZitadelHigh5xx
        expr: |
          sum(rate(http_requests_total{job="zitadel",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="zitadel"}[5m])) > 0.02
        for: 5m
        labels: { severity: page }
        annotations:
          summary: ">2% 5xx on Zitadel for 5m"
      - alert: ZitadelRSSHigh
        expr: process_resident_memory_bytes{job="zitadel"} > 2 * 1024 * 1024 * 1024
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Zitadel RSS > 2GB for 15m; suspect projection leak"
  - name: zitadel-auth-signals
    rules:
      - alert: LoginFailureSpike
        # Requires Zitadel to expose zitadel_auth_request_failed_total; if it does not, derive the counter from access logs.
        expr: |
          rate(zitadel_auth_request_failed_total[5m])
            > 3 * rate(zitadel_auth_request_failed_total[1h] offset 1h)
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "Login failures 3x above 1h-ago baseline; possible credential stuffing"
      - alert: JWKSRotationStale
        expr: time() - zitadel_last_signing_key_rotation_seconds > 90 * 86400
        for: 1h
        labels: { severity: ticket }
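  - name: zitadel-db
    rules:
      - alert: DBConnsSaturated
        # pg_stat_activity_count is split by state and carries datname;
        # pg_settings_max_connections does not. Sum per instance and match
        # on instance only, otherwise the comparison matches no series.
        expr: |
          sum by (instance) (pg_stat_activity_count{datname="zitadel"})
            > on (instance) (0.9 * pg_settings_max_connections)
        for: 5m
        labels: { severity: page }
      - alert: DBReplicationLag
        # only if replica exists
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels: { severity: ticket }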
  - name: zitadel-hygiene
    rules:
      - alert: MasterKeyReadableByWorld
        # Node exporter textfile collector writes this from a cron:
        #   echo masterkey_mode $(stat -c '%a' /etc/zitadel/masterkey) > /var/lib/node_exporter/textfile/zitadel.prom
        # stat prints the octal mode digits (e.g. 400, 644); Prometheus reads
        # them as a decimal number, so 600 or 644 compare as greater than 400.
        expr: masterkey_mode > 400
        for: 1m
        labels: { severity: page }
        annotations:
          summary: "Master key file mode > 0400; fix permissions immediately"
      - alert: BackupStale
        expr: time() - zitadel_backup_last_success_timestamp_seconds > 26 * 3600
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "No successful Zitadel DB backup in 26h"
      - alert: ZitadelUpgradeAvailable
        # Custom exporter polls GitHub releases and emits this gauge.
        # Informational, but labeled ticket because Alertmanager only
        # routes severity=page and severity=ticket.
        expr: zitadel_latest_release_age_seconds > 14 * 86400
        for: 1d
        labels: { severity: ticket }
        annotations:
          summary: "Running Zitadel is >14 days behind latest release; review advisories"
Alertmanager routes severity=page to PagerDuty/Slack #zitadel-oncall, severity=ticket to GitHub issue via webhook.
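A minimal alertmanager.yml sketch of that routing, for orientation only. The receiver names, the PagerDuty key, the Slack webhook URL, and the issue-bridge webhook endpoint are placeholders and assumptions; only the severity labels and the #zitadel-oncall channel come from this document.

```yaml
route:
  receiver: tickets              # fallback for alerts without a known severity
  group_by: [alertname]
  routes:
    - matchers: ['severity="page"']
      receiver: oncall
    - matchers: ['severity="ticket"']
      receiver: tickets
receivers:
  - name: oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-events-v2-key>"                 # placeholder
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>" # placeholder
        channel: "#zitadel-oncall"
  - name: tickets
    webhook_configs:
      # Hypothetical service that turns the webhook payload into a GitHub issue
      - url: "http://issue-bridge:8080/alerts"
```

Alertmanager has no native GitHub receiver, so the ticket path depends on whatever service sits behind the webhook URL.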
Log aggregation
- Zitadel logs to stdout → docker compose logs → journald via the Docker logging driver journald.
- Ship to a central store (Loki, if running; otherwise 30-day journald retention is acceptable at 50-user scale).
- Auth events (login success/fail, token issue) are emitted as structured events in eventstore.events; this is the canonical audit trail, not logs. Retain in DB forever (it's the source of truth).
- Retain container stdout logs 90 days minimum for CVE forensics.
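Routing container stdout to journald is set per service in the compose file. A sketch, assuming the service is named zitadel in your main compose file (that name and the tag value are illustrative):

```yaml
services:
  zitadel:
    logging:
      driver: journald
      options:
        # Sets the syslog identifier, so entries can be filtered
        # with: journalctl -t zitadel
        tag: zitadel
```

Retention itself is not controlled here; it is governed by journald's own configuration on the host (for example MaxRetentionSec in journald.conf).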
What NOT to alert on
- Single 401 (expected, users mistype passwords).
- Short (<2m) latency blips.
- Single failed scrape (use for: 2m).
- Memory < 2GB (Zitadel normally sits 200-800MB).