Monitoring & Alerting

Zitadel exposes Prometheus metrics at /debug/metrics (generated via OpenTelemetry, on by default). HUPH already runs Phoenix (traces) and Langfuse (LLM), but neither scrapes Prometheus metrics. Add a small Prometheus + Alertmanager stack alongside them.

Scrape config

Create docker-compose.monitoring.yml (new, small):

YAML
services:
  prometheus:
    image: prom/prometheus:v2.54.0
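    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Rule files referenced by rule_files in prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prom-data:/prometheus
    # host.docker.internal is not defined by default on Linux hosts
    extra_hosts: ["host.docker.internal:host-gateway"]
    ports: ["127.0.0.1:9090:9090"]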

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["127.0.0.1:9093:9093"]

  node-exporter:
    image: prom/node-exporter:v1.8.2
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
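    command:
      - "--path.rootfs=/host"
      # Needed for the textfile metrics used by the hygiene alerts
      - "--collector.textfile.directory=/host/var/lib/node_exporter/textfile"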

volumes:
  prom-data:

prometheus.yml:

YAML
global:
  scrape_interval: 30s
  evaluation_interval: 30s

rule_files: [ "/etc/prometheus/rules/zitadel.yml" ]

alerting:
  alertmanagers:
    - static_configs: [ { targets: ["alertmanager:9093"] } ]

scrape_configs:
  - job_name: zitadel
    metrics_path: /debug/metrics
    static_configs:
      - targets: ["host.docker.internal:8080"]
  - job_name: node
    static_configs:
      - targets: ["host.docker.internal:9100"]
  - job_name: postgres
    static_configs:
      - targets: ["host.docker.internal:9187"]   # postgres_exporter

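The ZitadelReadinessFailing rule in the alert rules queries probe_success{job="zitadel-ready"}, which none of the scrape jobs above produce. A minimal sketch of one way to supply it, assuming a blackbox_exporter container reachable as blackbox:9115 (hypothetical service name, not part of the compose file above) and assuming Zitadel's readiness endpoint is /debug/ready:

```yaml
# Append to scrape_configs in prometheus.yml.
scrape_configs:
  - job_name: zitadel-ready
    metrics_path: /probe
    params:
      module: [http_2xx]          # default module in blackbox_exporter's sample config
    static_configs:
      # Assumed readiness URL; adjust to the actual Zitadel endpoint.
      - targets: ["http://host.docker.internal:8080/debug/ready"]
    relabel_configs:
      # Pass the target URL to blackbox_exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the probed URL.
      - target_label: __address__
        replacement: blackbox:9115  # hypothetical blackbox_exporter address
```

The blackbox_exporter container also needs to resolve host.docker.internal for the probe to reach Zitadel.
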
Alert rules (prometheus/rules/zitadel.yml)

Actionable only. No low-signal noise.

YAML
groups:
- name: zitadel-availability
  rules:
  - alert: ZitadelDown
    expr: up{job="zitadel"} == 0
    for: 2m
    labels: { severity: page }
    annotations:
      summary: "Zitadel scrape failing for 2m"
      runbook: "docs/ops/zitadel/07-troubleshooting.md#login-broken"

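  - alert: ZitadelReadinessFailing
    # Requires a blackbox_exporter probe scraped under job "zitadel-ready".
    expr: probe_success{job="zitadel-ready"} == 0
    for: 3m
    labels: { severity: page }
    annotations:
      summary: "Zitadel readiness probe failing for 3m"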

  - alert: ZitadelHigh5xx
    expr: |
      sum(rate(http_requests_total{job="zitadel",code=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="zitadel"}[5m])) > 0.02
    for: 5m
    labels: { severity: page }
    annotations:
      summary: ">2% 5xx on Zitadel for 5m"

  - alert: ZitadelRSSHigh
    expr: process_resident_memory_bytes{job="zitadel"} > 2 * 1024 * 1024 * 1024
    for: 15m
    labels: { severity: ticket }
    annotations:
      summary: "Zitadel RSS > 2GB for 15m; suspect projection leak"

- name: zitadel-auth-signals
  rules:
  - alert: LoginFailureSpike
    # Requires Zitadel to expose zitadel_auth_request_failed_total; if it does not, derive it from access logs.
    expr: |
      rate(zitadel_auth_request_failed_total[5m])
        > 3 * rate(zitadel_auth_request_failed_total[1h] offset 1h)
    for: 10m
    labels: { severity: page }
    annotations:
      summary: "Login failures 3x above 1h-ago baseline; possible credential stuffing"

  - alert: JWKSRotationStale
    expr: time() - zitadel_last_signing_key_rotation_seconds > 90 * 86400
    for: 1h
    labels: { severity: ticket }

- name: zitadel-db
  rules:
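  - alert: DBConnsSaturated
    # Aggregate both sides to the instance label; otherwise the label sets
    # of the two metrics do not match and the comparison returns nothing.
    expr: |
      sum by (instance) (pg_stat_activity_count{datname="zitadel"})
        > 0.9 * max by (instance) (pg_settings_max_connections)
    for: 5m
    labels: { severity: page }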

  - alert: DBReplicationLag
    # only if replica exists
    expr: pg_replication_lag_seconds > 60
    for: 5m
    labels: { severity: ticket }

- name: zitadel-hygiene
  rules:
  - alert: MasterKeyReadableByWorld
    # Node exporter textfile collector writes this from a cron:
    #   echo masterkey_mode $(stat -c '%a' /etc/zitadel/masterkey) > /var/lib/node_exporter/textfile/zitadel.prom
    # stat prints the octal mode as decimal digits, so a numeric ">" would miss
    # modes such as 044; compare for inequality with 400 instead.
    expr: masterkey_mode != 400
    for: 1m
    labels: { severity: page }
    annotations:
      summary: "Master key file mode is not 0400; fix permissions immediately"

  - alert: BackupStale
    expr: time() - zitadel_backup_last_success_timestamp_seconds > 26 * 3600
    for: 10m
    labels: { severity: page }
    annotations:
      summary: "No successful Zitadel DB backup in 26h"

  - alert: ZitadelUpgradeAvailable
    # Custom exporter polls GitHub releases and emits this gauge.
    expr: zitadel_latest_release_age_seconds > 14 * 86400
    for: 1d
    labels: { severity: ticket }
    annotations:
      summary: "Running Zitadel is >14 days behind latest release; review advisories"

Alertmanager routes severity=page to PagerDuty/Slack #zitadel-oncall, and severity=ticket to a GitHub issue via webhook.
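
The routing described above could look roughly like the following alertmanager.yml. This is a sketch, not the deployed config: the receiver names, the PagerDuty key, the Slack webhook URL, and the GitHub issue bridge URL are all placeholders, and it assumes page alerts go to both PagerDuty and Slack.

```yaml
# alertmanager.yml (sketch; replace every placeholder before loading)
route:
  receiver: ticket                # fallback for alerts without a matching severity
  group_by: [alertname]
  routes:
    - matchers: ['severity="page"']
      receiver: oncall
    - matchers: ['severity="ticket"']
      receiver: ticket

receivers:
  - name: oncall
    pagerduty_configs:
      - routing_key: "PAGERDUTY_EVENTS_V2_KEY"                    # placeholder
    slack_configs:
      - api_url: "https://hooks.slack.com/services/PLACEHOLDER"   # placeholder
        channel: "#zitadel-oncall"
        send_resolved: true
  - name: ticket
    webhook_configs:
      # Hypothetical service that turns Alertmanager webhooks into GitHub issues.
      - url: "http://github-issue-bridge:8080/alerts"
```

Routing on the severity label alone keeps the tree flat: adding a new alert only requires choosing page or ticket in the rule file.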

Log aggregation

  • Zitadel logs to stdout. View the logs with docker compose logs, and persist them to journald via Docker's journald logging driver.
  • Ship to central store (Loki, if running; otherwise 30-day journald retention is acceptable at 50-user scale).
  • Auth events (login success/fail, token issue) are emitted as structured events in eventstore.events. This is the canonical audit trail, not the logs. Retain it in the DB forever (it's the source of truth).
  • Retain container stdout logs 90 days minimum for CVE forensics.
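
As an illustration of the first bullet, the journald driver can be selected per service in the Compose file. A minimal sketch, assuming the Zitadel service is named zitadel (the service name and tag value are assumptions):

```yaml
services:
  zitadel:
    logging:
      driver: journald
      options:
        tag: zitadel    # stored in the CONTAINER_TAG journal field
```

With this in place, journalctl CONTAINER_TAG=zitadel filters the journal to this container's output. Retention is then governed by journald's own settings, not by Docker.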

What NOT to alert on

  • Single 401 (expected, users mistype passwords).
  • Short (<2m) latency blips.
  • Single failed scrape (use for: 2m).
  • Memory < 2GB (Zitadel normally sits 200-800MB).