System Health — Monitor Platform Operations
Purpose
This page explains how to use the System Health dashboard to monitor the operational health of the HUPH platform: database, realtime socket, cron jobs, feature flags, message throughput, and AI integrations. It is the first place to check when something feels wrong ("bot is slow", "inbox not realtime", "notifications missing").
The dashboard auto-refreshes every 30 seconds.
Prerequisites
- Role super_admin or system_admin
- Access to https://admin.huph.val.id/settings/system
Concept: 7 health checks
| Check | Meaning | Normal | Warning |
|---|---|---|---|
| db | PostgreSQL connection | healthy, latency < 50ms | down / lag > 500ms |
| realtime | Socket.io broker (Postgres LISTEN) | healthy, channels active | disconnect or channel down |
| triggers | Postgres triggers for realtime | all triggers present | missing trigger → realtime broken |
| cluster_coverage | All 5 clusters have counselors | 5/5 coverage | < 5/5, cluster without assignee |
| recent_activity | Recent user/bot activity | messages within 1 hour | silent > 6 hours (possibly broken) |
| feature_flags | Flag config loaded | all flags readable | config file missing |
| message_throughput | Msg/hour rate | normal range | significant spike/drop |
Overall status: healthy if all checks pass, degraded if one or more checks fail.
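The aggregation rule above can be sketched in a few lines. This is a minimal illustration, not the backend's actual code: the function name `overall_status` and the boolean result mapping are assumptions, and treating a missing check as failed is a choice made here for safety.

```python
from typing import Mapping

# The seven check names from the table above.
CHECKS = (
    "db",
    "realtime",
    "triggers",
    "cluster_coverage",
    "recent_activity",
    "feature_flags",
    "message_throughput",
)


def overall_status(results: Mapping[str, bool]) -> str:
    """Return "healthy" only if every check passed, otherwise "degraded".

    `results` maps a check name to True (OK) or False (failed).
    A check absent from `results` counts as failed (assumption).
    """
    all_ok = all(results.get(name, False) for name in CHECKS)
    return "healthy" if all_ok else "degraded"
```

For example, a result set where only `realtime` is False yields "degraded", which matches the rule that a single failing check is enough.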
Steps
1. Open health dashboard
Sidebar: ADMIN → System Health (or /settings/system).
Dashboard shows cards:
┌─ System Health ───────── ● Healthy ────── ⟳ 30s ──┐
│ │
│ DB Realtime Triggers Cluster │
│ ✓ 12ms ✓ 3 chans ✓ 4/4 ✓ 5/5 │
│ │
│ Activity Flags Throughput │
│ ✓ 42 msg/h ✓ 8 loaded ✓ 3.2 msg/min │
│ │
└────────────────────────────────────────────────────┘
2. Drill down into check details
Click a card to see details:
- db: connection pool size, active connections, slow query count
- realtime: active channels, subscriber counts
- triggers: trigger names with last_fired timestamps
- cluster_coverage: cluster → counselor count mapping
- recent_activity: last user msg / last bot msg timestamps
- feature_flags: flags + values
- message_throughput: 5-min window rates
3. Check cron jobs
Cron Jobs section below health checks:
┌─ Cron Jobs ────────────────────────────────────────┐
│ Name Last Run Status Fails 24h│
│ follow_up 5 min ago ✓ success 0 │
│ retention_sweep 2 hours ago ✓ success 0 │
│ nightly_sync 14 hours ago ✓ success 0 │
│ ... │
└────────────────────────────────────────────────────┘
Warnings:
- Fails 24h > 3 → job needs investigation
- Last Run > expected interval → scheduler stuck
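The two warning rules above can be expressed as a small check. This is a sketch under assumptions: the `CronJob` shape and the `expected_interval` field are hypothetical (the dashboard does not document how it stores a job's schedule), and the 15-minute interval in the usage note is illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CronJob:
    name: str
    last_run: datetime
    fails_24h: int
    expected_interval: timedelta  # hypothetical: how often the job should run


def cron_warnings(job: CronJob, now: datetime) -> List[str]:
    """Apply the two warning rules from this section to one job."""
    warnings = []
    if job.fails_24h > 3:
        warnings.append("needs investigation")
    if now - job.last_run > job.expected_interval:
        warnings.append("scheduler may be stuck")
    return warnings
```

A job such as `CronJob("follow_up", last_run, 5, timedelta(minutes=15))` whose last run was two hours ago would trigger both warnings; a job with exactly 3 failures does not trigger the first one, because the rule is strictly greater than 3.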
4. Check config items
Config section shows editable system settings:
| Key | Value | Editable |
|---|---|---|
| system.anthropic_model | claude-haiku-4-5-20251001 | Yes |
| system.anthropic_api_key | ✓ configured | Yes (masked) |
| system.openai_api_key | ✓ configured | Yes (masked) |
| system.intent_llm_enabled | true | Yes |
Click the edit (pencil) icon to change a value. The audit log records the change automatically.
5. Service health (legacy mini-cards)
Some external services are checked via HTTP ping:
- Dify API (http://localhost:5001)
- Langfuse (http://localhost:3300)
- Valkey (cache)
- Milvus (vector DB)
Each card shows a status (healthy or unhealthy) plus latency.
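To reproduce a similar ping by hand from the host, something like the following may help. It is a sketch only: the function name `ping` is hypothetical, it assumes the service answers a plain GET on the URL listed above with a 2xx or 3xx status, and it is not how the dashboard itself performs the check. The `opener` parameter exists so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def ping(url: str, timeout: float = 3.0, opener=urllib.request.urlopen) -> dict:
    """Return status ("healthy"/"unhealthy") and latency in ms for one URL."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            healthy = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or HTTP 4xx/5xx.
        healthy = False
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {
        "url": url,
        "status": "healthy" if healthy else "unhealthy",
        "latency_ms": latency_ms,
    }
```

Calling `ping("http://localhost:5001")` and `ping("http://localhost:3300")` on the server gives a quick second opinion when a service card looks wrong.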
Example scenarios
"Inbox not realtime" complaint. Counselor reports messages
aren't arriving. Admin opens System Health → checks realtime
card. If unhealthy, trigger Phase 1.8 fixer: run docker compose
restart huph-api and verify socket re-attaches in logs.
"Bot super slow". User complains response > 15 sec. Admin checks Throughput + DB latency. If DB slow (>500ms), check Postgres connections. If throughput normal but bot slow, check Dify API health (service health card).
Morning check routine. Every day at 08:00, super_admin opens System Health for 30 seconds. All green → proceed to conversations. Any red → investigate immediately before counselors start.
Cron failed alert. The follow_up cron failed 5x in 24h. Admin
clicks the cron job → checks the error_message column in the DB
(cron_runs table). The usual cause is an Anthropic API rate limit
or an invalid follow-up template. Fix the template via the
Follow-up page.
Troubleshooting
All cards loading forever. Symptom: the dashboard auto-refreshes
but the cards stay empty. Cause: backend /api/v1/health/full isn't
responding. Fix: check that the huph-api container is running
(docker ps | grep huph-api). If it is down, restart it.
Realtime healthy but inbox still not updating. Symptom: card
green but UI stale. Cause: browser websocket disconnect. Fix:
refresh inbox page (Ctrl+R). If still broken, check
NEXTAUTH_SECRET env in API — must match between admin and API
containers (per CLAUDE.md gotcha).
Cluster coverage 4/5 even though 5 counselors assigned.
Symptom: UI shows the cluster has a counselor but the health check
reports a gap. Cause: the counselor has is_active = false but is
still assigned to the cluster. Fix: open Users settings →
reactivate the counselor or assign another one.
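The mismatch described above comes from counting only active counselors. A minimal sketch of that logic, assuming assignments are available as (cluster, is_active) pairs; the function name and the cluster labels in the usage note are hypothetical:

```python
from typing import Iterable, List, Tuple


def uncovered_clusters(
    clusters: Iterable[str],
    assignments: Iterable[Tuple[str, bool]],
) -> List[str]:
    """Return clusters that have no *active* counselor assigned.

    `assignments` holds one (cluster, counselor_is_active) pair per
    counselor. An inactive counselor does not count toward coverage.
    """
    covered = {cluster for cluster, is_active in assignments if is_active}
    return [cluster for cluster in clusters if cluster not in covered]
```

With five clusters "A" to "E" and the only counselor on "E" marked inactive, the function returns `["E"]`, which corresponds to the 4/5 reading on the card even though every cluster shows an assignee in the UI.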
Feature flag check "missing". Symptom: card indicates flag not
loaded. Cause: env var not set in container. Fix: update
/opt/huph/.env (e.g., BILINGUAL_ENABLED=false) then
docker compose up -d huph-api.
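Before recreating the container, it can be worth confirming that the flag really is present in the env file. The helper below is a sketch: the name `missing_env_keys` is hypothetical, it handles only simple KEY=value lines (no `export` prefix or quoting rules), and a key with an empty value is treated as set.

```python
from typing import Iterable, List


def missing_env_keys(env_text: str, required: Iterable[str]) -> List[str]:
    """Return the required keys that do not appear in a .env file's text."""
    defined = set()
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, _value = line.partition("=")
        defined.add(key.strip())
    return [key for key in required if key not in defined]
```

For instance, reading /opt/huph/.env into a string and calling `missing_env_keys(text, ["BILINGUAL_ENABLED"])` returns an empty list once the variable has been added.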
See also
- Audit log — review admin actions related to health changes
- Troubleshooting — common problems & solutions across layers
- Getting started — roles and access