System Health — Monitor Platform Operations

Purpose

This page explains how to use the System Health dashboard to monitor the operational health of the HUPH platform: database, realtime socket, cron jobs, feature flags, message throughput, and AI integrations. It is the first place to check when something feels wrong ("bot is slow", "inbox not realtime", "notifications missing").

The dashboard auto-refreshes every 30 seconds.

Prerequisites

Role super_admin or system_admin
Access to https://admin.huph.val.id/settings/system

Concept: 7 health checks

Check	Meaning	Normal	Warning
db	PostgreSQL connection	healthy, latency < 50ms	down / lag > 500ms
realtime	Socket.io broker (Postgres LISTEN)	healthy, channels active	disconnect or channel down
triggers	Postgres triggers for realtime	all triggers present	missing trigger → realtime broken
cluster_coverage	All 5 clusters have counselors	5/5 coverage	< 5/5, cluster without assignee
recent_activity	Recent user/bot activity	messages within 1 hour	silent > 6 hours (possibly broken)
feature_flags	Flag config loaded	all flags readable	config file missing
message_throughput	Msg/hour rate	normal range	significant spike/drop

Overall status: healthy when all seven checks pass, degraded when one or more checks fail.
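
The rollup rule can be sketched as a small function. This is a minimal illustration, not the platform's actual implementation: the function name, the boolean result shape, and the choice to treat a missing check as failed are all assumptions. Only the seven check names and the healthy/degraded rule come from this page.

```python
from typing import Mapping

# The seven check names listed in the table above.
CHECKS = (
    "db",
    "realtime",
    "triggers",
    "cluster_coverage",
    "recent_activity",
    "feature_flags",
    "message_throughput",
)


def overall_status(results: Mapping[str, bool]) -> str:
    """Return 'healthy' only when every check passed, else 'degraded'.

    Assumption: a check that is absent from `results` counts as failed.
    """
    failed = [name for name in CHECKS if not results.get(name, False)]
    return "healthy" if not failed else "degraded"
```

For example, a result set in which only realtime is False yields "degraded", the same as one in which several checks fail.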

Steps

1. Open health dashboard

Sidebar: ADMIN → System Health (or /settings/system).

Dashboard shows cards:

 ┌─ System Health ───────── ● Healthy ────── ⟳ 30s ──┐
 │                                                    │
 │  DB          Realtime     Triggers    Cluster     │
 │  ✓ 12ms      ✓ 3 chans    ✓ 4/4       ✓ 5/5       │
 │                                                    │
 │  Activity    Flags        Throughput               │
 │  ✓ 42 msg/h  ✓ 8 loaded   ✓ 3.2 msg/min            │
 │                                                    │
 └────────────────────────────────────────────────────┘

2. Drill down into check details

Click a card to see details:

- db: connection pool size, active connections, slow query count
- realtime: list of active channels, subscriber counts
- triggers: trigger names with last_fired timestamps
- cluster_coverage: cluster → counselor count mapping
- recent_activity: last user msg / last bot msg timestamps
- feature_flags: flags + values
- message_throughput: 5-min window rates

3. Check cron jobs

The Cron Jobs section sits below the health checks:

 ┌─ Cron Jobs ────────────────────────────────────────┐
 │ Name              Last Run      Status    Fails 24h│
 │ follow_up         5 min ago     ✓ success   0      │
 │ retention_sweep   2 hours ago   ✓ success   0      │
 │ nightly_sync      14 hours ago  ✓ success   0      │
 │ ...                                                │
 └────────────────────────────────────────────────────┘

Warnings:

- Fails 24h > 3 → job needs investigation
- Last Run > expected interval → scheduler stuck
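
The two warning rules can be expressed as a short check. This is a sketch under stated assumptions: the `CronJob` type, its field names, and the per-job `expected_interval` value are hypothetical, since this page does not say where the expected interval is stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CronJob:
    name: str
    last_run: datetime
    fails_24h: int
    # Hypothetical field: how often the job is expected to run.
    expected_interval: timedelta


def cron_warnings(job: CronJob, now: datetime) -> list[str]:
    """Apply the two warning rules from the Cron Jobs section."""
    warnings = []
    if job.fails_24h > 3:
        warnings.append("job needs investigation")
    if now - job.last_run > job.expected_interval:
        warnings.append("scheduler stuck")
    return warnings
```

A job can trigger both warnings at once, so the function returns a list rather than a single status.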

4. Check config items

Config section shows editable system settings:

Key	Value	Editable
`system.anthropic_model`	claude-haiku-4-5-20251001	Yes
`system.anthropic_api_key`	✓ configured	Yes (masked)
`system.openai_api_key`	✓ configured	Yes (masked)
`system.intent_llm_enabled`	true	Yes

Click the edit (pencil) icon to change a value. The audit log records the change automatically.

5. Service health (legacy mini-cards)

Some external services are checked via HTTP ping:

- Dify API (http://localhost:5001)
- Langfuse (http://localhost:3300)
- Valkey (cache)
- Milvus (vector DB)

Each card shows a status of healthy or unhealthy, plus latency.

Example scenarios

"Inbox not realtime" complaint. A counselor reports that messages aren't arriving. Admin opens System Health and checks the realtime card. If it is unhealthy, trigger the Phase 1.8 fixer: run docker compose restart huph-api, then verify in the logs that the socket re-attaches.

"Bot super slow". A user complains that responses take more than 15 seconds. Admin checks the Throughput card and DB latency. If the DB is slow (> 500ms), check Postgres connections. If throughput is normal but the bot is still slow, check Dify API health (service health card).

Morning check routine. Every day at 08:00, super_admin opens System Health for 30 seconds. All green → proceed to conversations. Any red → investigate immediately before counselors start.

Cron failed alert. The follow_up cron job has failed 5 times in 24 hours. Admin clicks the cron job, then checks the error_message column of the cron_runs table in the DB. The usual cause is an Anthropic API rate limit or an invalid follow-up template. Fix the template via the Follow-up page.
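
The lookup in cron_runs might resemble the query below. This is only a sketch: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and every column except error_message (job_name, status, started_at) is a hypothetical name. The sample rows are made up for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for cron_runs with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    # Only error_message is named on this page; other columns are assumed.
    conn.execute(
        "CREATE TABLE cron_runs ("
        " job_name TEXT, status TEXT, started_at TEXT, error_message TEXT)"
    )
    conn.executemany(
        "INSERT INTO cron_runs VALUES (?, ?, ?, ?)",
        [
            ("follow_up", "failed", "2025-01-01T08:00:00", "rate limit exceeded"),
            ("follow_up", "success", "2025-01-01T08:05:00", None),
            ("follow_up", "failed", "2025-01-01T08:10:00", "invalid template"),
            ("nightly_sync", "failed", "2025-01-01T02:00:00", "timeout"),
        ],
    )
    return conn


def recent_failures(conn: sqlite3.Connection, job_name: str, limit: int = 5):
    """Return (started_at, error_message) for the newest failed runs of a job."""
    return conn.execute(
        "SELECT started_at, error_message FROM cron_runs"
        " WHERE job_name = ? AND status = 'failed'"
        " ORDER BY started_at DESC LIMIT ?",
        (job_name, limit),
    ).fetchall()
```

Reading the newest failures first makes it easier to tell whether the errors share one cause, such as a rate limit, or vary from run to run.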

Troubleshooting

All cards loading forever. Symptom: the dashboard auto-refreshes but the cards stay empty. Cause: the backend endpoint /api/v1/health/full isn't responding. Fix: check that the huph-api container is running (docker ps | grep huph-api). If it is down, restart it.
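
To confirm from a script whether the endpoint responds, a probe along these lines may help. It is a sketch only: the base URL of the API is not stated on this page, so the caller must supply it, and treating HTTP 200 as "responding" is an assumption about the endpoint's behavior.

```python
import urllib.error
import urllib.request


def health_endpoint_responds(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/api/v1/health/full answers with HTTP 200.

    Any connection error, timeout, non-2xx status, or malformed URL
    is reported as False.
    """
    url = base_url.rstrip("/") + "/api/v1/health/full"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A False result only says the endpoint did not answer in time. The docker ps check above is still needed to tell a stopped container from a slow one.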

Realtime healthy but inbox still not updating. Symptom: the card is green but the UI is stale. Cause: the browser websocket disconnected. Fix: refresh the inbox page (Ctrl+R). If it is still broken, check the NEXTAUTH_SECRET env var in the API; it must match between the admin and API containers (per the CLAUDE.md gotcha).

Cluster coverage 4/5 even though 5 counselors assigned. Symptom: the UI shows the cluster has a counselor but the health check reports a gap. Cause: a counselor has is_active = false but is still assigned to the cluster. Fix: open Users settings, then reactivate that counselor or assign another one.
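
The mismatch is easier to see when coverage is computed over active counselors only. The sketch below assumes a simple `Counselor` record and placeholder cluster names. Only the is_active flag and the 5-cluster rule come from this page.

```python
from dataclasses import dataclass


@dataclass
class Counselor:
    name: str
    cluster: str
    is_active: bool


def cluster_coverage(clusters: list[str], counselors: list[Counselor]):
    """Return (covered_count, uncovered_clusters), counting active counselors only."""
    active_clusters = {c.cluster for c in counselors if c.is_active}
    uncovered = [cl for cl in clusters if cl not in active_clusters]
    return len(clusters) - len(uncovered), uncovered
```

With five clusters and one cluster whose only counselor is inactive, this returns a count of 4 and names that cluster, matching the 4/5 reading on the card.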

Feature flag check "missing". Symptom: the card indicates a flag was not loaded. Cause: the env var is not set in the container. Fix: update /opt/huph/.env (e.g., BILINGUAL_ENABLED=false), then run docker compose up -d huph-api.
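
Before restarting the container, it can be useful to confirm that the flag is actually present in the .env file. The sketch below uses a deliberately simplified parser (no quoting or `export` handling), and the list of required flags is a hypothetical input. BILINGUAL_ENABLED is the only flag name taken from this page.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_flags(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required flag names that are absent from the parsed env."""
    return [name for name in required if name not in env]
```

A flag set to false still counts as present. Only a flag with no line at all is reported as missing.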