System Health: Platform Operations Monitor
Purpose
This page explains how to use the System Health dashboard to monitor the operational health of the HUPH platform: database, realtime socket, cron jobs, feature flags, message throughput, and AI integrations. It is the first place to look when something feels wrong ("bot is slow", "inbox is not realtime", "notifications do not appear").
The dashboard auto-refreshes every 30 seconds.
Prerequisites
- Role super_admin or system_admin
- Access to https://admin.huph.val.id/settings/system
Concepts: 7 health checks
| Check | Meaning | Normal | Warning |
|---|---|---|---|
| db | PostgreSQL connection | healthy, latency < 50ms | down / lag > 500ms |
| realtime | Socket.io broker (Postgres LISTEN) | healthy, channels active | disconnect or channel down |
| triggers | Postgres triggers for realtime | all triggers present | missing trigger → realtime broken |
| cluster_coverage | All 5 clusters have a counselor | 5/5 coverage | < 5/5, a cluster has no assignee |
| recent_activity | Recent user/bot activity exists | messages within the last hour | silent > 6 hours (possibly broken) |
| feature_flags | Flag config loaded | all flags readable | config file missing |
| message_throughput | Message rate (msgs/hour) | normal range | significant spike or drop |
Overall status: healthy if all checks pass, degraded if one or more checks fail.
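The overall-status rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the platform's actual implementation: the function name `overall_status`, the boolean result shape, and the choice to treat a missing check as failed are all assumptions.

```python
from typing import Mapping

# The seven check names from the table above.
CHECKS = (
    "db", "realtime", "triggers", "cluster_coverage",
    "recent_activity", "feature_flags", "message_throughput",
)

def overall_status(results: Mapping[str, bool]) -> str:
    """Return "healthy" only if every check passed, else "degraded".

    Assumption: a check that is absent from `results` counts as failed.
    """
    all_ok = all(results.get(name, False) for name in CHECKS)
    return "healthy" if all_ok else "degraded"
```

With all seven checks passing the result is "healthy"; a single failing check such as `db` is enough to make it "degraded".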
Steps
1. Open the health dashboard
Sidebar: ADMIN → System Health (or /settings/system).
The dashboard displays cards:
┌─ System Health ───────── ● Healthy ─────── ⟳ 30s ──┐
│                                                    │
│  DB          Realtime    Triggers    Cluster       │
│  ✓ 12ms      ✓ 3 chans   ✓ 4/4       ✓ 5/5         │
│                                                    │
│  Activity    Flags       Throughput                │
│  ✓ 42 msg/h  ✓ 8 loaded  ✓ 3.2 msg/min             │
│                                                    │
└────────────────────────────────────────────────────┘
2. Drill down into health check details
Click a card to see its details:
- db: connection pool size, active connections, slow query count
- realtime: list of active channels, number of subscribers
- triggers: list of trigger names with last_fired timestamp
- cluster_coverage: mapping of cluster → counselor count
- recent_activity: timestamp of the last user message and the last bot message
- feature_flags: list of flags and their values
- message_throughput: rate per 5-minute window
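To make the "rate per 5-minute window" in the message_throughput detail concrete, here is a minimal sketch of bucketing message timestamps into fixed 5-minute windows. The function name `messages_per_window` and the input shape (a list of timezone-aware datetimes) are assumptions for illustration; the source does not show how the dashboard computes this.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # 5-minute windows, as in the detail view

def messages_per_window(timestamps):
    """Count messages per 5-minute window, keyed by window start (UTC).

    Assumption: `timestamps` holds timezone-aware datetime objects.
    """
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the window containing this message.
        start = epoch - epoch % WINDOW_SECONDS
        counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return dict(sorted(counts.items()))
```

Messages at 08:00:00 and 08:04:59 fall into the 08:00 window, while a message at 08:05:00 opens the next one.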
3. Check cron jobs
The Cron Jobs section sits below the health checks:
┌─ Cron Jobs ────────────────────────────────────────┐
│ Name             Last Run      Status     Fails 24h│
│ follow_up        5 min ago     ✓ success  0        │
│ retention_sweep  2 hours ago   ✓ success  0        │
│ nightly_sync     14 hours ago  ✓ success  0        │
│ ...                                                │
└────────────────────────────────────────────────────┘
Warning:
- Fails 24h > 3 → job needs investigation
- Last Run > expected interval → scheduler stuck
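The two warning rules above can be expressed as a small check. This is a sketch under assumptions: the function name `cron_warnings` and its parameters are hypothetical, and each job's expected interval is something the caller has to supply, since the source does not list the schedules.

```python
from datetime import datetime, timedelta, timezone

MAX_FAILS_24H = 3  # from the rule above: more than 3 failures needs a look

def cron_warnings(fails_24h, last_run, expected_interval, now=None):
    """Return the warnings that apply to one cron job.

    `expected_interval` is a timedelta chosen by the caller (assumption).
    """
    now = now or datetime.now(timezone.utc)
    warnings = []
    if fails_24h > MAX_FAILS_24H:
        warnings.append("job needs investigation")
    if now - last_run > expected_interval:
        warnings.append("scheduler stuck")
    return warnings
```

For example, a job that last ran 2 hours ago with an assumed 15-minute interval would be flagged as "scheduler stuck" even if it has no recorded failures.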
4. Check config items
The Config section lists the system settings that can be edited:
| Key | Value | Editable |
|---|---|---|
| system.anthropic_model | claude-haiku-4-5-20251001 | Yes |
| system.anthropic_api_key | ✓ configured | Yes (masked) |
| system.openai_api_key | ✓ configured | Yes (masked) |
| system.intent_llm_enabled | true | Yes |
Click edit (pencil icon) to change a value. The audit log records the change automatically.
5. Service health (legacy mini-cards)
Several external services are checked via HTTP ping:
- Dify API (http://localhost:5001)
- Langfuse (http://localhost:3300)
- Valkey (cache)
- Milvus (vector DB)
Each mini-card shows a status of healthy or unhealthy, plus latency.
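For the HTTP-based services, a status-plus-latency probe can be sketched with the Python standard library. This is not the dashboard's actual probe: the function name `ping`, the 2-second default timeout, and the rule that any 2xx or 3xx response counts as healthy are assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request

def ping(url, timeout=2.0):
    """Return (status, latency_ms) for one HTTP GET against `url`.

    Assumption: any 2xx/3xx response means "healthy"; errors, timeouts,
    and 4xx/5xx responses mean "unhealthy".
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # urlopen raises HTTPError (a URLError) for 4xx/5xx responses.
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return ("healthy" if ok else "unhealthy", latency_ms)
```

Calling `ping("http://localhost:5001")` on the host would probe the Dify API address listed above; the result depends on whether the service is up.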
Example scenarios
"Inbox is not realtime" complaint. A counselor reports that messages are not arriving. The admin opens System Health and checks the realtime card. If it is unhealthy, apply the Phase 1.8 fix: run docker compose restart huph-api and verify in the logs that the socket re-attaches.
"Bot is very slow". A user complains that responses take more than 15 seconds. The admin checks Throughput and DB latency. If the DB is slow (> 500ms), check Postgres connections. If throughput is normal but the bot is still slow, check Dify API health (service health card).
Morning check routine. Every day at 08:00, the super_admin spends 30 seconds on System Health. All green → proceed to conversations. Any red → investigate immediately, before counselors come online.
Cron failed alert. The follow_up cron fails 5 times in 24 hours. The admin clicks the cron job and reads the error_message column in the DB (cron_runs table). The usual cause is an Anthropic API rate limit or an invalid follow-up template. Fix the template via the Follow-up page.
Troubleshooting
All cards load forever. Symptom: auto-refresh runs but the data stays empty. Cause: the backend endpoint /api/v1/health/full is not responding. Fix: check that the huph-api container is running (docker ps | grep huph-api). If it is down, restart it.
Realtime is healthy but the inbox still does not update. Symptom: the card is green but the UI is stale. Cause: the browser websocket disconnected. Fix: refresh the inbox page (Ctrl+R). If the problem persists, check the NEXTAUTH_SECRET env var on the API: it must be identical in the admin and API containers (per the CLAUDE.md gotcha).
Cluster coverage shows 4/5 even though 5 counselors are assigned. Symptom: a cluster appears to have a counselor in the UI, but the health check reports a gap. Cause: the counselor has is_active = false but is still assigned to the cluster. Fix: open Users settings and reactivate the counselor, or assign another one.
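The mismatch above comes down to counting only active counselors. The sketch below illustrates that logic with plain Python data; the function name `cluster_coverage`, the dictionary keys `cluster` and `is_active`, and the cluster names are hypothetical, and the real health check query is not shown in the source.

```python
def cluster_coverage(clusters, counselors):
    """Return (covered_count, uncovered_clusters).

    A cluster is covered only if at least one ACTIVE counselor is
    assigned to it; inactive assignees are ignored (assumed data shape).
    """
    covered = {c["cluster"] for c in counselors if c["is_active"]}
    uncovered = [name for name in clusters if name not in covered]
    return len(clusters) - len(uncovered), uncovered
```

With five clusters and five assigned counselors, one of whom has `is_active` set to false, this returns a count of 4 and names the cluster that needs attention.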
Feature flag check reports "missing". Symptom: the card shows a flag that failed to load. Cause: the env var is not set in the container. Fix: update /opt/huph/.env (for example: BILINGUAL_ENABLED=false), then run docker compose up -d huph-api.
See also
- Audit log: review admin actions related to health changes
- Troubleshooting: common problems and solutions across layers
- Getting started: roles and access