System Health: Platform Operations Monitor

Purpose

This page explains how to use the System Health dashboard to monitor the operational health of the HUPH platform: database, realtime socket, cron jobs, feature flags, message throughput, and AI integrations. It is the first checkpoint when something feels wrong ("bot is slow", "inbox is not realtime", "notifications are not showing up").

The dashboard auto-refreshes every 30 seconds.

Prerequisites

Role super_admin or system_admin
Access to https://admin.huph.val.id/settings/system

Concept: 7 health checks

Check	Meaning	Normal	Warning
db	PostgreSQL connection	healthy, latency < 50ms	down / lag > 500ms
realtime	Socket.io broker (Postgres LISTEN)	healthy, channels active	disconnected or a channel is down
triggers	Postgres triggers for realtime	all triggers present	missing trigger → realtime broken
cluster_coverage	All 5 clusters have a counselor	5/5 coverage	< 5/5, a cluster has no assignee
recent_activity	Recent user/bot activity exists	messages within the last hour	silent > 6 hours (possibly broken)
feature_flags	Flag config loaded	all flags readable	config file missing
message_throughput	Rate in msgs/hour	normal range	significant spike/drop

Overall status: healthy if all checks pass, degraded if one or more checks fail.
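The overall-status rule can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the check names come from the table above, while the dict-of-booleans input shape and the choice to treat a missing check as failed are assumptions.

```python
# Check names are taken from the table above; the input shape is an assumption.
CHECKS = (
    "db", "realtime", "triggers", "cluster_coverage",
    "recent_activity", "feature_flags", "message_throughput",
)

def overall_status(results: dict) -> str:
    """Return 'healthy' only if every known check passed, else 'degraded'."""
    # A check that is absent from the results counts as failed (assumption).
    failed = [name for name in CHECKS if not results.get(name, False)]
    return "healthy" if not failed else "degraded"

all_ok = {name: True for name in CHECKS}
print(overall_status(all_ok))                          # healthy
print(overall_status({**all_ok, "realtime": False}))   # degraded
```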

Steps

1. Open the health dashboard

Sidebar: ADMIN → System Health (or /settings/system).

The dashboard displays cards:

 ┌─ System Health ───────── ● Healthy ────── ⟳ 30s ──┐
 │                                                    │
 │  DB          Realtime     Triggers    Cluster     │
 │  ✓ 12ms      ✓ 3 chans    ✓ 4/4       ✓ 5/5       │
 │                                                    │
 │  Activity    Flags        Throughput               │
 │  ✓ 42 msg/h  ✓ 8 loaded   ✓ 3.2 msg/min            │
 │                                                    │
 └────────────────────────────────────────────────────┘

2. Drill down into health check details

Click a card to see its details:
- db: connection pool size, active connections, slow query count
- realtime: list of active channels, number of subscribers
- triggers: list of trigger names with last_fired timestamp
- cluster_coverage: mapping of cluster → counselor count
- recent_activity: timestamp of the last user message and the last bot message
- feature_flags: list of flags and their values
- message_throughput: rate per 5-minute window
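To make the "rate per 5-minute window" detail concrete, here is a hedged sketch of how message timestamps could be bucketed. The function name and the input (a list of timezone-aware datetimes) are illustrative assumptions; the dashboard's real aggregation may differ.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # 5-minute windows, as shown on the detail card

def throughput_per_window(timestamps):
    """Count messages per 5-minute window, keyed by window start (UTC)."""
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the containing 5-minute window.
        start = epoch - epoch % WINDOW_SECONDS
        counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return dict(sorted(counts.items()))

# Made-up sample timestamps for illustration only.
sample = [
    datetime(2025, 1, 1, 8, 0, 30, tzinfo=timezone.utc),
    datetime(2025, 1, 1, 8, 4, 59, tzinfo=timezone.utc),
    datetime(2025, 1, 1, 8, 5, 0, tzinfo=timezone.utc),
]
windows = throughput_per_window(sample)
for start, count in windows.items():
    print(start.isoformat(), count)
```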

3. Check cron jobs

The Cron Jobs section sits below the health checks:

 ┌─ Cron Jobs ────────────────────────────────────────┐
 │ Name              Last Run      Status    Fails 24h│
 │ follow_up         5 min ago     ✓ success   0      │
 │ retention_sweep   2 hours ago   ✓ success   0      │
 │ nightly_sync      14 hours ago  ✓ success   0      │
 │ ...                                                │
 └────────────────────────────────────────────────────┘

Warning signs:
- Fails 24h > 3 → the job needs investigation
- Last Run > expected interval → the scheduler is stuck
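The two warning rules can be expressed as a small check. This is only a sketch: the CronJob record and its expected_interval field are hypothetical (the dashboard does not show expected intervals), and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_FAILS_24H = 3  # threshold from the warning rules above

@dataclass
class CronJob:
    name: str
    last_run: datetime
    fails_24h: int
    expected_interval: timedelta  # hypothetical field, not shown on the dashboard

def cron_warnings(jobs, now):
    """Return one human-readable warning per violated rule."""
    warnings = []
    for job in jobs:
        if job.fails_24h > MAX_FAILS_24H:
            warnings.append(f"{job.name}: {job.fails_24h} failures in 24h, needs investigation")
        if now - job.last_run > job.expected_interval:
            warnings.append(f"{job.name}: last run is overdue, scheduler may be stuck")
    return warnings

now = datetime(2025, 1, 1, 8, 0)
jobs = [
    CronJob("follow_up", now - timedelta(minutes=5), 0, timedelta(minutes=15)),
    CronJob("nightly_sync", now - timedelta(hours=30), 5, timedelta(hours=24)),
]
for warning in cron_warnings(jobs, now):
    print(warning)
```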

4. Check config items

The Config section shows system settings that can be edited:

Key	Value	Editable
`system.anthropic_model`	claude-haiku-4-5-20251001	Yes
`system.anthropic_api_key`	✓ configured	Yes (masked)
`system.openai_api_key`	✓ configured	Yes (masked)
`system.intent_llm_enabled`	true	Yes

Click edit (pencil icon) to change a value. The audit log records the change automatically.

5. Service health (legacy mini-cards)

Several external services are checked via HTTP ping:
- Dify API (http://localhost:5001)
- Langfuse (http://localhost:3300)
- Valkey (cache)
- Milvus (vector DB)

Each service shows a status of healthy or unhealthy, plus latency.
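A status-plus-latency ping like the one these mini-cards report could look roughly as follows. This is a sketch using only the Python standard library; the ping function, its 3-second timeout, and the injectable opener parameter are assumptions, and the path each service actually exposes for health checks is not stated in this page.

```python
import time
import urllib.error
import urllib.request

def ping(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return (status, latency_ms) for a single HTTP GET."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        ok = False
    latency_ms = (time.monotonic() - started) * 1000
    return ("healthy" if ok else "unhealthy", round(latency_ms, 1))
```

Calling ping("http://localhost:5001") on the host would then return a pair such as a status string and the measured latency in milliseconds. The opener parameter exists only so the function can be exercised without a live service.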

Example scenarios

"Inbox is not realtime" complaint. A counselor reports that messages are not arriving. The admin opens System Health and checks the realtime card. If it is unhealthy, trigger the Phase 1.8 fixer: run docker compose restart huph-api and verify the socket re-attach in the logs.

"Bot is very slow". A user complains that responses take > 15 seconds. The admin checks Throughput and DB latency. If the DB is slow (> 500ms), check the Postgres connections. If throughput is normal but the bot is still slow, check Dify API health (service health card).

Morning check routine. Every day at 08:00, the super_admin opens System Health for 30 seconds. All green → proceed to conversations. Any red → investigate immediately, before counselors come online.

Cron failed alert. The follow_up cron failed 5 times in 24 hours. The admin clicks the cron job and looks at the error_message column in the DB (cron_runs table). The cause is usually an Anthropic API rate limit or an invalid follow-up template. Fix the template via the Follow-up page.
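The slow-bot scenario above boils down to a short decision tree, sketched here for illustration. The function name and inputs are hypothetical, and the final branch (abnormal throughput) is not covered by the scenario, so it is an assumption.

```python
DB_SLOW_MS = 500  # "DB slow" threshold from the slow-bot scenario

def next_step_for_slow_bot(db_latency_ms: float, throughput_normal: bool) -> str:
    """Suggest the next thing to check when users report slow bot replies."""
    if db_latency_ms > DB_SLOW_MS:
        return "check Postgres connections"
    if throughput_normal:
        return "check Dify API health (service health card)"
    # Not described in the scenario; an assumed fallback.
    return "inspect the message_throughput card"

print(next_step_for_slow_bot(800, True))   # check Postgres connections
print(next_step_for_slow_bot(12, True))    # check Dify API health (service health card)
```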

Troubleshooting

All cards load forever. Symptom: auto-refresh runs but the data stays empty. Cause: the backend /api/v1/health/full does not respond. Fix: check that the huph-api container is running (docker ps | grep huph-api). If it is down, restart it.

Realtime is healthy but the inbox still does not update. Symptom: the card is green but the UI is stale. Cause: the browser websocket disconnected. Fix: refresh the inbox page (Ctrl+R). If the problem persists, check the NEXTAUTH_SECRET env in the API: it must be the same in the admin and API containers (per the CLAUDE.md gotcha).
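One way to confirm that both containers hold the same NEXTAUTH_SECRET without printing the secret itself is to compare the values directly and show only a short fingerprint. This is a sketch: how the two environment dicts are obtained is left open, and the helper names are hypothetical.

```python
import hashlib
import hmac

def fingerprint(secret: str) -> str:
    """Short, non-reversible fingerprint that is safe to show in a terminal."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]

def secrets_match(admin_env: dict, api_env: dict, key: str = "NEXTAUTH_SECRET") -> bool:
    """True only if both environments define the key with identical values."""
    admin_value = admin_env.get(key)
    api_value = api_env.get(key)
    if not admin_value or not api_value:
        return False
    # Constant-time comparison of the two values.
    return hmac.compare_digest(admin_value.encode("utf-8"), api_value.encode("utf-8"))

# Made-up values for illustration only.
admin_env = {"NEXTAUTH_SECRET": "example-secret-a"}
api_env = {"NEXTAUTH_SECRET": "example-secret-b"}
print(secrets_match(admin_env, api_env))  # False
print(fingerprint(admin_env["NEXTAUTH_SECRET"]) == fingerprint(api_env["NEXTAUTH_SECRET"]))  # False
```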

Cluster coverage is 4/5 even though 5 counselors are assigned. Symptom: a cluster has a counselor according to the UI, but the health check reports a gap. Cause: a counselor has is_active = false but is still assigned to the cluster. Fix: open Users settings and reactivate the counselor or assign another one.
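The gap appears because only active counselors count toward coverage. A minimal sketch of that logic, assuming assignments arrive as (cluster, counselor, is_active) tuples; the cluster and counselor names below are placeholders, not real data.

```python
def cluster_coverage(clusters, assignments):
    """Return ('covered/total', list_of_uncovered_clusters).

    assignments: iterable of (cluster, counselor_name, is_active) tuples.
    """
    # Only counselors with is_active = True count toward coverage.
    covered = {cluster for cluster, _name, is_active in assignments if is_active}
    gaps = [cluster for cluster in clusters if cluster not in covered]
    return f"{len(clusters) - len(gaps)}/{len(clusters)}", gaps

clusters = ["A", "B", "C", "D", "E"]  # placeholder cluster names
assignments = [
    ("A", "counselor_1", True),
    ("B", "counselor_2", True),
    ("C", "counselor_3", True),
    ("D", "counselor_4", True),
    ("E", "counselor_5", False),  # assigned but inactive
]
print(cluster_coverage(clusters, assignments))  # ('4/5', ['E'])
```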

Feature flag check reports "missing". Symptom: the card shows a flag that did not load. Cause: the env var is not set in the container. Fix: update /opt/huph/.env (for example: BILINGUAL_ENABLED=false), then run docker compose up -d huph-api.
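Before restarting the container, the .env file can be checked for missing flags with a short script. This is a sketch that handles only simple KEY=VALUE lines; BILINGUAL_ENABLED comes from the example above, while EXAMPLE_FLAG is a hypothetical name used to show a missing entry.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_flags(env_text: str, required) -> list:
    """Return the required flag names that the env text does not define."""
    env = parse_env(env_text)
    return [name for name in required if name not in env]

env_text = "# feature flags\nBILINGUAL_ENABLED=false\n"
print(missing_flags(env_text, ["BILINGUAL_ENABLED", "EXAMPLE_FLAG"]))  # ['EXAMPLE_FLAG']
```

In practice env_text would be read from /opt/huph/.env, and the list of required names would come from whatever flags the deployment expects.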