
Debugging

Purpose

Where to look when something breaks. This page catalogs the primary logs, observability dashboards, and common recipes for the most frequent HUPH failure modes. Use it together with the operations runbooks during production incidents.

Prerequisites

  • Local setup complete, or SSH access to the production host for live debugging
  • Access to observability tools (Phoenix, Langfuse) — credentials in CREDENTIALS.md

Where to look first

API logs (Node.js)

Local dev: stdout where you ran npm run dev:api.

Production (systemd or docker):

# If running via systemd (rare)
sudo journalctl -u huph-api -f --since "5 min ago"

# If running via docker-compose (primary path)
docker-compose logs -f huph-api --tail 200

Log line format: Pino JSON with level, msg, requestId, userId, and sometimes ctx. Grep for "level":"error" to isolate errors.
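Since every Pino line is a single JSON object, plain grep/sed is enough for triage even without jq. A minimal sketch — the sample line below stands in for the real docker-compose logs stream, and the field values are illustrative:

```shell
# One Pino log line as the API would emit it (sample data, not a real request)
sample='{"level":"error","msg":"db timeout","requestId":"abc-123","userId":42}'

# Keep only error-level lines, then pull out the requestId for correlation
printf '%s\n' "$sample" \
  | grep '"level":"error"' \
  | sed 's/.*"requestId":"\([^"]*\)".*/\1/'
# prints: abc-123
```

In production, replace the sample with the live stream, e.g. docker-compose logs huph-api piped into the same filters.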

Dify logs (AI pipeline)

HUPH's AI pipeline now lives in Dify, not in a local apps/rag service. When chatbot replies go wrong, start with Dify:

Production:

# Dify API container (chat-messages + KB API)
docker-compose -f docker-compose.dify.yml logs -f dify-api --tail 200

# Dify worker (embedding + indexing jobs)
docker-compose -f docker-compose.dify.yml logs -f dify-worker --tail 200

# Dify sandbox (code execution inside workflows)
docker-compose -f docker-compose.dify.yml logs -f dify-sandbox --tail 100

Dify admin UI at https://dify.huph.val.id is usually the faster debugging tool — it shows the exact prompt sent to the LLM, which KB documents were retrieved, the annotation match (if any), and the full response. See Langfuse below for LLM-level observability.

Historical note: an earlier architecture had a self-hosted apps/rag Python service on port 3102 running BGE-M3 + reranker. It was removed in April 2026. Any reference to docker-compose logs -f huph-rag or port 3102 in older docs is obsolete.

Admin logs (Next.js)

The admin runs via systemd on port 47293 in production:

sudo journalctl -u huph-admin -f --since "5 min ago"

Local dev: stdout where you ran npm run dev:admin. Next.js prints page routes as they are compiled.

Database logs (Postgres)

docker-compose logs -f huph-postgres --tail 100

Useful for debugging triggers (pg_notify events), lock waits, and slow queries (these only appear if log_min_duration_statement is configured).
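log_min_duration_statement defaults to -1 (off), so slow queries won't show up until it's enabled. A hedged sketch, run inside docker exec -it huph-postgres psql -U huph -d huph; the 500 ms threshold is illustrative, not a project standard:

```sql
-- Log any statement that runs longer than 500 ms (illustrative threshold)
ALTER SYSTEM SET log_min_duration_statement = '500ms';
-- Apply without restarting Postgres
SELECT pg_reload_conf();
```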

Observability stack

Phoenix (Arize)

  • URL: http://localhost:6006 (locally) or production URL from CREDENTIALS
  • Purpose: OpenTelemetry traces from the API — request spans, route timings, DB query traces
  • When to use: performance regressions, N+1 query hunting, seeing which code path a request actually took

Langfuse

  • URL: https://langfuse.huph.val.id (prod)
  • Purpose: LLM call observability — prompts, completions, latencies, token usage, cost
  • When to use: debugging Aria's wrong replies, prompt engineering, cost tracking per conversation

Important history: ClickHouse (Langfuse's storage backend) was stuck in a persistent OOM restart loop (CPU pegged around 580%, dropping to ~10% after the fix). Fixed 2026-04-08 via docker/clickhouse/config.d/logs.xml: a 2 GB memory cap, noisy internal logs disabled, and a 7-day TTL. If the Langfuse dashboard goes blank or times out, check ClickHouse first. See the clickhouse-oom runbook.

Realtime health endpoint

curl -s http://localhost:3101/api/v1/health/realtime | jq

Returns pgBridge.connected, socketio.namespaces, eventCount, connected client count. If pgBridge.connected is false, the Postgres LISTEN connection is broken — realtime events won't flow.
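For a quick yes/no in a script or watch loop, the boolean can be extracted with sed alone. A sketch — the sample JSON mirrors the fields described above and stands in for the live endpoint:

```shell
# Sample payload (stand-in for: curl -s http://localhost:3101/api/v1/health/realtime)
health='{"pgBridge":{"connected":false},"socketio":{"namespaces":["/admin"]},"eventCount":0}'

# Pull the pgBridge.connected boolean out of the JSON
state=$(printf '%s' "$health" | sed 's/.*"connected":\([a-z]*\).*/\1/')

# false means the Postgres LISTEN connection is down: realtime events won't flow
[ "$state" = "true" ] || echo "ALERT: pgBridge disconnected"
```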

Common debugging recipes

Chatbot reply is blank or "I don't understand"

  1. Check Dify is up: curl -sI https://dify.huph.val.id/ (should be 200 or 302)
  2. Try the Retrieval Sandbox in admin KB → Evaluation tab with the same query
  3. If sandbox retrieves correct docs but reply is blank → check Dify chat endpoint directly:
docker exec huph-api node -e "
const a=require('axios'); const KEY=process.env.DIFY_APP_API_KEY;
a.post('http://huph-dify-api:5001/v1/chat-messages', {
  query: 'halo', user: 'test', conversation_id: '',
  inputs: {user_type:'fresh_student',bot_name:'Aria',tone:'ramah',
           address_style:'Kamu',content_focus:'program',
           guidance_rules:'',emoji_usage:'moderate',answer_length:'standard'},
  response_mode: 'blocking'
}, { headers: {'Authorization':'Bearer '+KEY}, timeout:60000 })
.then(r=>console.log('OK', (r.data.answer||'').slice(0,100)))
.catch(e=>console.log('FAIL', e.message))
"
  4. Langfuse: find the specific conversation → inspect prompt assembly and LLM output

Admin dashboard shows "Offline" in realtime indicator

  1. Check NEXTAUTH_SECRET is present in the API container env:
docker exec huph-api env | grep NEXTAUTH_SECRET

If it is missing, Socket.io clients can't decode the JWE session token and the indicator stays permanently Offline. Fix: add it to docker-compose.yml and the root .env, then restart the container.

  2. Check pgBridge is connected:
curl -s http://localhost:3101/api/v1/health/realtime | jq .pgBridge
  3. Browser DevTools → Network → WS — you should see one active connection to wss://admin.huph.val.id/socket.io/... with "101 Switching Protocols"

Webhook failing to process WhatsApp messages

  1. Check 360dialog dashboard for webhook delivery errors
  2. Check API logs for /webhook/whatsapp entries (grep the log)
  3. Test with a local curl:
curl -X POST http://localhost:3101/webhook/whatsapp \
  -H 'Content-Type: application/json' \
  -d '{"entry":[{"changes":[{"value":{"messages":[{"from":"6281234567890","text":{"body":"test"}}]}}]}]}'
  4. Check auth middleware — if API_AUTH_MODE=enforce and the request doesn't carry valid headers, you get 401. See api.en.md for auth layer details
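When eyeballing webhook payloads in the logs, the deeply nested WhatsApp Cloud API shape makes the actual message easy to miss. A sketch that pulls the sender and body out of the sample payload used in the curl above (sed-based for portability; a real pipeline would likely use jq):

```shell
# The sample payload from the curl above (WhatsApp Cloud API webhook shape)
payload='{"entry":[{"changes":[{"value":{"messages":[{"from":"6281234567890","text":{"body":"test"}}]}}]}]}'

# Extract sender number and message body for a quick sanity check
from=$(printf '%s' "$payload" | sed 's/.*"from":"\([^"]*\)".*/\1/')
body=$(printf '%s' "$payload" | sed 's/.*"body":"\([^"]*\)".*/\1/')
echo "from=$from body=$body"
# prints: from=6281234567890 body=test
```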

Timestamp shows "7 hours ago" for a fresh event

This is the timezone trigger bug (WIB vs UTC). Postgres timestamp without time zone columns serialize without an offset; the frontend parses them as WIB local time, so the label lands 7 hours behind. Fixed 2026-04-08 via an AT TIME ZONE 'UTC' cast in all 5 trigger functions. If it returns, grep the trigger definitions for naive timestamps:

grep -n "NEW.created_at\|NEW.last_message_at" scripts/migrate-*.sql

Any occurrence without AT TIME ZONE 'UTC' is suspect.
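The grep above can be sharpened to print only the suspect lines. A sketch against sample trigger SQL — the two assignments are illustrative, mirroring the fixed and unfixed forms described above:

```shell
# Sample trigger body: first assignment fixed, second still naive (illustrative)
sql="NEW.created_at := (now() AT TIME ZONE 'UTC');
NEW.last_message_at := now();"

# Show assignments that lack the UTC cast — each hit is a suspect
printf '%s\n' "$sql" | grep "NEW\." | grep -v "AT TIME ZONE 'UTC'"
# prints: NEW.last_message_at := now();
```

Run the same two-stage grep over scripts/migrate-*.sql to audit the real trigger definitions.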

SQL triggers not firing

# List triggers
docker exec huph-postgres psql -U huph -d huph -c \
  "SELECT tgname, tgrelid::regclass FROM pg_trigger WHERE NOT tgisinternal;"

# Test pg_notify manually
docker exec huph-postgres psql -U huph -d huph -c \
  "NOTIFY huph_events, '{\"type\":\"test\"}';"

# Watch pgBridge logs
docker logs huph-api --tail 10 | grep pgBridge

Security-first debugging rule

Before proposing any fix for a resource anomaly (high CPU, memory spike, unusual traffic), verify the system is not compromised with evidence first. Check:

  • docker stats for unexpected container activity
  • sudo last / sudo lastb for login attempts
  • sudo iptables -L -n -v for traffic anomalies
  • Recent commits for unauthorized changes

NEVER execute TRUNCATE / restart / config change on production without explicit go-ahead from the on-call. The Apr 8 ClickHouse OOM incident taught us that what looks like a simple resource issue can have a less-obvious cause. Document findings before acting.

Gotchas

  • (Legacy) The removed apps/rag service's /health returned 503 for ~90 seconds after startup while models loaded; the fix was to wait for the "Uvicorn running" line in docker logs. Obsolete since the service's removal in April 2026; see the historical note under Dify logs.
  • docker-compose logs -f can buffer. If you see no output, force a log flush by interacting with the service (curl health, etc.).
  • Jest integration tests can hang if a httpServer.close() afterAll doesn't resolve. Known pre-existing issue in realtime.integration.test.ts. Exclude it: --testPathIgnorePatterns=realtime.integration.
  • Phoenix and Langfuse are separate stacks. They don't share state. Traces (Phoenix) and LLM calls (Langfuse) need to be correlated manually via conversation_id.

See also