Answer Evaluation — Measure Bot Response Quality
Purpose
This page explains how to use Answer Evaluation to quantitatively measure UPH chatbot answer quality. It is intended for admins who validate persona, FAQ, or KB changes before release, or who monitor quality on a regular cadence.
The dashboard runs a set of golden questions (exam questions with known correct answers) through the chatbot, then uses an LLM judge (Claude Haiku) to score each answer across 4 metrics.
Prerequisites
- Role super_admin, system_admin, or faculty_admin
- Access to https://admin.huph.val.id
- Budget ~$0.05-0.10 per eval run (paid from Anthropic API budget)
- Golden questions populated (ID: 62, EN: 43 as of 2026-04-23)
Metric concepts
Each answer is scored across 4 dimensions:
| Metric | Meaning | Target |
|---|---|---|
| Faithfulness | Is the answer grounded in the knowledge base? | ≥ 0.70 |
| Answer Relevancy | Is the answer relevant to the question? | ≥ 0.70 |
| Context Precision | Are retrieved chunks relevant? | ≥ 0.70 |
| Context Recall | Does retrieval capture needed info? | ≥ 0.70 |
Pass criterion: faithfulness ≥ 0.70 AND answer_relevancy ≥ 0.70.
Pass rate = % of questions meeting this criterion.
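The pass criterion and pass rate above can be sketched in a few lines of Python. This is an illustration only, not the dashboard's actual implementation: the `Scores` class and the function names are hypothetical, and only the 0.70 threshold and the two gating metrics come from this page.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # target value from the metric table


@dataclass
class Scores:
    """Hypothetical container for the four per-question judge scores."""
    faithfulness: float
    answer_relevancy: float
    context_precision: float
    context_recall: float


def passes(s: Scores, threshold: float = PASS_THRESHOLD) -> bool:
    # Only faithfulness and answer relevancy gate the pass/fail decision;
    # the two context metrics are reported but do not affect it.
    return s.faithfulness >= threshold and s.answer_relevancy >= threshold


def pass_rate(results: list[Scores]) -> float:
    """Percentage of questions that pass, rounded to two decimals."""
    if not results:
        return 0.0
    passed = sum(passes(s) for s in results)
    return round(100 * passed / len(results), 2)
```

With 44 passing questions out of 62, `pass_rate` returns 70.97, which matches the figure in the example Latest Run card.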
Steps
1. Open the dashboard
Sidebar: KNOWLEDGE → Answer Eval. URL: /knowledge/eval/dashboard.
Shows:
- Latest Run card: most recent eval with pass rate
- Runs list: history of runs with status, duration, pass rate
- Golden Questions tab: manage the question corpus
2. Understand the latest run
Example display:
┌─ Latest Run: phase-a5-v2-baseline-id ──────────────┐
│ Status: ✓ Completed  •  Duration: 8m 30s           │
│ Questions: 62  •  Passed: 44  •  Pass rate: 70.97% │
│                                                    │
│ Avg Faithfulness:       0.782                      │
│ Avg Answer Relevancy:   0.910                      │
│ Avg Context Precision:  0.814                      │
│ Avg Context Recall:     0.798                      │
└────────────────────────────────────────────────────┘
3. See per-question details
Click on a run → detail page shows a table with each question's bot answer, per-metric scores, and judge reasoning.
Filters:
- Status: passed / failed
- Category: admission / fee / scholarship / program / general
- Language: ID / EN
Use filters to identify weak categories (e.g., all "fee" questions fail → KB needs more fee-related content).
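The same weak-category check can be done offline once you have per-question results. The sketch below assumes you have already reduced each question to a `(category, passed)` pair; both function names and the 70% floor used as a default are illustrative assumptions, not part of the dashboard.

```python
from collections import defaultdict


def pass_rate_by_category(rows):
    """rows: iterable of (category, passed) pairs -> {category: pass rate in %}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in rows:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: round(100 * passed[c] / totals[c], 1) for c in totals}


def weakest_categories(rows, floor=70.0):
    """Categories whose pass rate is below `floor`, worst first."""
    rates = pass_rate_by_category(rows)
    return sorted((c for c, r in rates.items() if r < floor), key=rates.get)
```

A category at the top of the `weakest_categories` list is a reasonable first place to look for missing KB content.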
4. Manage golden questions
Golden Questions tab:
- Add question: write question + expected answer + category + difficulty (easy/medium/hard)
- Edit/deactivate: stale questions (e.g., fees changed) should be deactivated so they don't bias results
- Import CSV: bulk-add from spreadsheet
Choose representative questions
A good golden set = mix of difficulty + mix of categories. Avoid too many variations of the same question (e.g., 10 variations of "Medicine tuition" — one is enough).
5. Run a new eval
Click Run New Eval (top right).
Config modal:
- Name: label for this run (e.g., "post-cass-kb-update-2026-04-23")
- Language: ID, EN, or both
- Category (optional): limit to one category
- Top K: retrieval chunk count (default 5)
Click Start. The eval runs in the background and takes 3–15 min, depending on question count. You can close the page; results persist.
6. Compare two runs
Click the GitCompare icon in the runs list. Select 2 runs → dashboard shows:
- Delta per metric (↑/↓/━)
- Questions that flipped (pass→fail or fail→pass)
- Judge reasoning for flip cases
Use this whenever you:
- Update FAQs → run + compare to baseline
- Change chatbot persona → check for regression
- Upload new KB documents → validate coverage
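To make the comparison concrete, here is a minimal sketch of how flipped questions and metric deltas could be derived from two runs. It is not the dashboard's code: the input shape (a mapping from a question id to a pass flag), the function names, and the `eps` tolerance that decides when a change counts as flat are all assumptions made for illustration.

```python
def compare_runs(baseline, candidate):
    """Find questions whose pass/fail status flipped between two runs.

    baseline, candidate: {question_id: passed (bool)}.
    Only questions present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    regressed = sorted(q for q in shared if baseline[q] and not candidate[q])
    fixed = sorted(q for q in shared if not baseline[q] and candidate[q])
    return {"pass_to_fail": regressed, "fail_to_pass": fixed}


def metric_delta(before, after, eps=0.005):
    """Return (difference, arrow) for one averaged metric.

    eps is an assumed tolerance below which the change is shown as flat.
    """
    diff = after - before
    if diff > eps:
        arrow = "↑"
    elif diff < -eps:
        arrow = "↓"
    else:
        arrow = "━"
    return round(diff, 3), arrow
```

Restricting the comparison to shared questions keeps newly added golden questions from showing up as regressions.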
7. Export CSV for reporting
On the run detail page, click Download CSV. The file contains:
- Question, bot_answer, expected_answer
- All 4 metric scores
- Judge reasoning
- Latency per question
Good for weekly/monthly reports to UPH leadership.
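For recurring reports, the exported file can be summarized with a short script. The sketch below is hedged: the column names `faithfulness`, `answer_relevancy`, `context_precision`, and `context_recall` are assumptions about the CSV header, so check them against a real export before relying on it.

```python
import csv
from statistics import mean

# Assumed column names; verify against the header of an actual export.
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def summarize_export(lines):
    """Summarize an eval export.

    lines: an open text file or any iterable of CSV lines (header first).
    Returns question count, per-metric averages, and pass rate in %.
    """
    rows = list(csv.DictReader(lines))
    if not rows:
        return {"questions": 0}
    summary = {"questions": len(rows)}
    for m in METRICS:
        summary[f"avg_{m}"] = round(mean(float(r[m]) for r in rows), 3)
    # Same pass criterion as the dashboard: both gating metrics >= 0.70.
    passed = sum(
        float(r["faithfulness"]) >= 0.70 and float(r["answer_relevancy"]) >= 0.70
        for r in rows
    )
    summary["pass_rate"] = round(100 * passed / len(rows), 2)
    return summary
```

For a downloaded file, pass the open handle, for example `summarize_export(open("run.csv", newline="", encoding="utf-8"))`, where `run.csv` is a placeholder file name.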
Example scenarios
Pre-release validation for big FAQ batch. Marketing admin writes 20 new FAQs about scholarships. Before activating, run eval on "scholarship" category golden set. If pass rate ≥ 70%, safe to publish. If < 70%, review failing questions — the new FAQs may need rewriting.
Monthly routine audit. End of each month, super_admin runs full corpus eval (ID + EN). Export CSV, include in report to Director of Admission. Month-over-month trend becomes operational signal.
Debug quality regression. User complaints about "bot answering randomly". Admin opens eval dashboard → compare latest run to 2 weeks ago. If pass rate dropped > 5pp, find the regressed category → focus improvements there.
Troubleshooting
Eval run stuck at "running" > 30 min. Symptom: status unchanged past normal duration. Cause: Anthropic API throttling or Dify backend stall. Fix: check huph-api logs (docker logs huph-api --tail 100). If you see 429 errors (rate limit), wait 10 min and retry. For any other error, escalate to a developer.
Sudden pass rate drop. Symptom: new run 30%+ lower than before with no FAQ changes. Cause: golden corpus changed (new hard questions added) or retrieval context changed (KB updated). Fix: compare to the previous run via GitCompare and see which questions flipped; the flipped questions and their judge reasoning point to the root cause.
Low Faithfulness but high Relevancy. Meaning: bot answers on topic but invents facts. Fix: check KB for missing chunks, add documents/FAQs so retrieval has grounding. Don't rely on prompt alone.
See also
- Knowledge Gaps — auto-detect areas missing from the KB
- FAQ — how to add/edit FAQs that appear in golden set
- Knowledge base — manage source documents
- Bot configuration — persona and rules