Answer Evaluation — Measure Bot Response Quality

Purpose

This page explains how to use Answer Evaluation to quantitatively measure UPH chatbot answer quality. It is intended for admins who validate persona, FAQ, or KB changes before release, or who monitor quality on a regular cadence.

The dashboard runs a set of golden questions (exam questions with known correct answers) through the chatbot, then uses an LLM judge (Claude Haiku) to score each answer across 4 metrics.

Prerequisites

Role: super_admin, system_admin, or faculty_admin
Access to https://admin.huph.val.id
Budget: ~$0.05-0.10 per eval run (paid from Anthropic API budget)
Golden questions populated (ID: 62, EN: 43 as of 2026-04-23)

Metric concepts

Each answer is scored across 4 dimensions:

Metric	Meaning	Target
Faithfulness	Is the answer grounded in the knowledge base?	≥ 0.70
Answer Relevancy	Is the answer relevant to the question?	≥ 0.70
Context Precision	Are retrieved chunks relevant?	≥ 0.70
Context Recall	Does retrieval capture needed info?	≥ 0.70

Pass criterion: faithfulness ≥ 0.70 AND answer_relevancy ≥ 0.70. Pass rate = % of questions meeting this criterion.
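The pass criterion can be sketched in a few lines of Python. This is a minimal illustration of the rule stated above, not the dashboard's actual implementation; the `QuestionScore` class, its field names, and the sample scores are assumptions made for the example.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # threshold stated on this page


@dataclass
class QuestionScore:
    # Field names are illustrative, not the dashboard's actual schema.
    faithfulness: float
    answer_relevancy: float
    context_precision: float
    context_recall: float


def passed(score: QuestionScore) -> bool:
    """A question passes only if both answer-side metrics meet the threshold."""
    return (score.faithfulness >= PASS_THRESHOLD
            and score.answer_relevancy >= PASS_THRESHOLD)


def pass_rate(scores: list[QuestionScore]) -> float:
    """Percentage of questions that pass, rounded to two decimals."""
    if not scores:
        return 0.0
    return round(100 * sum(passed(s) for s in scores) / len(scores), 2)


# Made-up scores for illustration.
scores = [
    QuestionScore(0.90, 0.95, 0.40, 0.50),  # passes: retrieval metrics are not part of the criterion
    QuestionScore(0.65, 0.92, 0.90, 0.90),  # fails: faithfulness below 0.70
    QuestionScore(0.70, 0.70, 0.80, 0.80),  # passes: the threshold is inclusive
    QuestionScore(0.88, 0.60, 0.85, 0.85),  # fails: answer relevancy below 0.70
]
print(pass_rate(scores))  # 50.0
```

Note that Context Precision and Context Recall have their own targets but do not affect whether a question passes.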

Steps

1. Open the dashboard

Sidebar: KNOWLEDGE → Answer Eval. URL: /knowledge/eval/dashboard.

Shows:
- Latest Run card: most recent eval with pass rate
- Runs list: history of runs with status, duration, pass rate
- Golden Questions tab: manage the question corpus

2. Understand the latest run

Example display:

 ┌─ Latest Run: phase-a5-v2-baseline-id ──────────────┐
 │  Status: ✓ Completed  •  Duration: 8m 30s          │
 │  Questions: 62  •  Passed: 44  •  Pass rate: 70.97%│
 │                                                    │
 │  Avg Faithfulness:       0.782                     │
 │  Avg Answer Relevancy:   0.910                     │
 │  Avg Context Precision:  0.814                     │
 │  Avg Context Recall:     0.798                     │
 └────────────────────────────────────────────────────┘

3. See per-question details

Click on a run → detail page shows a table with each question's bot answer, per-metric scores, and judge reasoning.

Filters:
- Status: passed / failed
- Category: admission / fee / scholarship / program / general
- Language: ID / EN

Use filters to identify weak categories (e.g., all "fee" questions fail → KB needs more fee-related content).
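The same weak-category check can be done offline on per-question results. The sketch below assumes each result is a dict with hypothetical `category` and `passed` keys; the real detail table may expose the data differently.

```python
from collections import defaultdict


def pass_rate_by_category(rows):
    """Return {category: pass rate in percent} for a list of result rows."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        passes[row["category"]] += int(row["passed"])
    return {cat: round(100 * passes[cat] / totals[cat], 1) for cat in totals}


# Made-up results; key names are assumptions for this example.
rows = [
    {"category": "admission", "passed": True},
    {"category": "admission", "passed": True},
    {"category": "fee", "passed": False},
    {"category": "fee", "passed": False},
    {"category": "fee", "passed": True},
    {"category": "scholarship", "passed": True},
]

rates = pass_rate_by_category(rows)
weakest = min(rates, key=rates.get)  # category with the lowest pass rate
print(rates, weakest)
```

With this sample data the weakest category is "fee", which would suggest adding fee-related content to the KB.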

4. Manage golden questions

Golden Questions tab:

Add question: write question + expected answer + category + difficulty (easy/medium/hard)
Edit/deactivate: stale questions (e.g., fees changed) should be deactivated so they don't bias results
Import CSV: bulk-add from spreadsheet

Choose representative questions

A good golden set mixes difficulty levels and categories. Avoid too many variations of the same question (e.g., 10 variations of "Medicine tuition"; one is enough).
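One way to check the mix before importing is to count questions per category and per difficulty. This is a rough sketch; the field names, the sample questions, and the 50% `max_share` cutoff are all assumptions chosen for illustration, not rules from the dashboard.

```python
from collections import Counter

# Hypothetical golden questions; field names are illustrative only.
golden = [
    {"question": "When does admission open?", "category": "admission", "difficulty": "easy"},
    {"question": "How much is Medicine tuition?", "category": "fee", "difficulty": "easy"},
    {"question": "Medicine tuition per semester?", "category": "fee", "difficulty": "easy"},
    {"question": "What is the cost of studying Medicine?", "category": "fee", "difficulty": "medium"},
    {"question": "Which scholarships cover full tuition?", "category": "scholarship", "difficulty": "hard"},
]


def distribution(questions, field):
    """Share of questions (in percent) for each value of the given field."""
    counts = Counter(q[field] for q in questions)
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.items()}


def overrepresented(questions, field, max_share=50.0):
    """Values whose share exceeds max_share percent (an arbitrary cutoff)."""
    return [v for v, share in distribution(questions, field).items() if share > max_share]


print(distribution(golden, "category"))
print(overrepresented(golden, "category"))
print(overrepresented(golden, "difficulty"))
```

Here three of five questions are fee questions about Medicine tuition, so "fee" is flagged; trimming the near-duplicates would rebalance the set.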

5. Run a new eval

Click Run New Eval (top right).

Config modal:
- Name: label for this run (e.g., "post-cass-kb-update-2026-04-23")
- Language: ID, EN, or both
- Category (optional): limit to one category
- Top K: retrieval chunk count (default 5)

Click Start. The eval runs in the background (3–15 min, depending on question count). You can close the page; results persist.

6. Compare two runs

Click the GitCompare icon in the runs list and select 2 runs. The dashboard shows:
- Delta per metric (↑/↓/━)
- Questions that flipped (pass→fail or fail→pass)
- Judge reasoning for flip cases

Use this whenever you:
- Update FAQs → run + compare to baseline
- Change chatbot persona → check for regression
- Upload new KB documents → validate coverage
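The comparison logic can be approximated as follows. This is only a sketch of the idea: the baseline averages are taken from the example run card on this page, while the candidate averages, the question IDs, and the `tolerance` used to decide when a delta counts as unchanged (━) are assumptions.

```python
def metric_deltas(baseline, candidate, tolerance=0.005):
    """Return {metric: (delta, arrow)} comparing candidate to baseline averages."""
    deltas = {}
    for metric, old in baseline.items():
        diff = round(candidate[metric] - old, 3)
        if diff > tolerance:
            arrow = "↑"
        elif diff < -tolerance:
            arrow = "↓"
        else:
            arrow = "━"
        deltas[metric] = (diff, arrow)
    return deltas


def flipped_questions(baseline_pass, candidate_pass):
    """Return {question_id: direction} for questions whose pass status changed."""
    flips = {}
    for qid in baseline_pass.keys() & candidate_pass.keys():
        if baseline_pass[qid] != candidate_pass[qid]:
            flips[qid] = "pass→fail" if baseline_pass[qid] else "fail→pass"
    return flips


baseline = {"faithfulness": 0.782, "answer_relevancy": 0.910,
            "context_precision": 0.814, "context_recall": 0.798}
# Hypothetical averages for a later run.
candidate = {"faithfulness": 0.801, "answer_relevancy": 0.910,
             "context_precision": 0.790, "context_recall": 0.798}

deltas = metric_deltas(baseline, candidate)
flips = flipped_questions(
    {"q3": True, "q7": True, "q12": False},   # baseline pass status (made up)
    {"q3": True, "q7": False, "q12": True},   # candidate pass status (made up)
)
print(deltas)
print(flips)
```

Only questions present in both runs are compared, so questions added to or removed from the golden set between runs do not show up as flips.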

7. Export CSV for reporting

On the run detail page, click Download CSV. The file contains:
- Question, bot_answer, expected_answer
- All 4 metrics
- Judge reasoning
- Latency per question

Good for weekly/monthly reports to UPH leadership.
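For a report, the export can be summarized with Python's standard csv module. The sketch below is hedged: only question, bot_answer, and expected_answer are column names mentioned on this page; the metric column names, `judge_reasoning`, `latency_ms` (including the millisecond unit), and the sample rows are assumptions, so adjust them to the header of the file you actually download.

```python
import csv
import io
from statistics import mean

# Assumed metric column names; check them against the real export header.
METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

# Stand-in for a downloaded export (made-up rows, assumed header).
SAMPLE_EXPORT = """question,bot_answer,expected_answer,faithfulness,answer_relevancy,context_precision,context_recall,judge_reasoning,latency_ms
Q1,A1,E1,0.80,0.90,0.75,0.70,grounded in KB,1200
Q2,A2,E2,0.60,0.90,0.85,0.90,cites a fee not in KB,1800
"""


def summarize(csv_text):
    """Average each metric and the latency over all rows of the export."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {m: round(mean(float(r[m]) for r in rows), 3) for m in METRICS}
    summary["avg_latency_ms"] = round(mean(float(r["latency_ms"]) for r in rows), 1)
    summary["questions"] = len(rows)
    return summary


summary = summarize(SAMPLE_EXPORT)
print(summary)
```

To summarize a real file, read it first (for example with `open(path, newline="", encoding="utf-8").read()`) and pass the text to `summarize`. An export with no rows would raise `statistics.StatisticsError`, so guard for that if empty runs are possible.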

Example scenarios

Pre-release validation for big FAQ batch. Marketing admin writes 20 new FAQs about scholarships. Before activating, run eval on "scholarship" category golden set. If pass rate ≥ 70%, safe to publish. If < 70%, review failing questions — the new FAQs may need rewriting.

Monthly routine audit. End of each month, super_admin runs full corpus eval (ID + EN). Export CSV, include in report to Director of Admission. Month-over-month trend becomes operational signal.

Debug quality regression. Users complain that the bot is "answering randomly". Admin opens the eval dashboard and compares the latest run to one from 2 weeks earlier. If pass rate dropped > 5pp, find the regressed category and focus improvements there.
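The "dropped > 5pp" check in the last scenario is a difference in percentage points, not a relative change. A minimal sketch, assuming per-category pass rates for both runs are already available as dicts (the numbers below are made up):

```python
def regressed_categories(before, after, threshold_pp=5.0):
    """Categories whose pass rate fell by more than threshold_pp points, worst first."""
    dropped = [cat for cat in before
               if cat in after and before[cat] - after[cat] > threshold_pp]
    return sorted(dropped, key=lambda cat: after[cat] - before[cat])


# Hypothetical per-category pass rates (percent) for two runs.
two_weeks_ago = {"admission": 80.0, "fee": 75.0, "scholarship": 70.0}
latest = {"admission": 78.0, "fee": 50.0, "scholarship": 62.0}

print(regressed_categories(two_weeks_ago, latest))  # ['fee', 'scholarship']
```

In this example "admission" fell by only 2 points and is not flagged, while "fee" (25 points) and "scholarship" (8 points) are, with the largest drop listed first.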

Troubleshooting

Eval run stuck at "running" > 30 min. Symptom: status unchanged past normal duration. Cause: Anthropic API throttling or a Dify backend stall. Fix: check the huph-api logs (docker logs huph-api --tail 100). If you see 429 errors (rate limit), wait 10 min and retry. For other errors, escalate to a developer.

Sudden pass rate drop. Symptom: a new run scores 30%+ lower than the previous one with no FAQ changes. Cause: the golden corpus changed (new hard questions added) or the retrieval context changed (KB updated). Fix: compare to the previous run via GitCompare and see which questions flipped; those point to the root cause.

Low Faithfulness but high Relevancy. Meaning: bot answers on topic but invents facts. Fix: check KB for missing chunks, add documents/FAQs so retrieval has grounding. Don't rely on prompt alone.