Knowledge Base — Chatbot Content
Purpose
This page explains how to manage HUPH's Knowledge Base (KB): where Aria pulls its answers from, how to add new sources (crawl or upload), how to evaluate answer quality, and how to detect content gaps. Intended for counselors, marketing staff, and content curation teams.
Prerequisites
- Logged in as counselor, marketing, or admin
- Understand the difference between FAQ (exact-match, ~300 ms) and KB (semantic search, ~6 seconds but far more flexible)
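The FAQ-first / KB-fallback distinction above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the `FAQ` table, `kb_semantic_search`, and `answer` names are all hypothetical, and the real KB step queries a vector DB rather than returning a placeholder.

```python
# Hypothetical sketch: FAQ is an exact-match table (fast path),
# anything unmatched falls through to semantic search over the KB.

FAQ = {
    "what is the application deadline?": "Applications close on 30 June.",
}

def kb_semantic_search(question: str) -> str:
    # Placeholder for the real vector-DB retrieval step (~6 s).
    return f"[KB answer retrieved for: {question}]"

def answer(question: str) -> str:
    key = question.strip().lower()
    if key in FAQ:                      # exact match: ~300 ms path
        return FAQ[key]
    return kb_semantic_search(key)      # semantic search: slower, flexible

print(answer("What is the application deadline?"))
print(answer("Tell me about scholarships"))
```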
Three entry paths
The KB receives content from three sources:
- Web Sources — automated crawl from a website (e.g. uph.edu)
- Documents — manual upload of PDF/DOCX/TXT
- FAQ — manual curation (handled on the FAQ page, not indexed into the vector DB)
Current crawl state: ~308 pages from uph.edu are indexed (page-sitemap, programs, about, academic, Indonesian pages).
Steps
1. Open Knowledge Base
Click Knowledge base in the sidebar. You will see four tabs: Documents, Web Sources, Evaluation, and Gaps.
2. Add a web source (crawl)
Go to Web Sources → + Add Source. Fill in:
- URL — e.g. https://www.uph.edu/program/medicine
- Type — pick one:
  - Single Page — just that URL
  - Sitemap — crawl per sitemap.xml
  - Full Site — recursive crawl from the start URL (be careful, may be many pages)
- Schedule — one-time or recurring (weekly)
Click Save, then Sync Now to crawl immediately. The crawler runs in the background with retry logic (3 attempts, exponential backoff, 180s timeout per page).
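The crawler's retry policy (3 attempts, exponential backoff, 180 s timeout per page) can be sketched like this. It is an illustrative sketch only: `crawl_with_retry` and the injected `fetch` callable are hypothetical names, not the worker's real API.

```python
import time

def crawl_with_retry(url, fetch, attempts=3, base_delay=1.0, timeout=180.0):
    """Retry `fetch(url, timeout=...)` up to `attempts` times,
    sleeping 1 s, 2 s, 4 s, ... between failures (exponential backoff)."""
    for attempt in range(attempts):
        try:
            return fetch(url, timeout=timeout)
        except Exception:
            if attempt == attempts - 1:
                raise                      # all attempts exhausted
            time.sleep(base_delay * 2 ** attempt)

# Demo with a fake fetch that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch(url, timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"

html = crawl_with_retry("https://uph.edu/page", flaky_fetch, base_delay=0.01)
print(html, calls["n"])
```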
3. Upload a document manually
Go to Documents → + Upload. Pick a file:
- PDF — official brochures, handbooks, requirement docs
- DOCX — internal draft documents
- TXT/MD — interview transcripts, manual Q&A
Click Upload. The document is processed, split into chunks, and embedded into the vector DB in ~5–30 seconds depending on size.
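The "split into chunks, then embed" step above can be sketched as a sliding window over the text. The chunk size and overlap below are illustrative assumptions, not the dashboard's actual settings, and the embedding call itself is omitted.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split `text` into fixed-size chunks with a small overlap so
    sentences spanning a boundary appear in both neighbors. Each chunk
    would then be embedded and written to the vector DB."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "UPH admission requirements ... " * 100   # stand-in for a parsed PDF
pieces = chunk_text(doc)
print(len(pieces), len(pieces[0]))
```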
4. Indexing status
Every document/page has a status:
- Pending — waiting in the crawl/embed queue
- Crawling / Indexing — being processed
- Indexed (green) — ready for the chatbot
- Failed (red) — error; hover to see the reason
If it stays Pending for more than 10 minutes, see Troubleshooting below.
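The status lifecycle above amounts to a small state machine. A sketch of the legal transitions (the transition table is an assumption inferred from the status list, not the system's actual code):

```python
# Hypothetical transition table for the indexing statuses described above.
VALID_TRANSITIONS = {
    "Pending":  {"Crawling", "Indexing", "Failed"},
    "Crawling": {"Indexing", "Failed"},
    "Indexing": {"Indexed", "Failed"},
    "Indexed":  set(),          # terminal: ready for the chatbot
    "Failed":   {"Pending"},    # re-queue after fixing the cause
}

def advance(status, new_status):
    """Move a document to `new_status`, rejecting illegal jumps."""
    if new_status not in VALID_TRANSITIONS[status]:
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status
```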
5. Evaluation — check answer quality
The Evaluation tab has two features:
Retrieval Sandbox: type a user question → see which documents Aria would retrieve to answer. Useful for debugging when users complain about inaccurate replies.
Golden QA Dataset: 21 test questions with expected answers. Click Run Eval to execute all. The eval uses Claude as judge for faithfulness, relevancy, and context precision. Current baseline: 95.2% pass rate (20/21), avg faithfulness 0.94, avg relevancy 0.93.
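The summary figures above (pass rate, average faithfulness, average relevancy) could be derived from per-question judge scores roughly as follows. The field names are assumptions for illustration; the real eval output schema may differ.

```python
def summarize_eval(results):
    """Aggregate per-question judge scores into the dashboard-style
    summary: pass rate (%), avg faithfulness, avg relevancy."""
    n = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": round(100 * passed / n, 1),
        "avg_faithfulness": round(sum(r["faithfulness"] for r in results) / n, 2),
        "avg_relevancy": round(sum(r["relevancy"] for r in results) / n, 2),
    }

# Demo shaped like the baseline described above: 20 of 21 pass.
results = ([{"passed": True, "faithfulness": 0.94, "relevancy": 0.93}] * 20
           + [{"passed": False, "faithfulness": 0.94, "relevancy": 0.93}])
summary = summarize_eval(results)
print(summary)
```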
6. Gap Detection
The Gaps tab shows topics that users frequently ask about but that the KB does not cover well. Gap detection uses automated clustering — conversations with low answer quality are grouped and surfaced here.
Example: the Gaps tab shows "graduate tuition" mentioned 12 times this week with avg relevancy 0.4. Action: add a document or web source covering that topic, or create a new FAQ.
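The gap-detection idea (group low-quality conversations and surface recurring topics) can be sketched as below. This simplifies heavily: the real system clusters conversations semantically, whereas here each conversation already carries a topic label, and the thresholds are illustrative assumptions.

```python
from collections import defaultdict

def detect_gaps(conversations, relevancy_threshold=0.6, min_mentions=5):
    """Surface topics that recur often and average a low relevancy score."""
    buckets = defaultdict(list)
    for conv in conversations:
        buckets[conv["topic"]].append(conv["relevancy"])
    gaps = []
    for topic, scores in buckets.items():
        avg = sum(scores) / len(scores)
        if len(scores) >= min_mentions and avg < relevancy_threshold:
            gaps.append({"topic": topic, "mentions": len(scores),
                         "avg_relevancy": round(avg, 1)})
    return gaps

# Demo mirroring the example above: 12 low-quality "graduate tuition" asks.
convs = ([{"topic": "graduate tuition", "relevancy": 0.4}] * 12
         + [{"topic": "admission", "relevancy": 0.9}] * 8)
print(detect_gaps(convs))
```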
Example scenarios
Add a new program source. UPH opens a new "Animation" program. The counselor goes to Web Sources → add source https://uph.edu/program/animation → Full Site crawl → Sync Now. About 5 minutes later, the document is indexed. Test in the Retrieval Sandbox: "What is UPH's animation program?" → the new document appears in the retrieval results → notify marketing that the KB is updated.
Investigate a user complaint. User complaint: "Aria answered scholarship wrong, said minimum GPA 3.5 but it's 3.2". The counselor opens the Retrieval Sandbox on the Evaluation tab → types "merit scholarship minimum GPA" → looks at which documents are retrieved → finds an old brochure that says 3.5. Fix: delete or update the old document, upload the latest version, and re-index.
Troubleshooting
Crawl stuck at Pending for more than 10 minutes. Symptom: the document makes no progress. Cause: the crawler worker is stuck or the embedding API is slow (OpenAI embeddings can take 5–10 seconds per document). Fix: wait 5 more minutes; if still stuck, contact the dev team to check docker logs huph-crawler-worker --tail 20.
Document status "Failed". Symptom: red indicator in the list. Cause: corrupt PDF, odd encoding, or file > 50 MB. Fix: hover to see the error message. If it's an encoding issue, convert to a standard PDF (Adobe) and re-upload. If oversized, split into multiple files.
Eval pass rate drops. Symptom: after a large crawl, the eval drops from the 95.2% baseline to, say, 80%. Cause: new documents may contain info that contradicts the golden dataset. Fix: contact the eval team to investigate; do not delete the new documents before the cause is understood.
Gap detection not updating. Symptom: the Gaps tab is empty even though users are complaining. Cause: gap detection runs via a daily batch job — not real-time. Fix: wait until the next day, or trigger it manually via the dev team.
See also
- FAQ — when to use FAQ vs KB
- Troubleshooting — other dashboard issues