
Knowledge Base — Chatbot Content

Purpose

This page explains how to manage HUPH's Knowledge Base (KB): where Aria pulls its answers from, how to add new sources (crawl or upload), how to evaluate answer quality, and how to detect content gaps. Intended for counselors, marketing staff, and content curation teams.

Prerequisites

  • Logged in as counselor, marketing, or admin
  • Understand the difference between FAQ (exact-match, ~300 ms) and KB (semantic search, ~6 seconds but far more flexible)
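The split between the two answer paths can be pictured as a simple dispatch — a minimal sketch, assuming a dict-backed FAQ and a `kb_search` callable (both names are illustrative, not the actual HUPH API):

```python
def answer(question: str, faq: dict, kb_search) -> str:
    """Try the exact-match FAQ first (fast path), then fall back to
    semantic search over the Knowledge Base (slower but flexible)."""
    key = question.strip().lower()
    if key in faq:
        return faq[key]           # exact match: the ~300 ms path
    return kb_search(question)    # semantic search: the ~6 s path
```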

Three Entry Paths

The KB receives content from three sources:

  1. Web Sources — automated crawl from a website (e.g. uph.edu)
  2. Documents — manual upload of PDF/DOCX/TXT
  3. FAQ — manual curation (handled on the FAQ page, not indexed into the vector DB)

Current crawl state: roughly 308 pages from uph.edu are indexed (page-sitemap, programs, about, academic, Indonesian pages).

Steps

1. Open Knowledge Base

Click Knowledge base in the sidebar. You will see 4 tabs: Documents, Web Sources, Evaluation, Gaps.

2. Add a web source (crawl)

Go to Web Sources → + Add Source. Fill in:

  • URL — e.g. https://www.uph.edu/program/medicine
  • Type — pick one:
      • Single Page — just that URL
      • Sitemap — crawl every URL listed in sitemap.xml
      • Full Site — recursive crawl starting from the given URL (be careful: this can pull in many pages)
  • Schedule — one-time or recurring (weekly)

Click Save, then Sync Now to crawl immediately. The crawler runs in the background with retry logic (3 attempts, exponential backoff, 180s timeout per page).
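The retry behavior can be sketched roughly as follows — illustrative only; the function and constant names are assumptions, not the crawler's actual code:

```python
import time

MAX_ATTEMPTS = 3        # retry policy: 3 attempts
PAGE_TIMEOUT_S = 180    # 180 s timeout per page
BASE_DELAY_S = 2        # backoff base in seconds (illustrative; actual value unknown)

def crawl_with_retry(url, fetch, base_delay=BASE_DELAY_S):
    """Fetch one page, retrying with exponential backoff on failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return fetch(url, timeout=PAGE_TIMEOUT_S)
        except Exception:
            if attempt == MAX_ATTEMPTS:
                raise               # give up after the final attempt
            # doubling delay between attempts: 2 s, 4 s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```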

3. Upload a document manually

Go to Documents → + Upload. Pick a file:

  • PDF — official brochures, handbooks, requirement docs
  • DOCX — internal draft documents
  • TXT/MD — interview transcripts, manual Q&A

Click Upload. The document is processed, split into chunks, and embedded into the vector DB in ~5–30 seconds depending on size.
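Conceptually, that processing step is fixed-size chunking with overlap before embedding — a sketch only; the chunk size and overlap HUPH actually uses are not documented here:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split a document into overlapping character chunks; each chunk
    is then embedded and stored in the vector DB."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.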

4. Indexing status

Every document/page has a status:

  • Pending — waiting in the crawl/embed queue
  • Crawling / Indexing — being processed
  • Indexed (green) — ready for the chatbot
  • Failed (red) — error; hover to see the reason

If a document stays Pending for more than 10 minutes, see Troubleshooting below.

5. Evaluation — check answer quality

The Evaluation tab has two features:

Retrieval Sandbox: type a user question → see which documents Aria would retrieve to answer. Useful for debugging when users complain about inaccurate replies.
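Under the hood, retrieval is a nearest-neighbor search over embeddings. A minimal cosine-similarity sketch with toy vectors (the real system embeds text with an external model and queries the vector DB):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=3):
    """Return the IDs of the top_k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]
```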

Golden QA Dataset: 21 test questions with expected answers. Click Run Eval to execute all. The eval uses Claude as judge for faithfulness, relevancy, and context precision. Current baseline: 95.2% pass rate (20/21), avg faithfulness 0.94, avg relevancy 0.93.
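The pass/fail roll-up of such a run can be computed like this — a sketch with illustrative pass thresholds (the judge's actual cutoffs are not documented here):

```python
def summarize_eval(results, faithfulness_min=0.7, relevancy_min=0.7):
    """Aggregate per-question judge scores (each in [0, 1]) into the
    pass rate and averages shown on the Evaluation tab."""
    n = len(results)
    passed = sum(1 for r in results
                 if r["faithfulness"] >= faithfulness_min
                 and r["relevancy"] >= relevancy_min)
    return {
        "pass_rate": passed / n,
        "avg_faithfulness": sum(r["faithfulness"] for r in results) / n,
        "avg_relevancy": sum(r["relevancy"] for r in results) / n,
    }
```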

6. Gap Detection

The Gaps tab shows topics that users frequently ask about but that the KB does not cover well. Gap detection uses automated clustering — conversations with low answer quality are grouped and surfaced here.

Example: the Gaps tab shows "graduate tuition" mentioned 12 times this week with avg relevancy 0.4. Action: add a document or web source covering that topic, or create a new FAQ.
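The grouping step can be approximated as follows — a simplified sketch that assumes each conversation already carries a topic label; the real pipeline clusters low-quality conversations automatically:

```python
from collections import Counter

def detect_gaps(conversations, relevancy_threshold=0.6, min_mentions=5):
    """Surface topics that recur in conversations whose answers
    scored below the relevancy threshold."""
    low_quality = [c for c in conversations if c["relevancy"] < relevancy_threshold]
    counts = Counter(c["topic"] for c in low_quality)
    return [(topic, n) for topic, n in counts.most_common() if n >= min_mentions]
```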

Example scenarios

Add a new program source. UPH opens a new "Animation" program. The counselor goes to Web Sources → add source https://uph.edu/program/animation → Full Site crawl → Sync Now. About 5 minutes later, the new pages are indexed. Test in the Retrieval Sandbox: "What is UPH's animation program?" → the new pages appear in the retrieval results → notify marketing that the KB is updated.

Investigate a user complaint. User complaint: "Aria answered scholarship wrong, said minimum GPA 3.5 but it's 3.2". The counselor opens the Retrieval Sandbox on the Evaluation tab → types "merit scholarship minimum GPA" → looks at which documents are retrieved → finds an old brochure that says 3.5. Fix: delete or update the old document, upload the latest version, and re-index.

Troubleshooting

Crawl stuck at Pending for more than 10 minutes. Symptom: the document makes no progress. Cause: the crawler worker is stuck or the embedding API is slow (OpenAI embeddings can take 5–10 seconds per document). Fix: wait 5 more minutes; if still stuck, contact the dev team to check docker logs huph-crawler-worker --tail 20.

Document status "Failed". Symptom: red indicator in the list. Cause: corrupt PDF, odd encoding, or file > 50 MB. Fix: hover to see the error message. If it's an encoding issue, convert to a standard PDF (Adobe) and re-upload. If oversized, split into multiple files.

Eval pass rate drops below the 95% baseline. Symptom: after a large crawl, the eval drops from 95.2% to 80%. Cause: new documents may contain info that contradicts the golden dataset. Fix: contact the eval team to investigate; don't automatically delete new docs before understanding the cause.

Gap detection not updating. Symptom: the Gaps tab is empty even though users are complaining. Cause: gap detection runs via a daily batch job — not real-time. Fix: wait until the next day, or trigger it manually via the dev team.

See also