Knowledge sources (RAG)

A bot without knowledge is a generic chatbot. Mevichat turns your docs, help center, and internal content into embeddings so the bot answers from your facts — with citations.

Source types

Each bot can attach one or more sources. You can mix types on the same bot.

URL (single page) — crawl one page. Best for a pricing page, an FAQ, or a product detail.
Sitemap — point at https://example.com/sitemap.xml and we crawl every URL listed. Best for docs sites and marketing sites.
Upload — drop in PDF, DOCX, MD, CSV, or JSON. Best for policies, manuals, and exports.
Manual paste — paste raw text directly. Best for quick FAQs you want to control word-for-word.

How ingestion works

When you add a source, Mevichat fans out a Celery job that runs the same pipeline regardless of type:

Fetch — sitemap crawl or file read.
Extract — trafilatura pulls the main content from HTML and strips nav, ads, and boilerplate. Office files go through dedicated parsers.
Chunk — tiktoken (cl100k_base) splits the text into ~500-token chunks with 50-token overlap, anchored on Markdown H1/H2/H3 boundaries so sections don't bleed into each other.
Embed — each chunk is sent to text-embedding-3-small (1536 dimensions).
Index — vectors land in pgvector and are queried by nearest-neighbor search at query time.
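As a rough illustration of the Chunk step, here is a minimal Python sketch of heading-anchored splitting followed by overlapping windows. It is not Mevichat's implementation: split_sections and chunk_tokens are hypothetical names, and a plain whitespace split stands in for the tiktoken cl100k_base tokenizer so the example runs without extra dependencies.

```python
import re
from typing import List

# Matches Markdown H1, H2, and H3 headings at the start of a line.
HEADING = re.compile(r"^#{1,3} ", re.MULTILINE)


def split_sections(markdown: str) -> List[str]:
    """Cut text at H1/H2/H3 headings so a chunk never spans two sections."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def chunk_tokens(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Slide a window of `size` tokens, repeating `overlap` tokens each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the section
    return chunks


def chunk_document(markdown: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Chunk each section on its own. Whitespace split is a stand-in tokenizer."""
    out = []
    for section in split_sections(markdown):
        for window in chunk_tokens(section.split(), size, overlap):
            out.append(" ".join(window))
    return out
```

With these assumed defaults, a 1,200-token section yields three chunks of 500, 500, and 300 tokens, and each chunk starts with the last 50 tokens of the one before it.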

The defaults (500-token chunks, 50-token overlap) work well for most docs. Chunk sizing isn't user-configurable today; if you're seeing retrieval issues that look structural rather than content-related, get in touch and we'll take a look.

Sitemap crawler

Sitemaps are the most efficient way to onboard a public knowledge base.

https://docs.example.com/sitemap.xml

What to expect:

We fetch the sitemap index, follow nested sitemaps (up to two levels deep), and queue every <loc> URL up to your plan's URL cap.
Each URL is extracted in isolation — a broken page does not fail the whole ingest.
Duplicate URLs are deduped as they're collected.
Fetches are SSRF-protected: localhost, RFC 1918 private ranges, link-local addresses, and metadata endpoints are refused, and HTTP redirects are not followed.
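The collection and safety rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the crawler itself: collect_urls and is_blocked_ip are hypothetical names, the sitemap is assumed to use the standard sitemaps.org namespace, and the address check expects an IP that has already been resolved from the hostname.

```python
import ipaddress
import xml.etree.ElementTree as ET
from typing import List

# Fully qualified tag name for <loc> in the sitemaps.org 0.9 schema.
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def collect_urls(sitemap_xml: str, url_cap: int) -> List[str]:
    """Return unique <loc> URLs in document order, stopping at url_cap."""
    root = ET.fromstring(sitemap_xml)
    seen = set()
    urls = []
    for el in root.iter(LOC_TAG):
        if len(urls) >= url_cap:
            break  # plan cap reached, ignore the rest
        url = (el.text or "").strip()
        if url and url not in seen:  # dedupe as URLs are collected
            seen.add(url)
            urls.append(url)
    return urls


def is_blocked_ip(address: str) -> bool:
    """True for loopback, private (RFC 1918 and similar), and link-local IPs.

    Cloud metadata endpoints such as 169.254.169.254 fall in the
    link-local range, so they are covered by the same check.
    """
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private or ip.is_link_local
```

A real fetcher would run the address check on every resolved IP before connecting, which is also why not following redirects keeps the check simple: there is only one destination to validate per URL.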

Re-ingest at any time from the source's row menu to pull the latest copy of every URL in the sitemap.

Knowledge limits per plan

Knowledge limits are enforced per bot (sources, chunks, sitemap URLs) and per file (upload size), plus a monthly message cap at the workspace level.

Plan	Max file size	Sources / bot	Chunks / bot	Sitemap URLs	Messages / mo
Free	5 MB	1	100	20	100
Pro	50 MB	10	5 000	500	5 000
Scale	200 MB	50	25 000	5 000	25 000

If you hit the per-file limit, split the file (e.g. one PDF per chapter) or upgrade. The UI surfaces remaining quota as you upload.
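If you want to pre-check an upload in your own tooling, the table above translates into a small lookup. The sketch below is an assumption-laden example rather than an official client: PlanLimits, PLANS, and upload_problems are hypothetical names, and 1 MB is taken to mean 1024 * 1024 bytes, which the table does not specify.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class PlanLimits:
    file_bytes: int
    sources_per_bot: int
    chunks_per_bot: int
    sitemap_urls: int
    messages_per_month: int


# Values copied from the plan table.
PLANS = {
    "free": PlanLimits(5 * MB, 1, 100, 20, 100),
    "pro": PlanLimits(50 * MB, 10, 5_000, 500, 5_000),
    "scale": PlanLimits(200 * MB, 50, 25_000, 5_000, 25_000),
}


def upload_problems(plan: str, file_bytes: int, current_sources: int,
                    current_chunks: int, new_chunks: int) -> List[str]:
    """List every limit a new file upload would exceed (empty means OK)."""
    limits = PLANS[plan]
    problems = []
    if file_bytes > limits.file_bytes:
        problems.append("file_too_large")
    if current_sources + 1 > limits.sources_per_bot:
        problems.append("too_many_sources")
    if current_chunks + new_chunks > limits.chunks_per_bot:
        problems.append("too_many_chunks")
    return problems
```

Returning all violations at once, instead of stopping at the first, lets a caller decide whether splitting the file is enough or an upgrade is needed.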

Embedding status

Every source has a lifecycle you can watch in the Knowledge tab:

pending — queued, waiting for a Celery worker.
processing — fetching, extracting, chunking, embedding.
ready — all chunks indexed. The bot can use them immediately.
failed — something blew up. Click the row to see the error (common causes: 403 on the URL, malformed PDF, quota exceeded).

A failed source can be retried from its row menu. We keep the original input, so a retry does not require re-uploading.
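One way to picture the lifecycle is as a small state machine. The Python sketch below is only a mental model: the four status names come from the list above, but the allowed transitions are inferred from the descriptions (retry and re-ingest both re-queue the source), and Status, ALLOWED, and advance are hypothetical names.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING},                # a worker picks it up
    Status.PROCESSING: {Status.READY, Status.FAILED},   # ingest finishes
    Status.READY: {Status.PENDING},                     # re-ingest
    Status.FAILED: {Status.PENDING},                    # retry
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

For example, advance(Status.FAILED, Status.PENDING) models a retry, while advance(Status.PENDING, Status.READY) raises because a source cannot become ready without being processed.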

Re-ingest trigger

Content drifts. When your docs change:

Open the source row in the Knowledge tab.
Click Re-ingest.
We re-fetch and re-embed the source. While it's processing, the bot keeps answering from the previous chunks.

Re-ingesting a sitemap source re-fetches every URL in the sitemap — there's no lastmod-based incremental mode today.
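The "keeps answering from the previous chunks" behavior can be modeled as a build-then-swap pattern. The sketch below shows one way such behavior could work and is not a description of Mevichat's internals: SourceIndex, live_chunks, and reingest are hypothetical names, and chunks are plain strings for brevity.

```python
from typing import Callable, List


class SourceIndex:
    """Holds the chunks the bot may retrieve from for one source."""

    def __init__(self, chunks: List[str]):
        self._live = list(chunks)

    def live_chunks(self) -> List[str]:
        """Chunks currently visible to retrieval."""
        return list(self._live)

    def reingest(self, build_chunks: Callable[[], List[str]]) -> None:
        """Build the new chunk set fully, then swap it in.

        If build_chunks raises (fetch error, bad file), the assignment
        below never runs and the previous chunks stay live.
        """
        staged = list(build_chunks())
        self._live = staged
```

Because the swap happens only after the new set is complete, a failed re-ingest in this model leaves the old content in place instead of an empty index.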

Chunk inspection in playground

Before you go live, verify that retrieval returns the chunks you expect.

Open any conversation in the dashboard and click Show debug on a message. The debug panel lists the IDs of the chunks passed to the model as context (RAG chunks), along with token counts and latency. If the top chunks are off-topic, the fix is usually on the content side: tighter headings, clearer pages, fewer SEO-stuffed paragraphs.
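To build intuition for why a given chunk shows up in the debug panel, it helps to see how nearest-neighbor ranking works on toy data. The Python sketch below is a simplified illustration: it assumes cosine similarity (pgvector also supports L2 and inner-product distance, and this page does not say which one is used), it uses 2-dimensional vectors instead of 1536, and cosine and top_chunks are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query: Sequence[float],
               chunks: Dict[str, Sequence[float]],
               k: int = 3) -> List[Tuple[str, float]]:
    """Rank chunk IDs by similarity to the query, best first."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A chunk that mixes several topics ends up with a vector that points between them, so it scores moderately against many queries and strongly against none. That is the mechanism behind the advice to keep pages short and focused.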

See Creating your first bot for the full playground workflow, or Widget theming once your retrieval looks good.

Tips for high-quality retrieval

Prefer short, focused pages over long catch-all documents.
Use real headings (h1, h2, h3) — they become chunk boundaries.
Strip template boilerplate before uploading PDFs — repeated footers dilute embeddings.
Add sources incrementally. Ingest one, test in playground, add the next.
When you delete a source, the associated chunks are removed from the index immediately.