Apr 21, 2026
Sitemap crawling and file uploads
M3.3 — Auto-ingest
Highlights
- Sitemap crawler with trafilatura main-content extraction
- File uploads: PDF, DOCX, MD, CSV, JSON (per-plan size caps)
- Tiktoken chunker (1000-token chunks, 150-token overlap) with Celery fan-out for embedding
- Real-customer smoke test: 214 chunks indexed from a defense-tech customer site; RAG retrieval verified in the playground
- Re-ingest trigger exposed in the dashboard for updated sources
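
For readers wondering what "1000 tokens, 150 overlap" means in practice, here is a minimal sketch of a sliding-window chunker. It is an illustration under stated assumptions, not the shipped implementation: the function name `chunk_tokens` is hypothetical, and it works on an already-encoded list of token IDs so that it runs without any tokenizer installed.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], size: int = 1000, overlap: int = 150
) -> List[List[int]]:
    """Split tokens into windows of `size`, each sharing `overlap`
    tokens with the previous window. (Hypothetical helper.)"""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # each new window starts 850 tokens later by default
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once a window reaches the end, so no chunk is a pure
        # repeat of the previous chunk's tail.
        if start + size >= len(tokens):
            break
    return chunks


# Stand-in token IDs; a real pipeline would take these from a tokenizer.
tokens = list(range(2000))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With tiktoken, the token IDs could plausibly come from something like `tiktoken.get_encoding("cl100k_base").encode(text)`, with each chunk turned back into text via `.decode(chunk)`; the encoding name there is an assumption, since the release notes do not say which one is used. Each chunk would then be handed to its own embedding task, which is where the Celery fan-out comes in.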