Document Management
The document management system handles file uploads from any source, extracts text using the most efficient method per file type, categorizes and summarizes content using LLMs, and indexes everything into the Knowledge Base for RAG search.
Processing Pipeline
Every uploaded document flows through an 8-step pipeline:

Upload → Queue → Classify File Type → Extract Text → Categorize → Move → Summarize → Index

| Step | Description |
|---|---|
| 1. Upload | File saved to workspace/documents/uncategorized/ with UUID filename |
| 2. Queue | Document enqueued for sequential processing |
| 3. Classify | Determine extraction strategy from MIME type and extension |
| 4. Extract text | Strategy-specific: direct read, structured parse, or OCR |
| 5. Categorize | LLM assigns a category (invoices, contracts, technical, etc.) |
| 6. Move | File moved to workspace/documents/{category}/ |
| 7. Summarize | LLM generates a concise summary |
| 8. Index | Extracted text indexed into knowledge base for RAG search |
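
The step order above can be sketched as a small runner that executes handlers in sequence and stops at the first failure. This is a minimal illustration, not the actual processor: `runPipeline`, `DocRecord`, and `StepHandler` are hypothetical names, and real handlers would almost certainly be asynchronous.

```typescript
// Step names follow the pipeline table; everything else is hypothetical.
const PIPELINE_STEPS = [
  "upload", "queue", "classify", "extract",
  "categorize", "move", "summarize", "index",
] as const;

type PipelineStep = (typeof PIPELINE_STEPS)[number];

interface DocRecord {
  id: string;
  completedSteps: PipelineStep[];
  status: "processing" | "completed" | "failed";
  error?: string;
}

// A handler performs one step and throws on failure.
type StepHandler = (doc: DocRecord) => void;

function runPipeline(
  id: string,
  handlers: Partial<Record<PipelineStep, StepHandler>>,
): DocRecord {
  const doc: DocRecord = { id, completedSteps: [], status: "processing" };
  for (const step of PIPELINE_STEPS) {
    try {
      handlers[step]?.(doc); // steps without a handler are treated as no-ops
      doc.completedSteps.push(step);
    } catch (err) {
      doc.status = "failed";
      doc.error = `${step}: ${err instanceof Error ? err.message : String(err)}`;
      return doc; // later steps never run after a failure
    }
  }
  doc.status = "completed";
  return doc;
}
```

Stopping at the first failing step means a document that cannot be categorized is never moved, summarized, or indexed in this sketch.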
Upload Sources
Upload via the Documents page at /documents. It supports drag-and-drop and multi-file selection.

To upload through the API:

    curl -X POST http://localhost:3005/api/documents/upload \
      -H "Authorization: Bearer <token>" \

File attachments from Telegram, Slack, WhatsApp, and Teams are automatically downloaded and processed. The attachment handler detects processable file types and enqueues them.
Agents with the documents tool can list, view, and search uploaded documents via tool calls.
Extraction Strategies
Not every file needs OCR. The processor classifies each file and picks the most efficient extraction method, so Office documents and text files never touch the OCR model.
Strategy Overview
| Strategy | File Types | Method | Model Required |
|---|---|---|---|
| Text | Plain text, code, config, markup | Direct file read | No |
| Structured | Word, Excel, PowerPoint | Parse XML inside ZIP via jszip | No |
| OCR | Images, PDFs | Vision model (glm-ocr) | Yes |
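
As a rough sketch of how the classification step might map a file to one of these strategies, the function below checks the extension first and falls back to the MIME type. The extension and MIME lists are taken from this page. The function name `classifyStrategy`, the `"unsupported"` fallback, and the precedence of extension over MIME type are assumptions made for illustration.

```typescript
type ExtractionStrategy = "text" | "structured" | "ocr" | "unsupported";

const TEXT_EXTENSIONS = new Set([
  ".txt", ".md", ".csv", ".json", ".xml", ".yaml", ".yml", ".log", ".ini",
  ".conf", ".toml", ".env", ".html", ".htm", ".css", ".js", ".ts", ".py",
  ".sh", ".bash", ".sql",
]);
const OFFICE_EXTENSIONS = new Set([".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"]);
const OCR_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".pdf"]);
const OCR_MIME_TYPES = new Set([
  "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/webp", "application/pdf",
]);

// Returns the lowercased extension including the dot, or "" if there is none.
function extensionOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot === -1 ? "" : filename.slice(dot).toLowerCase();
}

// Hypothetical classifier: extension wins, MIME type is only a fallback for OCR types.
function classifyStrategy(filename: string, mimeType = ""): ExtractionStrategy {
  const ext = extensionOf(filename);
  if (TEXT_EXTENSIONS.has(ext)) return "text";
  if (OFFICE_EXTENSIONS.has(ext)) return "structured";
  if (OCR_EXTENSIONS.has(ext) || OCR_MIME_TYPES.has(mimeType.toLowerCase())) return "ocr";
  return "unsupported";
}
```

For example, `classifyStrategy("report.docx")` yields `"structured"`, while a file named `scan` with MIME type `image/png` yields `"ocr"`.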
Text — Direct Read
Files that are already human-readable are read directly with zero model calls.
Extensions: .txt, .md, .csv, .json, .xml, .yaml, .yml, .log, .ini, .conf, .toml, .env, .html, .htm, .css, .js, .ts, .py, .sh, .bash, .sql
Structured — Office Document Parsing
Modern Office formats (.docx, .xlsx, .pptx) are ZIP archives containing XML. The processor extracts text directly from the XML structure, which is fast and accurate and requires no LLM or OCR. Legacy binary formats (.doc, .xls, .ppt) fall back to printable string extraction.
| Format | Extensions | Extraction Method |
|---|---|---|
| Word | .docx | Paragraphs from word/document.xml (<w:t> elements) |
| Word (legacy) | .doc | Printable string extraction from binary OLE2 |
| Excel | .xlsx | Shared strings + cell values, tab-separated rows per sheet |
| Excel (legacy) | .xls | Printable string extraction from binary OLE2 |
| PowerPoint | .pptx | Slide text from <a:t> elements, labeled per slide |
| PowerPoint (legacy) | .ppt | Printable string extraction from binary OLE2 |
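
To illustrate the Excel row of the table, the sketch below renders already-parsed sheet cells as tab-separated rows under a sheet label. In .xlsx files, a cell marked as a shared string stores an index into the workbook's shared strings table rather than the text itself. The `SheetCell` shape and the `renderSheet` function are hypothetical, and unzipping and XML parsing are assumed to have happened before this point.

```typescript
// Hypothetical shape for a parsed cell: `raw` is the content of the cell's value element.
interface SheetCell {
  raw: string;
  isSharedString: boolean; // true when the cell refers to the shared strings table
}

function renderSheet(
  sheetNumber: number,
  rows: SheetCell[][],
  sharedStrings: string[],
): string {
  const lines = [`--- Sheet ${sheetNumber} ---`];
  for (const row of rows) {
    const values = row.map((cell) =>
      // Shared-string cells hold an index; other cells hold the value directly.
      cell.isSharedString ? sharedStrings[Number(cell.raw)] ?? "" : cell.raw,
    );
    lines.push(values.join("\t"));
  }
  return lines.join("\n");
}
```

Numeric cells such as salaries are stored inline, so they pass through unchanged while header and name cells are resolved through the shared strings table.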
Example Excel output:
    --- Sheet 1 ---
    Name	Department	Salary
    Alice	Engineering	95000
    Bob	Marketing	82000

OCR — Vision Model
Only images and image-heavy PDFs are sent to the OCR model. PDFs first attempt direct text extraction; if the printable character ratio is high enough, OCR is skipped entirely.
| Format | Extensions | MIME Types |
|---|---|---|
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .webp | image/png, image/jpeg, image/tiff, image/bmp, image/webp |
| PDFs | .pdf | application/pdf |
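
The PDF pre-check described in this section could look roughly like the sketch below, which measures the share of printable characters in directly extracted text. The page only says the ratio must be "high enough", so the `0.85` threshold is a made-up placeholder, and `printableRatio` and `needsOcr` are hypothetical names.

```typescript
// Fraction of characters that are printable (tab, newline, and carriage return count as printable).
function printableRatio(text: string): number {
  if (text.length === 0) return 0;
  let printable = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isWhitespace = code === 9 || code === 10 || code === 13;
    // Control characters, DEL, and the Unicode replacement character count as non-printable.
    const isControl = code < 32 || code === 127 || code === 0xfffd;
    if (isWhitespace || !isControl) printable++;
  }
  return printable / text.length;
}

// Hypothetical threshold; the real value is not documented on this page.
const OCR_SKIP_THRESHOLD = 0.85;

// True when direct extraction looks too noisy (or empty) and OCR should run.
function needsOcr(directText: string, threshold = OCR_SKIP_THRESHOLD): boolean {
  return printableRatio(directText) < threshold;
}
```

An empty extraction result has a ratio of 0 here, so a scanned PDF with no text layer would be routed to OCR.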
The OCR model (glm-ocr via Ollama) handles:
- Scanned documents and forms
- Photos of receipts and invoices
- Screenshots with text content
- Handwritten notes (best effort)
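
For orientation, here is a sketch of the request body such an OCR call might send. Ollama's `/api/generate` endpoint accepts base64-encoded images in an `images` array alongside `model` and `prompt`. Whether the processor uses this endpoint, and the prompt wording shown, are assumptions; `buildOcrRequest` and `ocrUrl` are hypothetical helpers.

```typescript
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data without a data: URL prefix
  stream: boolean;
}

function buildOcrRequest(imageBase64: string, model = "glm-ocr"): OllamaGenerateRequest {
  return {
    model,
    prompt: "Extract all text from this image.", // hypothetical prompt wording
    // Strip a leading data: URL prefix if the caller passed one.
    images: [imageBase64.replace(/^data:[^;]+;base64,/, "")],
    stream: false, // request a single JSON response instead of a stream
  };
}

// Joins the configured endpoint with the generate route, tolerating trailing slashes.
function ocrUrl(endpoint = "http://localhost:11435"): string {
  return `${endpoint.replace(/\/+$/, "")}/api/generate`;
}
```

The body would be sent as JSON in a POST request to the URL returned by `ocrUrl`, and the extracted text read from the `response` field of the reply.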
Categories
Documents are categorized by LLM analysis into:
| Category | Examples |
|---|---|
| invoices | Bills, payment requests |
| contracts | Agreements, terms of service |
| reports | Business reports, analyses |
| correspondence | Emails, letters |
| technical | Manuals, specifications, API docs |
| receipts | Purchase confirmations |
| legal | Legal filings, compliance docs |
| financial | Statements, budgets |
| other | Uncategorizable content |
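
Because LLM replies are free text, the categorization step presumably needs to map a reply onto one of these fixed categories. The sketch below shows one way to do that, assuming the model is prompted to answer with a single category word. `normalizeCategory` is a hypothetical helper, and falling back to `other` for unrecognized replies is an assumption.

```typescript
const CATEGORIES = [
  "invoices", "contracts", "reports", "correspondence", "technical",
  "receipts", "legal", "financial", "other",
] as const;

type Category = (typeof CATEGORIES)[number];

// Maps a raw LLM reply to a known category; anything unrecognized becomes "other".
function normalizeCategory(llmReply: string): Category {
  // Drop whitespace, punctuation, and casing differences such as " Invoices.\n".
  const cleaned = llmReply.trim().toLowerCase().replace(/[^a-z]/g, "");
  const match = CATEGORIES.find((c) => c === cleaned);
  return match ?? "other";
}
```

Constraining the result to a closed set keeps the target directory in the Move step predictable.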
Storage
Documents are stored on the filesystem:

    workspace/documents/
    ├── uncategorized/    # Newly uploaded, awaiting processing
    ├── invoices/
    ├── contracts/
    ├── reports/
    ├── technical/
    └── ...

Each file uses a UUID filename to avoid collisions. The original filename is preserved in the database.
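
A minimal sketch of how a stored path could be derived from a UUID while keeping the original name for the database record. The `StoredFile` shape and `storedFileFor` function are hypothetical, and keeping the original extension on the UUID filename is an assumption that this page does not confirm.

```typescript
interface StoredFile {
  path: string;         // where the file lives on disk
  originalName: string; // preserved for display and for the database record
}

function storedFileFor(
  baseDir: string,
  category: string,
  uuid: string,
  originalName: string,
): StoredFile {
  const dot = originalName.lastIndexOf(".");
  // Assumption: the stored file keeps the original extension, lowercased.
  const ext = dot > 0 ? originalName.slice(dot).toLowerCase() : "";
  const dir = baseDir.replace(/\/+$/, ""); // tolerate a trailing slash
  return { path: `${dir}/${category}/${uuid}${ext}`, originalName };
}
```

Moving a document after categorization then amounts to calling the same function with the assigned category instead of `uncategorized`.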
Real-Time Processing Status
The document queue emits events for each processing stage:
| Event | Description |
|---|---|
| enqueued | Document added to queue |
| processing | OCR/categorization started |
| completed | Pipeline finished successfully |
| failed | Processing failed with error |
Events are forwarded to connected WebSocket clients (filtered by user ID), enabling real-time UI updates without polling.
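
The per-user filtering can be sketched as follows. The event names come from the table above. The payload fields (`documentId`, `userId`, `error`), the `ClientConn` shape, and the `forwardEvent` function are assumptions for illustration and do not describe the actual wire format.

```typescript
type QueueEventType = "enqueued" | "processing" | "completed" | "failed";

interface QueueEvent {
  type: QueueEventType;
  documentId: string;
  userId: string; // owner of the document (assumed field)
  error?: string; // only meaningful for "failed"
}

// Stand-in for a connected WebSocket client.
interface ClientConn {
  userId: string;
  send: (payload: string) => void;
}

// Sends the event only to connections that belong to the document's owner.
// Returns the number of connections that received it.
function forwardEvent(evt: QueueEvent, clients: ClientConn[]): number {
  const payload = JSON.stringify(evt);
  let delivered = 0;
  for (const client of clients) {
    if (client.userId !== evt.userId) continue;
    client.send(payload);
    delivered++;
  }
  return delivered;
}
```

A user with several open tabs would receive the event on each connection, while other users receive nothing.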
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/documents/upload | POST | Upload files (multipart, max 50 MB by default) |
| /api/documents | GET | List with category, status, limit filters |
| /api/documents/:id | GET | Full details: OCR text, summary, metadata |
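
As an illustration of the list endpoint, the helper below builds a request URL from optional filters. It assumes the filters are passed as query parameters named `category`, `status`, and `limit`, which this page implies but does not spell out. `buildListUrl` and `ListFilters` are hypothetical names.

```typescript
interface ListFilters {
  category?: string;
  status?: string;
  limit?: number;
}

// Builds the list URL; filters that are not set are left out of the query string.
function buildListUrl(baseUrl: string, filters: ListFilters = {}): string {
  const params: string[] = [];
  if (filters.category) params.push(`category=${encodeURIComponent(filters.category)}`);
  if (filters.status) params.push(`status=${encodeURIComponent(filters.status)}`);
  if (filters.limit !== undefined) params.push(`limit=${filters.limit}`);
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  return `${baseUrl.replace(/\/+$/, "")}/api/documents${query}`;
}
```

The resulting URL would be requested with GET and the same `Authorization: Bearer <token>` header used for uploads.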
WebUI (/documents)
- Upload dialog: Drag-and-drop zone with file picker, multi-file support
- Document list: Cards with category badge, status badge, size, date, summary preview
- Queue banner: Shows processing progress (“2 queued, 1 processing”)
- Detail modal: Full OCR text (scrollable), LLM summary, processing timestamps, metadata
- Filters: Search by name, filter by category pills and status dropdown
Configuration
| Setting | Default | Description |
|---|---|---|
| workspace.documentsPath | ./workspace/documents | Base storage directory |
| workspace.maxUploadSize | 52428800 (50 MB) | Maximum upload size in bytes |
| workspace.ocrEndpoint | http://localhost:11435 | Ollama endpoint for OCR model |
| workspace.ocrModel | glm-ocr | Vision model used for image/PDF text extraction |
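
The settings above can be modeled as a typed object with defaults, as in this sketch. The default values are taken from the table. `WorkspaceConfig`, `resolveWorkspaceConfig`, and `isWithinUploadLimit` are hypothetical names, and treating the size limit as inclusive is an assumption.

```typescript
interface WorkspaceConfig {
  documentsPath: string;
  maxUploadSize: number; // bytes
  ocrEndpoint: string;
  ocrModel: string;
}

const DEFAULT_WORKSPACE_CONFIG: WorkspaceConfig = {
  documentsPath: "./workspace/documents",
  maxUploadSize: 50 * 1024 * 1024, // 52428800 bytes
  ocrEndpoint: "http://localhost:11435",
  ocrModel: "glm-ocr",
};

// Overlays user-provided settings on top of the defaults.
function resolveWorkspaceConfig(overrides: Partial<WorkspaceConfig> = {}): WorkspaceConfig {
  return { ...DEFAULT_WORKSPACE_CONFIG, ...overrides };
}

// Assumes a file exactly at the limit is still accepted.
function isWithinUploadLimit(sizeBytes: number, config: WorkspaceConfig): boolean {
  return sizeBytes <= config.maxUploadSize;
}
```

For example, `resolveWorkspaceConfig({ maxUploadSize: 10 * 1024 * 1024 })` lowers the limit to 10 MB while leaving the other defaults in place.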