Document Management

The document management system handles file uploads from any source, extracts text using the most efficient method per file type, categorizes and summarizes content using LLMs, and indexes everything into the Knowledge Base for RAG search.

Processing Pipeline

Every uploaded document flows through an 8-step pipeline:

Upload → Queue → Classify File Type → Extract Text → Categorize → Move → Summarize → Index

Step	Description
1. Upload	File saved to `workspace/documents/uncategorized/` with UUID filename
2. Queue	Document enqueued for sequential processing
3. Classify	Determine extraction strategy from MIME type and extension
4. Extract text	Strategy-specific: direct read, structured parse, or OCR
5. Categorize	LLM assigns a category (invoices, contracts, technical, etc.)
6. Move	File moved to `workspace/documents/{category}/`
7. Summarize	LLM generates a concise summary
8. Index	Extracted text indexed into knowledge base for RAG search
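The steps in the table can be read as a fixed sequence of stages applied to one document at a time. The sketch below is a minimal Python illustration of that shape only, not the system's implementation: the `Document` record, `make_stage`, and `drain` are hypothetical names, and each stage is a placeholder that merely records that it ran.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record tracking one upload through the pipeline."""
    name: str
    history: list = field(default_factory=list)


def make_stage(label):
    """Build a placeholder stage that only records that it ran."""
    def stage(doc):
        doc.history.append(label)
        return doc
    return stage


# Steps 3-8 from the table; upload and enqueueing happen before this loop.
STAGES = [make_stage(label) for label in
          ("classify", "extract", "categorize", "move", "summarize", "index")]


def drain(queue):
    """Process queued documents one at a time, in arrival order."""
    done = []
    while queue:
        doc = queue.popleft()
        for stage in STAGES:
            doc = stage(doc)
        done.append(doc)
    return done


queue = deque([Document("a.pdf"), Document("b.docx")])
processed = drain(queue)
print([doc.name for doc in processed])
```

Because the queue is drained sequentially, a second document does not start until the first has passed through every stage.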

Upload Sources

WebUI: Upload via the Documents page at /documents, which supports drag-and-drop and multi-file selection.

API: Send one or more files as multipart form fields. Replace `<field>` and the file paths with your own values:

curl -X POST http://localhost:3005/api/documents/upload \
  -H "Authorization: Bearer <token>" \
  -F "<field>=@<first-file>" \
  -F "<field>=@<second-file>"

Agents: Agents with the documents tool can list, view, and search uploaded documents via tool calls.

Extraction Strategies

Not every file needs OCR. The processor classifies each file and picks the most efficient extraction method — Office documents and text files never touch the OCR model.

Strategy Overview

Strategy	File Types	Method	Model Required
Text	Plain text, code, config, markup	Direct file read	No
Structured	Word, Excel, PowerPoint	Parse XML inside ZIP via `jszip`	No
OCR	Images, PDFs	Vision model (`glm-ocr`)	Yes
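The strategy choice can be sketched as a lookup on the file extension with the MIME type as a fallback. This is a minimal Python sketch under assumptions: the extension and MIME lists come from this page, but the `classify` function, the precedence of extension over MIME type, and the `None` result for unrecognized files are hypothetical.

```python
import os

TEXT_EXTS = {".txt", ".md", ".csv", ".json", ".xml", ".yaml", ".yml", ".log",
             ".ini", ".conf", ".toml", ".env", ".html", ".htm", ".css", ".js",
             ".ts", ".py", ".sh", ".bash", ".sql"}
STRUCTURED_EXTS = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"}
OCR_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".pdf"}
OCR_MIMES = {"image/png", "image/jpeg", "image/tiff", "image/bmp",
             "image/webp", "application/pdf"}


def classify(filename, mime_type=""):
    """Return 'text', 'structured', 'ocr', or None (assumed fallback)."""
    name = os.path.basename(filename).lower()
    # splitext() reports no extension for dotfiles such as ".env",
    # so fall back to the whole name in that case.
    ext = os.path.splitext(name)[1] or name
    if ext in TEXT_EXTS:
        return "text"
    if ext in STRUCTURED_EXTS:
        return "structured"
    if ext in OCR_EXTS or mime_type in OCR_MIMES:
        return "ocr"
    if mime_type.startswith("text/"):
        return "text"
    return None


for sample in ("notes.md", "Report.DOCX", "scan.pdf", ".env"):
    print(sample, "->", classify(sample))
```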

Text — Direct Read

Files that are already human-readable are read directly with zero model calls.

Extensions: .txt, .md, .csv, .json, .xml, .yaml, .yml, .log, .ini, .conf, .toml, .env, .html, .htm, .css, .js, .ts, .py, .sh, .bash, .sql

Structured — Office Document Parsing

Modern Office formats (.docx, .xlsx, .pptx) are ZIP archives containing XML. The processor extracts text directly from the XML structure — fast, accurate, and requires no LLM or OCR.

Format	Extensions	Extraction Method
Word	`.docx`	Paragraphs from `word/document.xml` (`<w:t>` elements)
Word (legacy)	`.doc`	Printable string extraction from binary OLE2
Excel	`.xlsx`	Shared strings + cell values, tab-separated rows per sheet
Excel (legacy)	`.xls`	Printable string extraction from binary OLE2
PowerPoint	`.pptx`	Slide text from `<a:t>` elements, labeled per slide
PowerPoint (legacy)	`.ppt`	Printable string extraction from binary OLE2
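To make the Word row concrete, here is a minimal Python sketch that reads `word/document.xml` from a ZIP archive and joins the `<w:t>` runs of each paragraph. It uses Python's standard library rather than `jszip`, so it illustrates the technique and is not the processor's code. The `docx_text` and `make_sample_docx` names are hypothetical, and the sample archive contains only the one part the sketch reads, so it is not a complete `.docx` file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return one line per non-empty paragraph, joining its <w:t> runs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample_docx() -> bytes:
    """Build a tiny in-memory archive holding only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


print(docx_text(make_sample_docx()))
```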

Example Excel output:

--- Sheet 1 ---
Name    Department    Salary
Alice   Engineering   95000
Bob     Marketing     82000
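Output of this shape can be produced by resolving shared-string indexes against `xl/sharedStrings.xml` and joining each row's cells with tabs. The following Python sketch shows the idea under assumptions: `xlsx_sheet_text` and `make_sample_xlsx` are hypothetical names, the sheet label is passed in rather than read from the workbook, and empty cells (which real sheets simply omit) are not padded.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS_URI = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = "{" + NS_URI + "}"


def xlsx_sheet_text(data, sheet="xl/worksheets/sheet1.xml", label="Sheet 1"):
    """Render one worksheet as a header line plus tab-separated rows."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        shared_root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
        sheet_root = ET.fromstring(archive.read(sheet))
    shared = ["".join(t.text or "" for t in si.iter(NS + "t"))
              for si in shared_root.iter(NS + "si")]
    lines = [f"--- {label} ---"]
    for row in sheet_root.iter(NS + "row"):
        cells = []
        for cell in row.iter(NS + "c"):
            value = cell.findtext(NS + "v", default="")
            if cell.get("t") == "s":  # value is an index into shared strings
                value = shared[int(value)]
            cells.append(value)
        lines.append("\t".join(cells))
    return "\n".join(lines)


def make_sample_xlsx() -> bytes:
    """Build a tiny archive with only the two parts this sketch reads."""
    strings = ["Name", "Department", "Salary", "Alice", "Engineering"]
    shared = (f'<sst xmlns="{NS_URI}">'
              + "".join(f"<si><t>{s}</t></si>" for s in strings) + "</sst>")
    sheet = (f'<worksheet xmlns="{NS_URI}"><sheetData>'
             '<row><c t="s"><v>0</v></c><c t="s"><v>1</v></c>'
             '<c t="s"><v>2</v></c></row>'
             '<row><c t="s"><v>3</v></c><c t="s"><v>4</v></c>'
             "<c><v>95000</v></c></row>"
             "</sheetData></worksheet>")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("xl/sharedStrings.xml", shared)
        archive.writestr("xl/worksheets/sheet1.xml", sheet)
    return buf.getvalue()


print(xlsx_sheet_text(make_sample_xlsx()))
```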

OCR — Vision Model

Only images and image-heavy PDFs are sent to the OCR model. For PDFs, the processor first attempts direct text extraction; if the ratio of printable characters in the extracted text is high enough, OCR is skipped entirely.
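The printable-character check for PDFs might look like the Python sketch below. The threshold of 0.85 is an assumption chosen for illustration, since this page does not state the real value, and `printable_ratio` and `needs_ocr` are hypothetical names. How the raw text is pulled out of the PDF is outside the sketch.

```python
import string

# Byte values of ASCII letters, digits, punctuation, and whitespace.
PRINTABLE = set(string.printable.encode("ascii"))


def printable_ratio(extracted: bytes) -> float:
    """Fraction of bytes that are printable ASCII; 0.0 for empty input."""
    if not extracted:
        return 0.0
    return sum(byte in PRINTABLE for byte in extracted) / len(extracted)


def needs_ocr(extracted: bytes, threshold: float = 0.85) -> bool:
    """Fall back to OCR when direct extraction yields mostly binary noise."""
    return printable_ratio(extracted) < threshold


print(needs_ocr(b"Invoice total: 42.00 EUR\n"))
print(needs_ocr(bytes(range(32)) * 4))
```

A PDF with an embedded text layer yields mostly printable bytes and skips OCR, while a scanned PDF yields little or no usable text and is routed to the vision model.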

Format	Extensions	MIME Types
Images	`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`	`image/png`, `image/jpeg`, `image/tiff`, `image/bmp`, `image/webp`
PDF	`.pdf`	`application/pdf`

The OCR model (glm-ocr via Ollama) handles:

Scanned documents and forms
Photos of receipts and invoices
Screenshots with text content
Handwritten notes (best effort)
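For orientation, a request to a vision model served by Ollama can be built as in the Python sketch below. It assumes Ollama's `/api/generate` endpoint with base64-encoded `images`; whether the processor uses this endpoint, and the prompt text, are assumptions. The `build_ocr_request` and `run_ocr` helpers are hypothetical, and the endpoint and model defaults are taken from the Configuration table.

```python
import base64
import json
import urllib.request


def build_ocr_request(image_bytes, endpoint="http://localhost:11435",
                      model="glm-ocr"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        "prompt": "Extract all text from this image.",  # assumed prompt
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return urllib.request.Request(
        f"{endpoint}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ocr(image_bytes):
    """Send the request; requires a running Ollama instance with the model."""
    with urllib.request.urlopen(build_ocr_request(image_bytes)) as resp:
        return json.load(resp)["response"]


request = build_ocr_request(b"\x89PNG")
print(request.get_method(), request.full_url)
```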

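Categories

In the Categorize step, the LLM assigns each document to one of the following categories:

Category	Examples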
`invoices`	Bills, payment requests
`contracts`	Agreements, terms of service
`reports`	Business reports, analyses
`correspondence`	Emails, letters
`technical`	Manuals, specifications, API docs
`receipts`	Purchase confirmations
`legal`	Legal filings, compliance docs
`financial`	Statements, budgets
`other`	Uncategorizable content
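Since the category comes from free-form LLM output, it is worth validating before it is used as a directory name. The Python sketch below shows one way to do that; falling back to `other` for an unexpected reply is an assumption, and `normalize_category` and `destination` are hypothetical helpers.

```python
from pathlib import PurePosixPath

CATEGORIES = ("invoices", "contracts", "reports", "correspondence",
              "technical", "receipts", "legal", "financial", "other")


def normalize_category(llm_reply: str) -> str:
    """Map a raw model reply onto a known category, else 'other'."""
    candidate = llm_reply.strip().strip("`'\".").lower()
    return candidate if candidate in CATEGORIES else "other"


def destination(category, stored_name, base="workspace/documents"):
    """Target path for the Move step: {base}/{category}/{stored_name}."""
    return str(PurePosixPath(base) / category / stored_name)


category = normalize_category(" Invoices.\n")
print(destination(category, "example.pdf"))
```

Restricting the value to a fixed set also prevents a malformed reply from producing an unexpected path.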

Storage

Documents are stored on the filesystem:

workspace/documents/
├── uncategorized/      # Newly uploaded, awaiting processing
├── invoices/
├── contracts/
├── reports/
├── technical/
└── ...

Each file uses a UUID filename to avoid collisions. The original filename is preserved in the database.
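A UUID-based stored name can be generated as in the Python sketch below. Keeping the original extension on the stored file is an assumption, and the `stored_name` helper and the record's field names are hypothetical rather than the actual database schema.

```python
import os
import uuid


def stored_name(original_filename: str) -> str:
    """Random UUID plus the lowercased original extension (assumed)."""
    ext = os.path.splitext(original_filename)[1].lower()
    return f"{uuid.uuid4()}{ext}"


# Hypothetical record: the original name is kept alongside the stored one.
record = {
    "original_name": "Q3 report.PDF",
    "stored_name": stored_name("Q3 report.PDF"),
}
print(record)
```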

Real-Time Processing Status

The document queue emits events for each processing stage:

Event	Description
`enqueued`	Document added to queue
`processing`	OCR/categorization started
`completed`	Pipeline finished successfully
`failed`	Processing failed with error

Events are forwarded to connected WebSocket clients (filtered by user ID), enabling real-time UI updates without polling.
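The per-user filtering can be pictured with the in-process Python sketch below, where callbacks stand in for WebSocket connections. The `DocumentEvents` class, its method names, and the message fields (`event`, `documentId`, `error`) are hypothetical; only the four event names come from the table above.

```python
from collections import defaultdict

EVENT_NAMES = {"enqueued", "processing", "completed", "failed"}


class DocumentEvents:
    """Route queue events only to the listeners of the owning user."""

    def __init__(self):
        self._listeners = defaultdict(list)  # user_id -> callbacks

    def subscribe(self, user_id, callback):
        self._listeners[user_id].append(callback)

    def emit(self, event, user_id, document_id, error=None):
        if event not in EVENT_NAMES:
            raise ValueError(f"unknown event: {event}")
        message = {"event": event, "documentId": document_id}
        if error is not None:
            message["error"] = error
        # .get() avoids creating an empty entry for users with no listeners.
        for callback in self._listeners.get(user_id, []):
            callback(message)


events = DocumentEvents()
alice_inbox, bob_inbox = [], []
events.subscribe("alice", alice_inbox.append)
events.subscribe("bob", bob_inbox.append)
events.emit("enqueued", "alice", "doc-1")
events.emit("failed", "bob", "doc-2", error="extraction error")
print(alice_inbox, bob_inbox)
```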

API Endpoints

Endpoint	Method	Description
`/api/documents/upload`	POST	Upload files (multipart, max 50MB default)
`/api/documents`	GET	List with `category`, `status`, `limit` filters
`/api/documents/:id`	GET	Full details: OCR text, summary, metadata
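As a usage illustration, the list endpoint's filters can be passed as query parameters. The Python sketch below builds such a request with the standard library; the base URL and bearer-token header follow the upload example on this page, while the `list_documents_request` and `list_documents` helpers are hypothetical and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:3005"


def list_documents_request(token, category=None, status=None, limit=None):
    """Build a GET /api/documents request, including only the set filters."""
    filters = (("category", category), ("status", status), ("limit", limit))
    params = {key: value for key, value in filters if value is not None}
    url = f"{BASE_URL}/api/documents"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})


def list_documents(token, **filters):
    """Send the request; requires the server to be running."""
    with urllib.request.urlopen(list_documents_request(token, **filters)) as resp:
        return json.load(resp)


request = list_documents_request("<token>", category="invoices", limit=10)
print(request.full_url)
```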

UI

WebUI (`/documents`)

Upload dialog — Drag-and-drop zone with file picker, multi-file support
Document list — Cards with category badge, status badge, size, date, summary preview
Queue banner — Shows processing progress (“2 queued, 1 processing”)
Detail modal — Full OCR text (scrollable), LLM summary, processing timestamps, metadata
Filters — Search by name, filter by category pills and status dropdown

Configuration

Setting	Default	Description
`workspace.documentsPath`	`./workspace/documents`	Base storage directory
`workspace.maxUploadSize`	`52428800` (50MB)	Maximum upload size in bytes
`workspace.ocrEndpoint`	`http://localhost:11435`	Ollama endpoint for OCR model
`workspace.ocrModel`	`glm-ocr`	Vision model used for image/PDF text extraction
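The defaults in the table can be mirrored in a small typed structure, sketched below in Python. The `WorkspaceConfig` class and `upload_allowed` helper are hypothetical, and treating the size limit as inclusive is an assumption. The sketch also shows that the default of 52428800 bytes is 50 × 1024 × 1024.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceConfig:
    """Defaults copied from the Configuration table."""
    documents_path: str = "./workspace/documents"
    max_upload_size: int = 50 * 1024 * 1024  # 52428800 bytes
    ocr_endpoint: str = "http://localhost:11435"
    ocr_model: str = "glm-ocr"


def upload_allowed(size_bytes: int, config: WorkspaceConfig) -> bool:
    """True when the file fits within the limit (inclusive, by assumption)."""
    return size_bytes <= config.max_upload_size


config = WorkspaceConfig()
print(config.max_upload_size, upload_allowed(10_000_000, config))
```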