Document Management

The document management system handles file uploads from any source, extracts text using the most efficient method per file type, categorizes and summarizes content using LLMs, and indexes everything into the Knowledge Base for RAG search.

Every uploaded document flows through an 8-step pipeline:

Upload → Queue → Classify File Type → Extract Text → Categorize → Move → Summarize → Index

| Step | Description |
| --- | --- |
| 1. Upload | File saved to workspace/documents/uncategorized/ with UUID filename |
| 2. Queue | Document enqueued for sequential processing |
| 3. Classify | Determine extraction strategy from MIME type and extension |
| 4. Extract text | Strategy-specific: direct read, structured parse, or OCR |
| 5. Categorize | LLM assigns a category (invoices, contracts, technical, etc.) |
| 6. Move | File moved to workspace/documents/{category}/ |
| 7. Summarize | LLM generates a concise summary |
| 8. Index | Extracted text indexed into knowledge base for RAG search |
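As a rough illustration of the fixed stage order, the pipeline can be modeled as an ordered list. This is a minimal sketch only: the stage identifiers and the `nextStage` helper are hypothetical names chosen for this example, not part of the documented system.

```typescript
// Stage names mirror the eight pipeline steps; the identifiers are illustrative.
const PIPELINE_STAGES = [
  "upload",
  "queue",
  "classify",
  "extract",
  "categorize",
  "move",
  "summarize",
  "index",
] as const;

type Stage = (typeof PIPELINE_STAGES)[number];

// Returns the stage that follows `current`, or null after the final stage.
function nextStage(current: Stage): Stage | null {
  const i = PIPELINE_STAGES.indexOf(current);
  return i < PIPELINE_STAGES.length - 1 ? PIPELINE_STAGES[i + 1] : null;
}
```

Keeping the order in one place means a document record only needs to store its current stage to know what runs next.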

Upload via the Documents page at /documents. Supports drag-and-drop and multi-file selection.

Not every file needs OCR. The processor classifies each file and picks the most efficient extraction method — Office documents and text files never touch the OCR model.

| Strategy | File Types | Method | Model Required |
| --- | --- | --- | --- |
| Text | Plain text, code, config, markup | Direct file read | No |
| Structured | Word, Excel, PowerPoint | Parse XML inside ZIP via jszip | No |
| OCR | Images, PDFs | Vision model (glm-ocr) | Yes |
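The strategy choice can be sketched as a lookup on the file extension with a MIME-type fallback. This is an assumption-laden sketch: the function name `classifyFile`, the abbreviated extension sets, and the rule that the extension is checked before the MIME type are all illustrative; the document only states that both MIME type and extension are used.

```typescript
type Strategy = "text" | "structured" | "ocr";

// Abbreviated extension sets based on the lists given in this document.
const TEXT_EXTENSIONS = new Set([".txt", ".md", ".csv", ".json", ".xml", ".yaml", ".yml", ".log"]);
const STRUCTURED_EXTENSIONS = new Set([".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"]);
const OCR_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".pdf"]);

// Returns the extraction strategy, or null when the file type is not recognized.
// Assumption: extension wins over MIME type; the real precedence is not documented.
function classifyFile(filename: string, mimeType?: string): Strategy | null {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";

  if (TEXT_EXTENSIONS.has(ext)) return "text";
  if (STRUCTURED_EXTENSIONS.has(ext)) return "structured";
  if (OCR_EXTENSIONS.has(ext)) return "ocr";

  // Fall back to the MIME type when the extension is missing or unknown.
  if (mimeType !== undefined) {
    if (mimeType.startsWith("image/") || mimeType === "application/pdf") return "ocr";
    if (mimeType.startsWith("text/")) return "text";
  }
  return null;
}
```

For example, `classifyFile("report.DOCX")` yields `"structured"`, so the file would never reach the OCR model.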

Files that are already human-readable are read directly with zero model calls.

Extensions: .txt, .md, .csv, .json, .xml, .yaml, .yml, .log, .ini, .conf, .toml, .env, .html, .htm, .css, .js, .ts, .py, .sh, .bash, .sql

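Modern Office formats (.docx, .xlsx, .pptx) are ZIP archives containing XML. The processor extracts text directly from the XML structure, which is fast and accurate and requires no LLM or OCR call. Legacy binary formats (.doc, .xls, .ppt) are OLE2 files rather than ZIP archives, so the processor falls back to extracting printable strings from them.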

| Format | Extensions | Extraction Method |
| --- | --- | --- |
| Word | .docx | Paragraphs from word/document.xml (<w:t> elements) |
| Word (legacy) | .doc | Printable string extraction from binary OLE2 |
| Excel | .xlsx | Shared strings + cell values, tab-separated rows per sheet |
| Excel (legacy) | .xls | Printable string extraction from binary OLE2 |
| PowerPoint | .pptx | Slide text from <a:t> elements, labeled per slide |
| PowerPoint (legacy) | .ppt | Printable string extraction from binary OLE2 |
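To make the Word case concrete, the sketch below pulls paragraph text out of a `word/document.xml` string by collecting `<w:t>` runs per `<w:p>` paragraph. It is a simplified illustration, not the processor's actual code: the function names are hypothetical, it uses regular expressions rather than a full XML parser, and it assumes the XML string has already been read from the archive (for instance with jszip).

```typescript
// Decodes the five predefined XML entities.
function decodeXmlEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" decodes to "&lt;" rather than "<"
}

// Returns one line of text per non-empty paragraph.
function extractDocxText(documentXml: string): string {
  // Matches <w:t> or <w:t attr="..."> but not other tags such as <w:tab/> or <w:tbl>.
  const textRun = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
  const paragraphs: string[] = [];

  // Splitting on the paragraph close tag groups the text runs by paragraph.
  for (const chunk of documentXml.split("</w:p>")) {
    let text = "";
    let match: RegExpExecArray | null;
    textRun.lastIndex = 0;
    while ((match = textRun.exec(chunk)) !== null) {
      text += decodeXmlEntities(match[1] ?? "");
    }
    if (text.length > 0) paragraphs.push(text);
  }
  return paragraphs.join("\n");
}
```

The same approach would apply to `<a:t>` elements in PowerPoint slides, with one XML part per slide.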

Example Excel output:

--- Sheet 1 ---
Name Department Salary
Alice Engineering 95000
Bob Marketing 82000
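The layout of that output can be reproduced with a small formatter. This is a sketch under assumptions: `formatSheet` is a hypothetical helper, and it takes rows that have already been resolved from shared strings and cell values.

```typescript
// Renders one sheet as a header line followed by tab-separated rows.
function formatSheet(
  sheetNumber: number,
  rows: ReadonlyArray<ReadonlyArray<string | number>>,
): string {
  const lines = rows.map((row) => row.map((cell) => String(cell)).join("\t"));
  return [`--- Sheet ${sheetNumber} ---`, ...lines].join("\n");
}
```

Calling `formatSheet(1, [["Name", "Department", "Salary"], ["Alice", "Engineering", 95000]])` produces the header line and two tab-separated rows.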

Only images and image-heavy PDFs are sent to the OCR model. For PDFs, the processor first attempts direct text extraction; if the ratio of printable characters in the extracted text is high enough, OCR is skipped entirely.
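The printable-character check can be sketched as follows. The threshold value of 0.85, the definition of "printable", and the names `printableRatio` and `needsOcr` are assumptions made for illustration; the document does not state the actual cutoff.

```typescript
// Fraction of characters that are ordinary whitespace or visible characters.
// Simplification: anything from U+0020 upward counts as visible, except DEL
// and the Unicode replacement character U+FFFD.
function printableRatio(text: string): number {
  if (text.length === 0) return 0;
  let printable = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isWhitespace = code === 9 || code === 10 || code === 13;
    const isVisible = code >= 32 && code !== 127 && code !== 0xfffd;
    if (isWhitespace || isVisible) printable++;
  }
  return printable / text.length;
}

// Assumed cutoff; the real value is not documented.
const OCR_SKIP_THRESHOLD = 0.85;

// True when directly extracted PDF text looks too garbled to trust.
function needsOcr(extractedText: string, threshold: number = OCR_SKIP_THRESHOLD): boolean {
  return printableRatio(extractedText) < threshold;
}
```

An empty extraction has a ratio of 0, so a scanned PDF with no text layer falls through to OCR.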

| Format | Extensions | MIME Types |
| --- | --- | --- |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .webp | image/png, image/jpeg, image/tiff, image/bmp, image/webp |
| PDF | .pdf | application/pdf |

The OCR model (glm-ocr via Ollama) handles:

  • Scanned documents and forms
  • Photos of receipts and invoices
  • Screenshots with text content
  • Handwritten notes (best effort)
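A request to the OCR model might be built as shown below. This sketch assumes the processor calls Ollama's `/api/generate` endpoint with a base64-encoded image; the prompt text and the `buildOcrRequest` helper are hypothetical, and the defaults mirror the documented `workspace.ocrEndpoint` and `workspace.ocrModel` settings.

```typescript
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data, without a data: URL prefix
  stream: boolean;
}

// Builds the URL and JSON body for a single-image OCR request.
function buildOcrRequest(
  imageBase64: string,
  endpoint: string = "http://localhost:11435",
  model: string = "glm-ocr",
): { url: string; body: OllamaGenerateRequest } {
  return {
    // Strip trailing slashes so the path is not doubled.
    url: `${endpoint.replace(/\/+$/, "")}/api/generate`,
    body: {
      model,
      prompt: "Extract all text from this image.", // assumed prompt
      images: [imageBase64],
      stream: false, // ask for one complete response instead of a token stream
    },
  };
}
```

The returned `body` would be sent as JSON in a POST request to `url`.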

Documents are categorized by LLM analysis into:

| Category | Examples |
| --- | --- |
| invoices | Bills, payment requests |
| contracts | Agreements, terms of service |
| reports | Business reports, analyses |
| correspondence | Emails, letters |
| technical | Manuals, specifications, API docs |
| receipts | Purchase confirmations |
| legal | Legal filings, compliance docs |
| financial | Statements, budgets |
| other | Uncategorizable content |
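Because LLM output is free text, the assigned label usually needs to be normalized against the known category list. The sketch below shows one way to do that; `normalizeCategory` and its fallback to `other` for anything unrecognized are assumptions, not documented behavior.

```typescript
const CATEGORIES = [
  "invoices",
  "contracts",
  "reports",
  "correspondence",
  "technical",
  "receipts",
  "legal",
  "financial",
  "other",
] as const;

type Category = (typeof CATEGORIES)[number];

// Maps raw model output to a known category, defaulting to "other".
function normalizeCategory(llmOutput: string): Category {
  // Lowercase and drop whitespace, punctuation, and anything else non-alphabetic.
  const cleaned = llmOutput.toLowerCase().replace(/[^a-z]/g, "");
  const match = CATEGORIES.find((category) => category === cleaned);
  return match ?? "other";
}
```

Restricting the result to a fixed union also guarantees that the target directory in the Move step is always one of the known folders.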

Documents are stored on the filesystem:

workspace/documents/
├── uncategorized/ # Newly uploaded, awaiting processing
├── invoices/
├── contracts/
├── reports/
├── technical/
└── ...

Each file uses a UUID filename to avoid collisions. The original filename is preserved in the database.
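The mapping from an upload to its stored path can be sketched like this. Several details are assumptions: the helper name `storedFilePath`, the choice to keep the original extension after the UUID, and the idea that the ID is generated elsewhere (for example with a UUID library) and passed in.

```typescript
// Builds the on-disk path for a document.
// Assumption: the stored name is "<uuid><original extension>".
function storedFilePath(
  baseDir: string,
  category: string,
  id: string,
  originalName: string,
): string {
  const dot = originalName.lastIndexOf(".");
  // dot > 0 skips dotfiles such as ".env", which have no separate extension.
  const ext = dot > 0 ? originalName.slice(dot).toLowerCase() : "";
  return `${baseDir}/${category}/${id}${ext}`;
}
```

A new upload would use the `uncategorized` folder, and the Move step would compute the path again with the assigned category while the ID stays the same.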

The document queue emits events for each processing stage:

| Event | Description |
| --- | --- |
| enqueued | Document added to queue |
| processing | OCR/categorization started |
| completed | Pipeline finished successfully |
| failed | Processing failed with error |

Events are forwarded to connected WebSocket clients (filtered by user ID), enabling real-time UI updates without polling.
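The per-user filtering can be illustrated with a small dispatcher. The event payload fields, the `Client` shape, and the `forwardEvent` function are hypothetical; a real server would pass its WebSocket connections instead of the plain objects used here.

```typescript
type QueueEventType = "enqueued" | "processing" | "completed" | "failed";

// Assumed payload shape; only the event names come from the documentation.
interface QueueEvent {
  type: QueueEventType;
  documentId: string;
  userId: string;
  error?: string;
}

// Minimal stand-in for a connected WebSocket client.
interface Client {
  userId: string;
  send: (payload: string) => void;
}

// Sends the event only to clients owned by the same user.
// Returns the number of clients that received it.
function forwardEvent(event: QueueEvent, clients: ReadonlyArray<Client>): number {
  const payload = JSON.stringify(event);
  let delivered = 0;
  for (const client of clients) {
    if (client.userId !== event.userId) continue;
    client.send(payload);
    delivered++;
  }
  return delivered;
}
```

Filtering on the server keeps one user's document activity from leaking to another user's browser session.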

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/documents/upload | POST | Upload files (multipart, max 50MB default) |
| /api/documents | GET | List with category, status, limit filters |
| /api/documents/:id | GET | Full details: OCR text, summary, metadata |
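A client could build the list request as sketched below. The query parameter names `category`, `status`, and `limit` are inferred from the filter description and should be treated as assumptions, as should the `buildListUrl` helper and the example base URL.

```typescript
interface ListFilters {
  category?: string;
  status?: string;
  limit?: number;
}

// Builds the URL for GET /api/documents with optional filters.
function buildListUrl(baseUrl: string, filters: ListFilters = {}): string {
  const params: string[] = [];
  if (filters.category !== undefined) {
    params.push(`category=${encodeURIComponent(filters.category)}`);
  }
  if (filters.status !== undefined) {
    params.push(`status=${encodeURIComponent(filters.status)}`);
  }
  if (filters.limit !== undefined) {
    params.push(`limit=${filters.limit}`);
  }
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  return `${baseUrl}/api/documents${query}`;
}
```

For instance, `buildListUrl("http://localhost:3000", { category: "invoices", limit: 10 })` gives a URL that can be passed to any HTTP client.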
The Documents page UI includes:

  • Upload dialog: drag-and-drop zone with file picker, multi-file support
  • Document list: cards with category badge, status badge, size, date, summary preview
  • Queue banner: shows processing progress (“2 queued, 1 processing”)
  • Detail modal: full OCR text (scrollable), LLM summary, processing timestamps, metadata
  • Filters: search by name, filter by category pills and status dropdown

| Setting | Default | Description |
| --- | --- | --- |
| workspace.documentsPath | ./workspace/documents | Base storage directory |
| workspace.maxUploadSize | 52428800 (50MB) | Maximum upload size in bytes |
| workspace.ocrEndpoint | http://localhost:11435 | Ollama endpoint for OCR model |
| workspace.ocrModel | glm-ocr | Vision model used for image/PDF text extraction |
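The defaults can be expressed as a typed object with overrides merged on top. This is a sketch: the `WorkspaceConfig` interface, `resolveConfig`, and `isUploadAllowed` are hypothetical names, and treating the size limit as inclusive is an assumption.

```typescript
interface WorkspaceConfig {
  documentsPath: string;
  maxUploadSize: number; // bytes
  ocrEndpoint: string;
  ocrModel: string;
}

// Default values as listed in the settings table.
const DEFAULT_WORKSPACE_CONFIG: WorkspaceConfig = {
  documentsPath: "./workspace/documents",
  maxUploadSize: 52428800, // 50 * 1024 * 1024
  ocrEndpoint: "http://localhost:11435",
  ocrModel: "glm-ocr",
};

// Merges user-supplied overrides over the defaults.
function resolveConfig(overrides: Partial<WorkspaceConfig> = {}): WorkspaceConfig {
  return { ...DEFAULT_WORKSPACE_CONFIG, ...overrides };
}

// Assumption: a file exactly at the limit is accepted; empty files are rejected.
function isUploadAllowed(sizeBytes: number, config: WorkspaceConfig): boolean {
  return sizeBytes > 0 && sizeBytes <= config.maxUploadSize;
}
```

Overriding a single key, such as `resolveConfig({ ocrModel: "another-model" })`, leaves the remaining defaults untouched.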