Knowledge Base (RAG)
The knowledge base provides Retrieval-Augmented Generation (RAG) capabilities. Text is chunked, embedded, and stored in PostgreSQL with pgvector. The system supports hybrid search combining BM25 full-text search with vector cosine similarity, merged via Reciprocal Rank Fusion.
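The Reciprocal Rank Fusion step can be sketched as a small pure function. This is an illustrative sketch, not the system's actual implementation; the constant `k=60` is the conventional RRF value and an assumption here, since the document does not state which constant is used.

```python
def rrf_merge(bm25_ids, vector_ids, k=60):
    """Merge two ranked lists of result IDs with Reciprocal Rank Fusion.

    Each result contributes 1 / (k + rank) per list it appears in; scores
    are summed across lists and the results re-sorted by total score.
    k=60 is the conventional constant (assumed, not documented here).
    """
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both BM25 and cosine similarity accumulates two reciprocal-rank scores and rises to the top, which is why hybrid mode gives the best overall recall.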
Architecture
| Component | Technology | Purpose |
|---|---|---|
| Embedding model | nomic-embed-text via Ollama/LiteLLM | Generate 768-dim embeddings |
| Vector storage | PostgreSQL + pgvector | HNSW cosine similarity index |
| Full-text search | PostgreSQL tsvector + GIN | BM25 ranking via ts_rank |
| Hybrid search | Reciprocal Rank Fusion (RRF) | Merge BM25 and cosine results |
| Chunk size | 1000 characters | With metadata (filePath, chunkIndex, language) |
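The chunking row above can be illustrated with a minimal sketch. The 1000-character size and the metadata fields (filePath, chunkIndex, language) come from the table; the boundary handling (fixed cuts, no overlap) and the function shape are assumptions.

```python
def chunk_text(text, file_path, language, chunk_size=1000):
    """Split text into fixed-size chunks carrying the metadata fields
    listed in the table (filePath, chunkIndex, language).
    Hard 1000-character cuts with no overlap are an assumption."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append({
            "content": text[i:i + chunk_size],
            "metadata": {
                "filePath": file_path,
                "chunkIndex": len(chunks),
                "language": language,
            },
        })
    return chunks
```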
Search Modes
Hybrid

Runs both BM25 full-text and vector cosine searches independently, then merges the ranked result lists using Reciprocal Rank Fusion. Best overall recall.

```sh
curl -X POST http://localhost:3005/api/knowledge/search \
  -H "Authorization: Bearer <token>" \
  -d '{"query": "authentication flow", "mode": "hybrid"}'
```

Semantic

Cosine similarity against the embedding column. Best for conceptual and semantic queries.

```sh
curl -X POST http://localhost:3005/api/knowledge/search \
  -d '{"query": "how does login work", "mode": "semantic"}'
```

Keyword

BM25 full-text search using PostgreSQL tsvector + GIN index. Fast for keyword-heavy queries.

```sh
curl -X POST http://localhost:3005/api/knowledge/search \
  -d '{"query": "OAuth PKCE token refresh", "mode": "keyword"}'
```

Tiered Content Loading
Each knowledge entry stores three levels of detail to optimize context usage:
| Tier | Field | Description | When Used |
|---|---|---|---|
| L0 | abstract | 2-3 sentence summary | Search result previews |
| L1 | overview | Key points overview | Agent context injection |
| L2 | content | Full text content | Explicit read via read_knowledge tool |
Search results return L0/L1 by default. Agents use read_knowledge to load the full L2 content when needed.
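The tier-to-field mapping above can be expressed as a small selection function. This is an illustrative sketch; the function name and return shape are hypothetical, only the tier/field mapping comes from the table.

```python
def entry_view(entry, tier="L1"):
    """Return the fields appropriate for a tier, per the table above:
    L0 -> abstract only; L1 adds the overview; L2 adds full content
    (as loaded by read_knowledge). The function itself is illustrative,
    not the system's actual API."""
    views = {
        "L0": ["abstract"],
        "L1": ["abstract", "overview"],
        "L2": ["abstract", "overview", "content"],
    }
    return {field: entry[field] for field in views[tier]}
```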
How Data Gets Indexed
Automatic Indexing
Agent outputs are automatically indexed after completion when the output exceeds 100 characters. Controlled by the RAG_AUTO_INDEX environment variable (default: true). Indexed with source type agent_output.
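The gating logic described above amounts to two checks. A minimal sketch, assuming the set of strings treated as "disabled" for RAG_AUTO_INDEX; only the 100-character threshold and the default of true come from the document.

```python
import os

def should_auto_index(output: str, env=os.environ) -> bool:
    """Decide whether an agent output gets auto-indexed: RAG_AUTO_INDEX
    must not be disabled (default: true) and the output must exceed
    100 characters. The falsy-string set is an assumption."""
    enabled = env.get("RAG_AUTO_INDEX", "true").lower() not in ("false", "0", "no")
    return enabled and len(output) > 100
```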
Agent Tools
Section titled “Agent Tools”Agents with the knowledge tool can:
- search_knowledge — Search with query, limit, source type filter, and mode selection
- read_knowledge — Load full L2 content for a specific entry
- index_file — Index a file into the knowledge base
- index_directory — Index all files in a directory (with optional glob patterns)
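The optional glob patterns on index_directory can be sketched as a filename filter. This is a guess at the behavior, not the actual implementation: matching each file's base name with fnmatch-style globs is an assumption.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

def select_files(paths, patterns=None):
    """Filter a directory listing to the files index_directory would
    index, given optional glob patterns; with no patterns, every file
    is kept. Matching on the base name via fnmatch is an assumption."""
    if not patterns:
        return list(paths)
    return [
        p for p in paths
        if any(fnmatch(PurePosixPath(p).name, pat) for pat in patterns)
    ]
```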
Document Pipeline
Uploaded documents are automatically indexed into the knowledge base after OCR text extraction and categorization. See the Document Management page.
MCP Tools
External models (Claude Code, Gemini CLI) can search and index via MCP:
- octipus_search_knowledge — Hybrid search
- octipus_index_file — Index a file
REST API
| Endpoint | Method | Description |
|---|---|---|
| /api/knowledge | GET | Browse entries with sourceType, limit, offset filters |
| /api/knowledge/stats | GET | Counts by source type |
| /api/knowledge/:id | GET | Full entry content |
| /api/knowledge/search | POST | Search with query, mode, limit, sourceType |
| /api/knowledge/:id | DELETE | Delete single entry |
| /api/knowledge/index | POST | Index a file or directory |
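A client-side sketch of building the search request body. The field names (query, mode, limit, sourceType) and the three mode values come from this page; defaulting to hybrid and the validation behavior are assumptions.

```python
def build_search_payload(query, mode="hybrid", limit=None, source_type=None):
    """Build the JSON body for POST /api/knowledge/search. Accepted
    modes come from the Search Modes section; the hybrid default and
    rejection of unknown modes are assumptions."""
    if mode not in ("hybrid", "semantic", "keyword"):
        raise ValueError(f"unknown mode: {mode}")
    payload = {"query": query, "mode": mode}
    if limit is not None:
        payload["limit"] = limit
    if source_type is not None:
        payload["sourceType"] = source_type
    return payload
```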
Roles with Knowledge Access
| Role | Access | Rationale |
|---|---|---|
| research | Yes | Primary knowledge consumer/producer |
| coding | Yes | Look up past solutions and patterns |
| review | Yes | Reference past decisions and standards |
| general | Yes | Broad access for general tasks |
| ai | Yes | RAG system builder |
| writing | Yes | Reference existing documentation |
| data | Yes | Look up schemas and patterns |
| security | Yes | Reference past audits and findings |
WebUI (/knowledge)
The knowledge page provides:
- Stats bar — Total entries, documents, code, and agent outputs
- Search — Large search input with mode toggle (Hybrid / Semantic / Keyword) and source type filter
- Browse — Paginated list of all entries when no search is active
- Entry cards — Abstract preview, source type badge, file path, date
- Detail modal — Full content, abstract, overview, metadata, delete and re-index buttons
- Index dialog — Manual file/directory indexing with path, source type, and glob pattern inputs
Database Schema
```sql
CREATE TABLE embeddings (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_type TEXT NOT NULL,       -- 'document', 'code', 'agent_output'
  source_id TEXT NOT NULL,         -- file path or agent ID
  content TEXT NOT NULL,           -- L2: full text chunk
  abstract TEXT,                   -- L0: 2-3 sentence summary
  overview TEXT,                   -- L1: key points overview
  embedding vector(768) NOT NULL,  -- nomic-embed-text dimension
  content_tsv TSVECTOR,            -- auto-populated, used for BM25
  model TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON embeddings USING gin (content_tsv);
```

Configuration
| Variable | Default | Description |
|---|---|---|
| RAG_AUTO_INDEX | true | Auto-index agent outputs on completion |
- PostgreSQL with the pgvector extension installed
- Pull the embedding model: ollama pull nomic-embed-text
- Register the model in LiteLLM config with topic embedding
- Run migrations for the embeddings table and hybrid search columns