Small / Local Models
Octipus is built to run fully self-hosted on local models (Ollama). This guide covers the realistic setup for a small machine — a single chat model around or below ~10B parameters — what works, what degrades, and how to configure it.
The realistic minimum: 1 chat model + 1 embedding model
Section titled “The realistic minimum: 1 chat model + 1 embedding model”“One model” can’t literally be one model. Embedding and vision are different model classes — a chat model cannot produce embeddings or read images. So the practical minimum is:
| Role | Model class | Example (Ollama) | Required? |
|---|---|---|---|
| All text work (chat, routing, specialists, memory) | chat / instruct | qwen2.5:7b, glm-4.x-flash, llama3.1:8b | Yes |
| RAG + long-term memory | embedding | nomic-embed-text | Strongly recommended |
| Documents / images (OCR, vision) | vision | llava, a -vl model | Optional |
Without an embedding model, RAG and long-term memory recall degrade (the knowledge-base readiness check returns 503). Without a vision model, document OCR / image features are unavailable. These are intended fail-loud boundaries, not bugs.
Pick a model that can actually call tools
Section titled “Pick a model that can actually call tools”The real bottleneck on small models is not prompt length — it’s reliable tool-call JSON. A model that can’t emit valid tool calls will fail at agent work even though everything else is configured correctly.
- Known-good local tool-callers:
qwen2.5:32b,glm-4.x-flash, and other proven instruct models. - Known-bad: the qwen3 family via Ollama emits malformed tool-call JSON and is blocked from orchestration automatically.
- Verify any model before relying on it:
POST /api/models/:name/check-capabilitiesruns a tool-calling + JSON conformance probe and returns acapable/incapableverdict.
Configuration
Section titled “Configuration”1. Bootstrap with a single model
Section titled “1. Bootstrap with a single model”Set the BOOTSTRAP_* env vars and Octipus seeds one model on first boot, bound to all text topics (not just general) so routing to any specialist works:
BOOTSTRAP_PROVIDER=ollamaBOOTSTRAP_MODEL=qwen2.5:7bBOOTSTRAP_BASE_URL=http://localhost:114342. Or adopt the single-model setup on an existing install
Section titled “2. Or adopt the single-model setup on an existing install”In the Models page, use the “Use for all topics” action on a model (the layers icon), or call the API directly:
curl -X POST http://localhost:3005/api/models/<name>/use-for-all-topicsThis binds the model to every text topic and makes it the default. The response lists embedding / ocr / vision as still unbound — add those separately.
3. Add an embedding model
Section titled “3. Add an embedding model”Register a second model (e.g. nomic-embed-text) and bind it to the embedding topic via the Models page.
Relevant settings
Section titled “Relevant settings”| Setting | Env var | Default | Purpose |
|---|---|---|---|
orchestrator.mode | ORCHESTRATOR_MODE | auto | auto derives the mode from model size; pin to router to force the small-model path. |
orchestrator.routerSmallModelMaxParams | ORCHESTRATOR_ROUTER_MAX_PARAMS | 10e9 | Below this the orchestrator runs in router mode (no orchestrator LLM) and workers run in the small tier. |
orchestrator.smallModelMaxTools | ORCHESTRATOR_SMALL_MODEL_MAX_TOOLS | 7 | Max tools handed to a small-tier worker — fewer tools, more reliable tool calls. |
How Octipus adapts to a small model
Section titled “How Octipus adapts to a small model”Orchestrator — chosen automatically from the default model’s size:
router(< 10B): no orchestrator LLM. A keyword classifier routes each message to one specialist, which does the work; the result is relayed. No parallel swarms, pipelines, or multi-step planning.lite(10–24B): a shrunken single-step orchestrator.full(≥ 24B): the complete swarm orchestrator.
Workers — when the bound model is small-tier, each worker automatically:
- caps its tool list to
smallModelMaxTools, - drops the heavy expert scaffold (deliverable template, success metrics) and uses compact response guidelines,
- injects the skill index instead of full skill bodies,
- skips the MCP meta-tool guidance.
Automated tasks request JSON mode (Ollama native format: json) so extraction / judgment / research return parseable output instead of prose.
What works vs. what degrades
Section titled “What works vs. what degrades”| Capability | On one small chat model |
|---|---|
| Casual chat, single-specialist routing | ✅ Works |
| Simple coding / edits, classification, short summaries | ✅ Works (with a reliable tool-caller) |
| Memory extraction, context compaction, email/doc summaries, email drafts | ⚠️ Usable, lower quality |
| RAG + long-term memory | Needs an embedding model |
| Document OCR / vision | Needs a vision model |
| Deep research synthesis, weekly knowledge review | ❌ Unreliable on small models |
| Parallel swarms / pipelines | ❌ Disabled in router mode by design |
Troubleshooting
Section titled “Troubleshooting”- “No model bound to topic X” — a worker topic is unbound. Use Use for all topics, or bind the topic in the Models page.
- Agent fails with malformed tool-call JSON — the model is a weak tool-caller. Run
check-capabilitiesand switch to a known-good model. - RAG / memory returns nothing — bind an
embeddingmodel. - Mode isn’t what you expect —
orchestrator.modeisauto; check the default model’s size, or pin the mode explicitly.