Model Management

Octipus’s model management system provides a unified interface for working with multiple LLM providers, with automatic routing, failover, cost tracking, and health monitoring.

Octipus ships native integrations that communicate directly with the provider API:

| Provider | Description |
| --- | --- |
| Ollama | Local LLM inference (auto-detected from OLLAMA_URL) |
| OpenAI | OpenAI API (API key stored in vault) |
| Anthropic | Anthropic API (API key stored in vault) |
| Gemini | Google Gemini API (API key stored in vault) |
| OpenRouter | Access 200+ models through a single API, with credit-based billing and automatic routing |

CLI providers run models through their vendor CLI as a subprocess. Octipus spawns the CLI, streams input and output, and surfaces the result as a normal agent turn.

| Provider | Description |
| --- | --- |
| Claude Code | Anthropic’s Claude via the Claude Code CLI |
| Gemini CLI | Google’s Gemini via the Gemini CLI |
| Codex CLI | OpenAI’s Codex via the Codex CLI |

CLI providers are auto-detected and registered when the binary is on PATH.

LiteLLM is an optional unified proxy that serves as a catch-all fallback. It can route to 100+ providers through a single OpenAI-compatible API.

LITELLM_URL=http://localhost:4000

The provider router tries models in priority order:

  1. CLI models (Claude Code, Gemini CLI, Codex) — billing and quota handled by each vendor
  2. Ollama — local models
  3. OpenAI — cloud API
  4. Anthropic — cloud API
  5. Gemini — cloud API
  6. Grok — cloud API
  7. DeepSeek — cloud API
  8. OpenRouter — multi-model proxy with credit tracking
  9. Voyage — embeddings only
  10. Custom OpenAI-compatible — self-hosted vLLM, Together, Fireworks, etc. (DB-configured per model)
  11. Custom Gemini-compatible — Gemini API-compatible upstreams (DB-configured per model)
  12. LiteLLM — catch-all proxy (optional, only if configured)
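The priority order above can be read as a first-match loop over an ordered list. The sketch below is a minimal illustration, not the actual Octipus router: the `ProviderEntry` shape, the `isAvailable` flag, and the `pickProvider` function are hypothetical names introduced here.

```typescript
// Hypothetical shape for illustration; not taken from the Octipus codebase.
interface ProviderEntry {
  name: string;
  priority: number; // lower number = tried first
  isAvailable: boolean; // assumed to mean: configured, healthy, within quota
}

// Return the first available provider in priority order, or undefined.
function pickProvider(providers: ProviderEntry[]): ProviderEntry | undefined {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...providers].sort((a, b) => a.priority - b.priority);
  for (const p of ordered) {
    if (p.isAvailable) return p;
  }
  return undefined;
}

const candidates: ProviderEntry[] = [
  { name: "litellm", priority: 12, isAvailable: true },
  { name: "claude-code", priority: 1, isAvailable: false }, // CLI binary not on PATH
  { name: "ollama", priority: 2, isAvailable: true },
  { name: "openai", priority: 3, isAvailable: true },
];

const chosen = pickProvider(candidates);
console.log(chosen ? chosen.name : "none"); // prints "ollama"
```

Here the CLI provider is skipped because it is unavailable, so the next entry in the order is used.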

Models must be explicitly assigned to topics. When an agent is spawned for a role, it uses the model assigned to that role’s topic.

| Concept | Description |
| --- | --- |
| Topic | A category like coding, research, general |
| Primary | The preferred model for a topic |
| Backup | The fallback model if the primary is unavailable |
| Unbound topic | If no model is assigned, the agent fails to spawn with a clear error message; there is no silent fallback to the default model |

Configure topic routing through the Models page in the web UI or the API. Every role/topic combination you use must have at least a primary model assigned.
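The primary/backup/unbound rules can be sketched as a small lookup function. This is an illustration under stated assumptions: `TopicBinding`, `resolveModel`, the availability callback, and the model IDs `model-a` through `model-c` are all hypothetical and do not come from the Octipus schema.

```typescript
// Hypothetical types; the real schema is not shown on this page.
interface TopicBinding {
  primary: string;
  backup?: string;
}

function resolveModel(
  topic: string,
  bindings: Record<string, TopicBinding | undefined>,
  isAvailable: (modelId: string) => boolean,
): string {
  const binding = bindings[topic];
  if (!binding) {
    // Unbound topic: fail loudly instead of falling back to the default model.
    throw new Error(`No model assigned to topic "${topic}"`);
  }
  if (isAvailable(binding.primary)) return binding.primary;
  if (binding.backup !== undefined && isAvailable(binding.backup)) {
    return binding.backup;
  }
  throw new Error(`Primary and backup unavailable for topic "${topic}"`);
}

const bindings: Record<string, TopicBinding | undefined> = {
  coding: { primary: "model-a", backup: "model-b" },
  research: { primary: "model-c" },
};

// Pretend model-a is down, so the backup is chosen for "coding".
const up = (id: string) => id !== "model-a";
console.log(resolveModel("coding", bindings, up)); // prints "model-b"
```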

Octipus discovers available models directly from each provider’s list endpoint. There is no static “recommended models” file in the repo — what you see in the picker is whatever the vendor returned on the most recent fetch, run through deterministic curation rules.

| Provider | Endpoint | Discovery file |
| --- | --- | --- |
| OpenAI | GET /v1/models | src/models/providers/discovery/openai.ts |
| Anthropic | GET /v1/models (with anthropic-version) | src/models/providers/discovery/anthropic.ts |
| Google Gemini | GET /v1beta/models | src/models/providers/discovery/gemini.ts |
| OpenRouter | GET /api/v1/models | src/models/providers/discovery/openrouter.ts |
| Ollama | GET /api/tags | src/models/providers/discovery/ollama.ts |

Each fresh fetch is filtered and sorted by src/models/providers/discovery/curation.ts:

  • Recency window — drop models older than ~18 months (when the API exposes a date)
  • Capability gate — drop models the API marks as non-tool-capable (OpenRouter); drop embedding/Whisper/TTS/DALL·E/Imagen/Veo/AQA by id pattern
  • Family deduplication — collapse dated snapshots (claude-sonnet-4-5-20250929) to their alias (claude-sonnet-4-5) when the alias is also returned
  • Preview filter — hide preview/experimental/dated unless ?preview=true is passed
  • Tier inference — group results into flagship / balanced / reasoning / cheap from id heuristics
  • Sort — flagship → balanced → reasoning → cheap, then createdAt desc
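Two of the rules above, family deduplication and the final sort, can be sketched as follows. This is not the content of curation.ts: the `DiscoveredModel` fields, the helper names, and the assumption that snapshots end in an eight-digit date suffix are all illustrative. The IDs `model-x-20250101` and `model-y` are placeholders.

```typescript
// Hypothetical record; field names are assumptions for this sketch.
type Tier = "flagship" | "balanced" | "reasoning" | "cheap";

interface DiscoveredModel {
  id: string;
  tier: Tier;
  createdAt: number; // Unix seconds
}

const TIER_ORDER: Tier[] = ["flagship", "balanced", "reasoning", "cheap"];

// Strip a trailing -YYYYMMDD suffix, e.g. "claude-sonnet-4-5-20250929".
function aliasOf(id: string): string {
  return id.replace(/-\d{8}$/, "");
}

// Drop a dated snapshot only when its alias is also present in the list.
function dedupeFamilies(models: DiscoveredModel[]): DiscoveredModel[] {
  return models.filter((m) => {
    const alias = aliasOf(m.id);
    return alias === m.id || !models.some((o) => o.id === alias);
  });
}

// Tier order first, then newest first within a tier.
function sortByTier(models: DiscoveredModel[]): DiscoveredModel[] {
  return [...models].sort(
    (a, b) =>
      TIER_ORDER.indexOf(a.tier) - TIER_ORDER.indexOf(b.tier) ||
      b.createdAt - a.createdAt,
  );
}

const sample: DiscoveredModel[] = [
  { id: "claude-sonnet-4-5-20250929", tier: "balanced", createdAt: 100 },
  { id: "claude-sonnet-4-5", tier: "balanced", createdAt: 100 },
  { id: "model-x-20250101", tier: "cheap", createdAt: 50 }, // no alias returned, so kept
  { id: "model-y", tier: "flagship", createdAt: 10 },
];

console.log(sortByTier(dedupeFamilies(sample)).map((m) => m.id));
```

The quadratic lookup in `dedupeFamilies` keeps the sketch short; a set of IDs would be the usual choice for long lists.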

Curated results are cached for 6 hours in Valkey. If a refresh fails, the stale copy continues to be served (stale-while-revalidate). Force a refresh with ?refresh=true.
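The caching behavior can be pictured with a small in-memory stand-in. This sketch assumes a synchronous fetch function and a single cached entry to stay short; the real cache lives in Valkey and the fetches are network calls. `StaleOnErrorCache` and its method names are hypothetical.

```typescript
interface CacheEntry<T> {
  value: T;
  fetchedAt: number; // ms timestamp
}

// In-memory stand-in for the Valkey cache; names are illustrative only.
class StaleOnErrorCache<T> {
  private entry: CacheEntry<T> | undefined;
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(fetchFresh: () => T, now: number, forceRefresh = false): T {
    const isFresh =
      this.entry !== undefined && now - this.entry.fetchedAt < this.ttlMs;
    if (this.entry !== undefined && isFresh && !forceRefresh) {
      return this.entry.value; // within TTL: no upstream call
    }
    try {
      const value = fetchFresh();
      this.entry = { value, fetchedAt: now };
      return value;
    } catch (err) {
      // Upstream failed: serve the stale copy if there is one.
      if (this.entry !== undefined) return this.entry.value;
      throw err;
    }
  }
}

const SIX_HOURS_MS = 6 * 60 * 60 * 1000;
const cache = new StaleOnErrorCache<string[]>(SIX_HOURS_MS);
console.log(cache.get(() => ["model-a"], 0)); // first call fetches
```

Passing `forceRefresh = true` corresponds to the ?refresh=true query parameter in this model.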

For endpoints that aren’t backed by a first-party provider class, Octipus offers two custom-provider flavors — pick the one that matches the upstream wire format:

| Flavor | provider value | Wire format | Use for |
| --- | --- | --- | --- |
| Custom OpenAI-compatible | custom-openai | OpenAI /v1/chat/completions | vLLM, Together, Groq, Fireworks, DeepInfra, internal OpenAI-shaped proxies |
| Custom Gemini-compatible | custom-gemini | Native Google Gemini (candidates[].content.parts[]) | Vertex AI, Google AI Studio (native), Gemini-fronting proxies |

Configuration is per model, not per provider — the endpoint URL and key reference live on each model row, so you can register several different upstreams side by side. Each model carries its own apiKeyRef (a vault entry name, or an env:VAR_NAME reference), so there is no single shared key.

  1. Secrets page → add a vault entry (e.g. together_api_key) with the upstream’s bearer token. (Or skip this and reference an env var directly with apiKeyRef: 'env:TOGETHER_API_KEY'.)
  2. Models page → Add Model with:
    • Provider: custom-openai or custom-gemini
    • Endpoint URL: the base URL (no trailing slash) — include /v1 if an OpenAI-compatible upstream expects it (e.g. https://api.together.xyz/v1, http://my-vllm:8000/v1)
    • Model ID: the model name the upstream uses (e.g. meta-llama/Llama-3.3-70B-Instruct-Turbo)
    • API key reference: the vault entry name or env:VAR_NAME for this model
    • Auth scheme: bearer (default), header (custom header name), or query (query param)

The Test button validates connectivity against the configured endpoint and auth scheme.

See Custom Providers for the full schema — auth schemes, apiKeyRef resolution order, Gemini request envelopes, and tool-calling support.
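As a rough picture of how a key reference and an auth scheme might turn into request credentials, consider the sketch below. The Custom Providers page remains the authoritative source for resolution order and schema; `resolveApiKey`, `applyAuth`, the `AuthScheme` shape, and modelling the vault as a plain object are all assumptions made for illustration.

```typescript
// Hypothetical helpers; the vault is modelled as a plain object.
const ENV_PREFIX = "env:";

function resolveApiKey(
  apiKeyRef: string,
  vault: Record<string, string>,
  env: Record<string, string | undefined>,
): string {
  if (apiKeyRef.slice(0, ENV_PREFIX.length) === ENV_PREFIX) {
    const name = apiKeyRef.slice(ENV_PREFIX.length);
    const value = env[name];
    if (!value) throw new Error(`Environment variable ${name} is not set`);
    return value;
  }
  const secret = vault[apiKeyRef];
  if (!secret) throw new Error(`Vault entry "${apiKeyRef}" not found`);
  return secret;
}

// The three schemes named above; header and param names are caller-supplied.
type AuthScheme =
  | { kind: "bearer" }
  | { kind: "header"; headerName: string }
  | { kind: "query"; paramName: string };

interface AuthParts {
  headers: Record<string, string>;
  query: Record<string, string>;
}

function applyAuth(scheme: AuthScheme, key: string): AuthParts {
  switch (scheme.kind) {
    case "bearer":
      return { headers: { Authorization: `Bearer ${key}` }, query: {} };
    case "header":
      return { headers: { [scheme.headerName]: key }, query: {} };
    case "query":
      return { headers: {}, query: { [scheme.paramName]: key } };
  }
}

const key = resolveApiKey("env:TOGETHER_API_KEY", {}, { TOGETHER_API_KEY: "example-token" });
console.log(applyAuth({ kind: "bearer" }, key).headers);
```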

All model configurations are stored in the database (not environment variables):

  • Default model: One model is marked as the default for unrouted messages
  • Per-model settings: Enable/disable, provider, topic roles, custom parameters
  • Extra body parameters: Per-model custom parameters via metadata.extraBody (e.g., { think: false } for Qwen3)
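One plausible way for metadata.extraBody to reach the provider is to merge it into the outgoing request body. The sketch below assumes that behavior; `ModelRow`, `ChatMessage`, `buildRequestBody`, and the model ID `qwen3` are hypothetical, and whether extra parameters may override core fields is a guess made explicit in the code.

```typescript
// Hypothetical shapes; only metadata.extraBody is named in the text above.
interface ModelRow {
  modelId: string;
  metadata?: { extraBody?: Record<string, unknown> };
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildRequestBody(
  model: ModelRow,
  messages: ChatMessage[],
): Record<string, unknown> {
  const extra = model.metadata?.extraBody ?? {};
  // Assumption: extra parameters sit alongside the core fields and
  // never replace them, so they are spread first.
  return { ...extra, model: model.modelId, messages };
}

const qwen: ModelRow = {
  modelId: "qwen3", // placeholder ID
  metadata: { extraBody: { think: false } },
};

console.log(buildRequestBody(qwen, [{ role: "user", content: "hi" }]));
```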

Before registering a model, you can validate connectivity:

POST /api/models/test

This endpoint checks LiteLLM first, then direct Ollama, and supports namespaced model IDs.

The system tracks per-model token costs:

  • Input tokens: Tokens sent to the model
  • Output tokens: Tokens generated by the model
  • Cost calculation: Based on per-model pricing configuration
  • Aggregation: Costs aggregated by model, time period, and user
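The cost calculation can be illustrated with a short worked example. The pricing unit (USD per million tokens), the `Pricing` and `Usage` shapes, and the function names are assumptions for this sketch, not the stored configuration format.

```typescript
// Hypothetical pricing shape: USD per one million tokens.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Usage {
  model: string;
  user: string;
  inputTokens: number;
  outputTokens: number;
}

function costOf(u: Usage, p: Pricing): number {
  return (
    (u.inputTokens * p.inputPerMTok + u.outputTokens * p.outputPerMTok) /
    1_000_000
  );
}

// Aggregate cost per model; the same pattern works for user or time bucket.
function totalByModel(
  usages: Usage[],
  pricing: Record<string, Pricing | undefined>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const u of usages) {
    const p = pricing[u.model];
    if (!p) continue; // no pricing configured: skip rather than guess
    totals[u.model] = (totals[u.model] ?? 0) + costOf(u, p);
  }
  return totals;
}

const pricing = { "model-a": { inputPerMTok: 2, outputPerMTok: 10 } };
const usage: Usage[] = [
  { model: "model-a", user: "alice", inputTokens: 1_000_000, outputTokens: 500_000 },
];
console.log(totalByModel(usage, pricing)); // { 'model-a': 7 }
```

With these example prices, one million input tokens cost 2 and half a million output tokens cost 5, giving 7 in total.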

Valkey-backed daily usage tracking prevents exceeding provider limits:

  • Daily quotas: Track usage per model per day
  • Auto-clearing: Quotas reset automatically at the start of each day
  • Exhaustion detection: When a model’s quota is exhausted, the router automatically falls back to the next available model
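A date-stamped counter key is one common way to get daily quotas that reset on their own. The sketch below uses an in-memory object in place of Valkey, and the key layout, the UTC day boundary, and the `DailyQuota` class are assumptions rather than the actual implementation.

```typescript
// In-memory stand-in for the Valkey counters; key layout is an assumption.
class DailyQuota {
  private counts: Record<string, number> = {};
  private readonly limits: Record<string, number>;

  constructor(limits: Record<string, number>) {
    this.limits = limits;
  }

  // The UTC date is part of the key, so a new day starts a fresh counter.
  private key(model: string, now: Date): string {
    return `${model}:${now.toISOString().slice(0, 10)}`;
  }

  record(model: string, now: Date): void {
    const k = this.key(model, now);
    this.counts[k] = (this.counts[k] ?? 0) + 1;
  }

  isExhausted(model: string, now: Date): boolean {
    const limit = this.limits[model];
    if (limit === undefined) return false; // no limit configured
    return (this.counts[this.key(model, now)] ?? 0) >= limit;
  }
}

const quota = new DailyQuota({ "model-a": 2 });
const day1 = new Date("2025-01-01T12:00:00Z");
quota.record("model-a", day1);
quota.record("model-a", day1);
console.log(quota.isExhausted("model-a", day1)); // prints true
```

A router could call `isExhausted` before choosing a model and move on to the next candidate when it returns true.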

Octipus runs periodic health checks against all configured providers:

  • Latency measurement: Tracks response times for each provider
  • Availability status: Marks providers as healthy or unhealthy
  • Auto-recovery: Unhealthy providers are re-checked periodically and restored when available

Access health status via:

GET /api/models/health

Each provider has independent rate limiting with adaptive concurrency and circuit breaker protection. This prevents overloading providers and handles transient failures gracefully.

Rate limiting is configured per-provider for:

  • Ollama — local inference (concurrency limited by GPU memory)
  • OpenAI — cloud API (tokens-per-minute and requests-per-minute)
  • Anthropic — cloud API (requests-per-minute)
  • Gemini — cloud API (requests-per-minute)
  • DeepSeek — cloud API (requests-per-minute)
  • OpenRouter — multi-model proxy (credit-based limits)
  • LiteLLM — proxy (inherits downstream limits)

The rate limiter dynamically adjusts the number of concurrent requests based on provider response times:

  • Scale up: When responses are fast, concurrency increases to maximize throughput
  • Scale down: When latency rises or errors occur, concurrency decreases to reduce pressure
  • Per-provider: Each provider has its own concurrency window
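The scale-up and scale-down behavior resembles additive-increase/multiplicative-decrease (AIMD), a common control scheme. The sketch below uses AIMD as a stand-in; the source does not say which algorithm Octipus uses, and the `AdaptiveLimit` class, its thresholds, and its step sizes are hypothetical.

```typescript
// AIMD stand-in: +1 on a fast success, halve on a slow or failed request.
class AdaptiveLimit {
  private limit: number;
  private readonly min: number;
  private readonly max: number;
  private readonly targetLatencyMs: number;

  constructor(initial: number, min: number, max: number, targetLatencyMs: number) {
    this.limit = initial;
    this.min = min;
    this.max = max;
    this.targetLatencyMs = targetLatencyMs;
  }

  current(): number {
    return this.limit;
  }

  onResult(latencyMs: number, ok: boolean): void {
    if (!ok || latencyMs > this.targetLatencyMs) {
      // Scale down quickly to relieve pressure on the provider.
      this.limit = Math.max(this.min, Math.floor(this.limit / 2));
    } else {
      // Scale up slowly while responses stay fast.
      this.limit = Math.min(this.max, this.limit + 1);
    }
  }
}

// One instance per provider gives each its own concurrency window.
const openaiLimit = new AdaptiveLimit(4, 1, 8, 500);
openaiLimit.onResult(120, true);
console.log(openaiLimit.current()); // prints 5
```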

When a provider experiences repeated failures, the circuit breaker trips to prevent cascading issues:

| State | Behavior |
| --- | --- |
| Closed | Normal operation: requests flow through |
| Open | Provider is failing: requests are immediately rejected and routed to fallback |
| Half-Open | After a cooldown period, a single test request is sent to check recovery |

The circuit breaker transitions back to Closed once the provider responds successfully in the half-open state. This integrates with the provider router’s failover logic to automatically route requests to healthy providers.
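The three states can be sketched as a small state machine. This is a generic circuit breaker under stated assumptions, not the Octipus implementation: the failure threshold, the cooldown, the explicit `now` parameter (used here to keep the example deterministic), and the `CircuitBreaker` class itself are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Returns true if a request may be sent at time `now` (ms).
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // let exactly one probe through
      return true;
    }
    return this.state === "closed";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(now: number): void {
    this.failures += 1;
    // A failed probe, or too many failures in a row, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}

const breaker = new CircuitBreaker(2, 1000);
breaker.onFailure(0);
breaker.onFailure(10);
console.log(breaker.getState()); // prints "open"
console.log(breaker.allow(500)); // prints false (still cooling down)
```

When `allow` returns false, the caller would hand the request to the next provider in the failover order.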

LITELLM_URL=http://localhost:4000 # LiteLLM proxy (optional)
OLLAMA_URL=http://localhost:11434 # Local Ollama (optional)
OPENROUTER_API_KEY=sk-or-... # OpenRouter API key (optional)

OpenRouter provides access to 200+ models from multiple providers through a single API key:

  1. Create an account at openrouter.ai and add credits
  2. Store your API key in the vault as openrouter_api_key or set OPENROUTER_API_KEY
  3. Register models with provider: 'openrouter' and modelId in provider/model format (e.g., minimax/minimax-01, nvidia/llama-3.1-nemotron-ultra-253b-v1)
  4. Use the OpenRouter model search in the web UI to browse and register available models

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/models | List all models |
| POST | /api/models | Register a new model |
| POST | /api/models/test | Test model connectivity |
| POST | /api/models/:id/default | Set as default model |
| GET | /api/models/routing | View topic routing |
| GET | /api/models/health | Provider health status |
| GET | /api/models/cli/status | CLI tool availability |
| GET | /api/models/cli/quota | CLI quota status |