
Model Management

Octipus’s model management system provides a unified interface for working with multiple LLM providers, with automatic routing, failover, cost tracking, and health monitoring.

Native integrations communicate directly with each provider's API:

| Provider | Description |
| --- | --- |
| Ollama | Local LLM inference — auto-detected from OLLAMA_URL |
| OpenAI | OpenAI API — API key stored in vault |
| Anthropic | Anthropic API — API key stored in vault |
| Gemini | Google Gemini API — API key stored in vault |
| OpenRouter | Access 200+ models through a single API — credit-based billing, automatic routing |

Free subscription-based models via CLI tools:

| Provider | Description |
| --- | --- |
| Claude Code | Anthropic's Claude via the Claude Code CLI |
| Gemini CLI | Google's Gemini via the Gemini CLI |
| Codex CLI | OpenAI's Codex via the Codex CLI |

CLI providers are automatically detected and registered when available on the system.

LiteLLM is an optional unified proxy that serves as a catch-all fallback; it can route to 100+ providers through a single OpenAI-compatible API.

LITELLM_URL=http://localhost:4000

The provider router tries models in priority order:

  1. CLI models — free subscription-based (Claude Code, Gemini CLI, Codex)
  2. Ollama — local models
  3. OpenAI — cloud API
  4. Anthropic — cloud API
  5. Gemini — cloud API
  6. OpenRouter — multi-model proxy with credit tracking
  7. LiteLLM — catch-all proxy
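
For illustration, a priority-ordered failover loop could look like the sketch below; the `complete` and `isAvailable` helpers are assumptions, not the actual router code.

```ts
// Hypothetical sketch of priority-ordered failover across providers.
type Provider =
  | 'cli' | 'ollama' | 'openai' | 'anthropic' | 'gemini' | 'openrouter' | 'litellm';

const PRIORITY: Provider[] = ['cli', 'ollama', 'openai', 'anthropic', 'gemini', 'openrouter', 'litellm'];

async function routeCompletion(
  prompt: string,
  complete: (provider: Provider, prompt: string) => Promise<string>, // provider-specific call (assumed)
  isAvailable: (provider: Provider) => boolean,                      // health/quota check (assumed)
): Promise<string> {
  for (const provider of PRIORITY) {
    if (!isAvailable(provider)) continue;      // skip unhealthy or quota-exhausted providers
    try {
      return await complete(provider, prompt); // first successful provider wins
    } catch {
      // fall through to the next provider in priority order
    }
  }
  throw new Error('No provider available for this request');
}
```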

Models can be assigned to topics with primary and backup roles for automatic failover:

| Concept | Description |
| --- | --- |
| Topic | A category like coding, analysis, general |
| Primary | The preferred model for a topic |
| Backup | The fallback model if the primary is unavailable |

Configure topic routing through the Models page in the web UI or the API.
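
As a rough illustration of primary/backup resolution for a topic (the record shape here is an assumption, not the actual schema):

```ts
// Illustrative topic-routing lookup; field names are assumptions, not the real schema.
interface TopicRoute {
  topic: string;       // e.g. 'coding', 'analysis', 'general'
  primary: string;     // preferred model id
  backup?: string;     // fallback model id
}

function pickModel(
  routes: TopicRoute[],
  topic: string,
  isHealthy: (modelId: string) => boolean,
): string | undefined {
  const route = routes.find((r) => r.topic === topic);
  if (!route) return undefined;                       // no rule: caller falls back to the default model
  if (isHealthy(route.primary)) return route.primary; // use the primary when it is available
  if (route.backup && isHealthy(route.backup)) return route.backup; // otherwise fail over to the backup
  return undefined;
}
```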

All model configurations are stored in the database (not environment variables):

  • Default model: One model is marked as the default for unrouted messages
  • Per-model settings: Enable/disable, provider, topic roles, custom parameters
  • Extra body parameters: Per-model custom parameters via metadata.extraBody (e.g., { think: false } for Qwen3)
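
For instance, a model record with per-model settings and an extra-body override might look roughly like this; apart from metadata.extraBody, the field names are assumptions:

```ts
// Hypothetical model configuration record; only metadata.extraBody is documented above.
const qwenModel = {
  modelId: 'qwen3',               // assumed identifier
  provider: 'ollama',
  enabled: true,
  isDefault: false,
  topics: { coding: 'primary' },  // assumed topic-role shape
  metadata: {
    extraBody: { think: false },  // merged into the provider request body (e.g. disables Qwen3 "thinking")
  },
};
```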

Before registering a model, you can validate connectivity:

POST /api/models/test

This endpoint checks LiteLLM first, then direct Ollama, and supports namespaced model IDs.
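
A request might look like the sketch below; the base URL and body fields are assumptions, since only the endpoint itself is documented:

```ts
// Hypothetical connectivity test; base URL and request body fields are assumed.
const res = await fetch('http://localhost:3000/api/models/test', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ provider: 'ollama', modelId: 'ollama/qwen3' }), // namespaced model id
});
console.log(res.ok ? 'model reachable' : `test failed: ${res.status}`);
```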

The system tracks per-model token costs:

  • Input tokens: Tokens sent to the model
  • Output tokens: Tokens generated by the model
  • Cost calculation: Based on per-model pricing configuration
  • Aggregation: Costs aggregated by model, time period, and user
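
Conceptually, the cost of a single call is the token counts multiplied by the per-model prices; the pricing fields and per-million convention below are illustrative:

```ts
// Illustrative cost calculation from per-model pricing (prices per 1M tokens, assumed convention).
interface ModelPricing {
  inputPerMillion: number;   // USD per 1M input tokens
  outputPerMillion: number;  // USD per 1M output tokens
}

function callCost(pricing: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * pricing.inputPerMillion +
         (outputTokens / 1_000_000) * pricing.outputPerMillion;
}

// Example: 1,200 input + 300 output tokens at $3 / $15 per 1M tokens ≈ $0.0081
console.log(callCost({ inputPerMillion: 3, outputPerMillion: 15 }, 1200, 300));
```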

Redis-backed daily usage tracking prevents exceeding provider limits:

  • Daily quotas: Track usage per model per day
  • Auto-clearing: Quotas reset automatically at the start of each day
  • Exhaustion detection: When a model’s quota is exhausted, the router automatically falls back to the next available model
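
A minimal sketch of that pattern with Redis, assuming an ioredis client and a per-model daily counter key (the key layout and limit are illustrative):

```ts
import Redis from 'ioredis';

const redis = new Redis(); // assumes a local Redis instance

// Increment a per-model daily counter and report whether the (illustrative) limit is still respected.
async function consumeQuota(modelId: string, dailyLimit: number): Promise<boolean> {
  const day = new Date().toISOString().slice(0, 10);  // e.g. "2024-05-01" — new key each day
  const key = `quota:${modelId}:${day}`;              // assumed key layout
  const used = await redis.incr(key);
  if (used === 1) {
    await redis.expire(key, 60 * 60 * 24);            // stale counters clean themselves up
  }
  return used <= dailyLimit;                          // false => router should fall back to the next model
}
```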

Periodic health monitoring for all configured providers:

  • Latency measurement: Tracks response times for each provider
  • Availability status: Marks providers as healthy or unhealthy
  • Auto-recovery: Unhealthy providers are re-checked periodically and restored when available
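
A simplified version of such a check might measure latency and flip a healthy flag; the probe calls and interval below are assumptions:

```ts
// Simplified periodic health probe; probe functions and interval are illustrative.
interface ProviderHealth {
  healthy: boolean;
  latencyMs?: number;
  lastChecked: number;
}

const probes: Record<string, () => Promise<void>> = {
  // per-provider probes registered elsewhere, e.g. a cheap "list models" request
};
const health = new Map<string, ProviderHealth>();

async function checkProvider(name: string, probe: () => Promise<void>): Promise<void> {
  const start = Date.now();
  try {
    await probe();
    health.set(name, { healthy: true, latencyMs: Date.now() - start, lastChecked: Date.now() });
  } catch {
    health.set(name, { healthy: false, lastChecked: Date.now() });
  }
}

// Re-check every provider on an interval so unhealthy ones are restored once they recover.
setInterval(() => {
  for (const [name, probe] of Object.entries(probes)) void checkProvider(name, probe);
}, 60_000);
```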

Access health status via:

GET /api/models/health

Each provider has independent rate limiting with adaptive concurrency and circuit breaker protection. This prevents overloading providers and handles transient failures gracefully.

Rate limiting is configured per-provider for:

  • Ollama — local inference (concurrency limited by GPU memory)
  • OpenAI — cloud API (tokens-per-minute and requests-per-minute)
  • Anthropic — cloud API (requests-per-minute)
  • Gemini — cloud API (requests-per-minute)
  • DeepSeek — cloud API (requests-per-minute)
  • OpenRouter — multi-model proxy (credit-based limits)
  • LiteLLM — proxy (inherits downstream limits)

The rate limiter dynamically adjusts the number of concurrent requests based on provider response times:

  • Scale up: When responses are fast, concurrency increases to maximize throughput
  • Scale down: When latency rises or errors occur, concurrency decreases to reduce pressure
  • Per-provider: Each provider has its own concurrency window
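
One common way to implement this is an additive-increase/multiplicative-decrease controller on a per-provider concurrency limit; the sketch below is illustrative, with placeholder thresholds, not the actual limiter:

```ts
// Illustrative AIMD concurrency controller for one provider; thresholds are assumptions.
class AdaptiveConcurrency {
  constructor(
    private limit = 4,
    private readonly min = 1,
    private readonly max = 32,
    private readonly slowMs = 2000,
  ) {}

  get current(): number {
    return this.limit;
  }

  // Call after each completed request with its latency and outcome.
  record(latencyMs: number, ok: boolean): void {
    if (ok && latencyMs < this.slowMs) {
      this.limit = Math.min(this.max, this.limit + 1);             // fast success: scale up gradually
    } else {
      this.limit = Math.max(this.min, Math.floor(this.limit / 2)); // error or slow response: back off sharply
    }
  }
}
```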

When a provider experiences repeated failures, the circuit breaker trips to prevent cascading issues:

| State | Behavior |
| --- | --- |
| Closed | Normal operation — requests flow through |
| Open | Provider is failing — requests are immediately rejected and routed to fallback |
| Half-Open | After a cooldown period, a single test request is sent to check recovery |

The circuit breaker transitions back to Closed once the provider responds successfully in the half-open state. This integrates with the provider router’s failover logic to automatically route requests to healthy providers.
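
A minimal breaker with those three states could look like this sketch; the failure threshold and cooldown are placeholder values:

```ts
// Minimal circuit-breaker sketch; threshold and cooldown values are illustrative.
type State = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: State = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(private readonly threshold = 5, private readonly cooldownMs = 30_000) {}

  canRequest(): boolean {
    if (this.state === 'open' && Date.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open';      // allow a single probe request after the cooldown
    }
    return this.state !== 'open';
  }

  onSuccess(): void {
    this.state = 'closed';           // provider recovered: resume normal operation
    this.failures = 0;
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';           // trip the breaker: reject requests and route to fallbacks
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }
}
```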

Provider connection settings are supplied via environment variables:

LITELLM_URL=http://localhost:4000 # LiteLLM proxy (optional)
OLLAMA_URL=http://localhost:11434 # Local Ollama (optional)
OPENROUTER_API_KEY=sk-or-... # OpenRouter API key (optional)

OpenRouter provides access to 200+ models from multiple providers through a single API key:

  1. Create an account at openrouter.ai and add credits
  2. Store your API key in the vault as openrouter_api_key or set OPENROUTER_API_KEY
  3. Register models with provider: 'openrouter' and modelId in provider/model format (e.g., minimax/minimax-01, nvidia/llama-3.1-nemotron-ultra-253b-v1)
  4. Use the OpenRouter model search in the web UI to browse and register available models
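
Registering an OpenRouter model through the API might look like the following sketch; the base URL and any body fields beyond provider and modelId are assumptions:

```ts
// Hypothetical registration call; fields other than provider and modelId are assumed.
await fetch('http://localhost:3000/api/models', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    provider: 'openrouter',
    modelId: 'minimax/minimax-01', // OpenRouter ids use the provider/model format
    enabled: true,
  }),
});
```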
The model-management API exposes the following endpoints:

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/models | List all models |
| POST | /api/models | Register a new model |
| POST | /api/models/test | Test model connectivity |
| POST | /api/models/:id/default | Set as default model |
| GET | /api/models/routing | View topic routing |
| GET | /api/models/health | Provider health status |
| GET | /api/models/cli/status | CLI tool availability |
| GET | /api/models/cli/quota | CLI quota status |
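
For example, listing registered models and checking CLI tool availability (base URL assumed, response shapes not shown here):

```ts
// Assumes the service is reachable at localhost:3000.
const models = await fetch('http://localhost:3000/api/models').then((r) => r.json());
const cliStatus = await fetch('http://localhost:3000/api/models/cli/status').then((r) => r.json());
console.log(models, cliStatus);
```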