
Model Management

Octipus’s model management system provides a unified interface for working with multiple LLM providers, with automatic routing, failover, cost tracking, and health monitoring.

Native integrations communicate directly with each provider's API:

| Provider | Description |
| --- | --- |
| Ollama | Local LLM inference — auto-detected from OLLAMA_URL |
| OpenAI | OpenAI API — API key stored in vault |
| Anthropic | Anthropic API — API key stored in vault |
| Gemini | Google Gemini API — API key stored in vault |
| OpenRouter | Access 200+ models through a single API — credit-based billing, automatic routing |

Free subscription-based models via CLI tools:

| Provider | Description |
| --- | --- |
| Claude Code | Anthropic's Claude via the Claude Code CLI |
| Gemini CLI | Google's Gemini via the Gemini CLI |
| Codex CLI | OpenAI's Codex via the Codex CLI |

CLI providers are automatically detected and registered when available on the system.

LiteLLM is an optional unified proxy that serves as a catch-all fallback; it can route to 100+ providers through a single OpenAI-compatible API.

LITELLM_URL=http://localhost:4000

The provider router tries models in priority order:

  1. CLI models — free subscription-based (Claude Code, Gemini CLI, Codex)
  2. Ollama — local models
  3. OpenAI — cloud API
  4. Anthropic — cloud API
  5. Gemini — cloud API
  6. OpenRouter — multi-model proxy with credit tracking
  7. LiteLLM — catch-all proxy
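
For illustration, a priority-ordered failover loop could look like the sketch below; the `complete` and `isAvailable` helpers are assumptions, not the actual router code.

```ts
// Hypothetical sketch of priority-ordered failover across providers.
type Provider =
  | 'cli' | 'ollama' | 'openai' | 'anthropic' | 'gemini' | 'openrouter' | 'litellm';

const PRIORITY: Provider[] = ['cli', 'ollama', 'openai', 'anthropic', 'gemini', 'openrouter', 'litellm'];

async function routeCompletion(
  prompt: string,
  complete: (provider: Provider, prompt: string) => Promise<string>, // provider-specific call (assumed)
  isAvailable: (provider: Provider) => boolean,                      // health/quota check (assumed)
): Promise<string> {
  for (const provider of PRIORITY) {
    if (!isAvailable(provider)) continue;      // skip unhealthy or quota-exhausted providers
    try {
      return await complete(provider, prompt); // first successful provider wins
    } catch {
      // fall through to the next provider in priority order
    }
  }
  throw new Error('No provider available for this request');
}
```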

Models can be assigned to topics with primary and backup roles for automatic failover:

| Concept | Description |
| --- | --- |
| Topic | A category like coding, analysis, general |
| Primary | The preferred model for a topic |
| Backup | The fallback model if the primary is unavailable |

Configure topic routing through the Models page in the web UI or the API.
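
As a rough illustration of primary/backup resolution for a topic (the record shape here is an assumption, not the actual schema):

```ts
// Illustrative topic-routing lookup; field names are assumptions, not the real schema.
interface TopicRoute {
  topic: string;       // e.g. 'coding', 'analysis', 'general'
  primary: string;     // preferred model id
  backup?: string;     // fallback model id
}

function pickModel(
  routes: TopicRoute[],
  topic: string,
  isHealthy: (modelId: string) => boolean,
): string | undefined {
  const route = routes.find((r) => r.topic === topic);
  if (!route) return undefined;                       // no rule: caller falls back to the default model
  if (isHealthy(route.primary)) return route.primary; // use the primary when it is available
  if (route.backup && isHealthy(route.backup)) return route.backup; // otherwise fail over to the backup
  return undefined;
}
```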

All model configurations are stored in the database (not environment variables):

  • Default model: One model is marked as the default for unrouted messages
  • Per-model settings: Enable/disable, provider, topic roles, custom parameters
  • Extra body parameters: Per-model custom parameters via metadata.extraBody (e.g., { think: false } for Qwen3)
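
For instance, a model record with per-model settings and an extra-body override might look roughly like this; apart from metadata.extraBody, the field names are assumptions:

```ts
// Hypothetical model configuration record; only metadata.extraBody is documented above.
const qwenModel = {
  modelId: 'qwen3',               // assumed identifier
  provider: 'ollama',
  enabled: true,
  isDefault: false,
  topics: { coding: 'primary' },  // assumed topic-role shape
  metadata: {
    extraBody: { think: false },  // merged into the provider request body (e.g. disables Qwen3 "thinking")
  },
};
```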

Before registering a model, you can validate connectivity:

POST /api/models/test

This endpoint checks LiteLLM first, then direct Ollama, and supports namespaced model IDs.
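
A request might look like the sketch below; the base URL and body fields are assumptions, since only the endpoint itself is documented:

```ts
// Hypothetical connectivity test; base URL and request body fields are assumed.
const res = await fetch('http://localhost:3000/api/models/test', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ provider: 'ollama', modelId: 'ollama/qwen3' }), // namespaced model id
});
console.log(res.ok ? 'model reachable' : `test failed: ${res.status}`);
```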

The system tracks per-model token costs:

  • Input tokens: Tokens sent to the model
  • Output tokens: Tokens generated by the model
  • Cost calculation: Based on per-model pricing configuration
  • Aggregation: Costs aggregated by model, time period, and user
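
Conceptually, the cost of a single call is the token counts multiplied by the per-model prices; the pricing fields and per-million convention below are illustrative:

```ts
// Illustrative cost calculation from per-model pricing (prices per 1M tokens, assumed convention).
interface ModelPricing {
  inputPerMillion: number;   // USD per 1M input tokens
  outputPerMillion: number;  // USD per 1M output tokens
}

function callCost(pricing: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * pricing.inputPerMillion +
         (outputTokens / 1_000_000) * pricing.outputPerMillion;
}

// Example: 1,200 input + 300 output tokens at $3 / $15 per 1M tokens ≈ $0.0081
console.log(callCost({ inputPerMillion: 3, outputPerMillion: 15 }, 1200, 300));
```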

Redis-backed daily usage tracking prevents exceeding provider limits:

  • Daily quotas: Track usage per model per day
  • Auto-clearing: Quotas reset automatically at the start of each day
  • Exhaustion detection: When a model’s quota is exhausted, the router automatically falls back to the next available model
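
A minimal sketch of that pattern with Redis, assuming an ioredis client and a per-model daily counter key (the key layout and limit are illustrative):

```ts
import Redis from 'ioredis';

const redis = new Redis(); // assumes a local Redis instance

// Increment a per-model daily counter and report whether the (illustrative) limit is still respected.
async function consumeQuota(modelId: string, dailyLimit: number): Promise<boolean> {
  const day = new Date().toISOString().slice(0, 10);  // e.g. "2024-05-01" — new key each day
  const key = `quota:${modelId}:${day}`;              // assumed key layout
  const used = await redis.incr(key);
  if (used === 1) {
    await redis.expire(key, 60 * 60 * 24);            // stale counters clean themselves up
  }
  return used <= dailyLimit;                          // false => router should fall back to the next model
}
```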

Periodic health monitoring for all configured providers:

  • Latency measurement: Tracks response times for each provider
  • Availability status: Marks providers as healthy or unhealthy
  • Auto-recovery: Unhealthy providers are re-checked periodically and restored when available
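
A simplified version of such a check might measure latency and flip a healthy flag; the probe calls and interval below are assumptions:

```ts
// Simplified periodic health probe; probe functions and interval are illustrative.
interface ProviderHealth {
  healthy: boolean;
  latencyMs?: number;
  lastChecked: number;
}

const probes: Record<string, () => Promise<void>> = {
  // per-provider probes registered elsewhere, e.g. a cheap "list models" request
};
const health = new Map<string, ProviderHealth>();

async function checkProvider(name: string, probe: () => Promise<void>): Promise<void> {
  const start = Date.now();
  try {
    await probe();
    health.set(name, { healthy: true, latencyMs: Date.now() - start, lastChecked: Date.now() });
  } catch {
    health.set(name, { healthy: false, lastChecked: Date.now() });
  }
}

// Re-check every provider on an interval so unhealthy ones are restored once they recover.
setInterval(() => {
  for (const [name, probe] of Object.entries(probes)) void checkProvider(name, probe);
}, 60_000);
```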

Access health status via:

GET /api/models/health

Each provider has independent rate limiting with adaptive concurrency and circuit breaker protection. This prevents overloading providers and handles transient failures gracefully.

Rate limiting is configured per-provider for:

  • Ollama — local inference (concurrency limited by GPU memory)
  • OpenAI — cloud API (tokens-per-minute and requests-per-minute)
  • Anthropic — cloud API (requests-per-minute)
  • Gemini — cloud API (requests-per-minute)
  • DeepSeek — cloud API (requests-per-minute)
  • OpenRouter — multi-model proxy (credit-based limits)
  • LiteLLM — proxy (inherits downstream limits)

The rate limiter dynamically adjusts the number of concurrent requests based on provider response times:

  • Scale up: When responses are fast, concurrency increases to maximize throughput
  • Scale down: When latency rises or errors occur, concurrency decreases to reduce pressure
  • Per-provider: Each provider has its own concurrency window
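
One common way to implement this is an additive-increase/multiplicative-decrease controller on a per-provider concurrency limit; the sketch below is illustrative, with placeholder thresholds, not the actual limiter:

```ts
// Illustrative AIMD concurrency controller for one provider; thresholds are assumptions.
class AdaptiveConcurrency {
  constructor(
    private limit = 4,
    private readonly min = 1,
    private readonly max = 32,
    private readonly slowMs = 2000,
  ) {}

  get current(): number {
    return this.limit;
  }

  // Call after each completed request with its latency and outcome.
  record(latencyMs: number, ok: boolean): void {
    if (ok && latencyMs < this.slowMs) {
      this.limit = Math.min(this.max, this.limit + 1);             // fast success: scale up gradually
    } else {
      this.limit = Math.max(this.min, Math.floor(this.limit / 2)); // error or slow response: back off sharply
    }
  }
}
```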

When a provider experiences repeated failures, the circuit breaker trips to prevent cascading issues:

| State | Behavior |
| --- | --- |
| Closed | Normal operation — requests flow through |
| Open | Provider is failing — requests are immediately rejected and routed to fallback |
| Half-Open | After a cooldown period, a single test request is sent to check recovery |

The circuit breaker transitions back to Closed once the provider responds successfully in the half-open state. This integrates with the provider router’s failover logic to automatically route requests to healthy providers.
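
A minimal breaker with those three states could look like this sketch; the failure threshold and cooldown are placeholder values:

```ts
// Minimal circuit-breaker sketch; threshold and cooldown values are illustrative.
type State = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: State = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(private readonly threshold = 5, private readonly cooldownMs = 30_000) {}

  canRequest(): boolean {
    if (this.state === 'open' && Date.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open';      // allow a single probe request after the cooldown
    }
    return this.state !== 'open';
  }

  onSuccess(): void {
    this.state = 'closed';           // provider recovered: resume normal operation
    this.failures = 0;
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';           // trip the breaker: reject requests and route to fallbacks
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }
}
```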

Provider connection settings are supplied via environment variables:

LITELLM_URL=http://localhost:4000 # LiteLLM proxy (optional)
OLLAMA_URL=http://localhost:11434 # Local Ollama (optional)
OPENROUTER_API_KEY=sk-or-... # OpenRouter API key (optional)

OpenRouter provides access to 200+ models from multiple providers through a single API key:

  1. Create an account at openrouter.ai and add credits
  2. Store your API key in the vault as openrouter_api_key or set OPENROUTER_API_KEY
  3. Register models with provider: 'openrouter' and modelId in provider/model format (e.g., minimax/minimax-01, nvidia/llama-3.1-nemotron-ultra-253b-v1)
  4. Use the OpenRouter model search in the web UI to browse and register available models
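
Registering an OpenRouter model through the API might look like the following sketch; the base URL and any body fields beyond provider and modelId are assumptions:

```ts
// Hypothetical registration call; fields other than provider and modelId are assumed.
await fetch('http://localhost:3000/api/models', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    provider: 'openrouter',
    modelId: 'minimax/minimax-01', // OpenRouter ids use the provider/model format
    enabled: true,
  }),
});
```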
The model-management API exposes the following endpoints:

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/models | List all models |
| POST | /api/models | Register a new model |
| POST | /api/models/test | Test model connectivity |
| POST | /api/models/:id/default | Set as default model |
| GET | /api/models/routing | View topic routing |
| GET | /api/models/health | Provider health status |
| GET | /api/models/cli/status | CLI tool availability |
| GET | /api/models/cli/quota | CLI quota status |
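
For example, listing registered models and checking CLI tool availability (base URL assumed, response shapes not shown here):

```ts
// Assumes the service is reachable at localhost:3000.
const models = await fetch('http://localhost:3000/api/models').then((r) => r.json());
const cliStatus = await fetch('http://localhost:3000/api/models/cli/status').then((r) => r.json());
console.log(models, cliStatus);
```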