
Voice & Phone Calls

Octipus includes voice capabilities at two levels: local voice I/O for hands-free interaction, and phone calls for making and receiving actual calls through telephony providers.

| Speech-to-text engine | Type | Description |
|---|---|---|
| Whisper.cpp | Local | C++ Whisper: fast, private, fully offline |
| Faster-Whisper | Local | Python + CTranslate2: optimized for speed |
| OpenAI Whisper | Cloud | API fallback when local is unavailable |

API: POST /api/voice/transcribe — send base64-encoded audio, receive transcribed text.
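
As a rough illustration of the transcribe call, the sketch below builds the request without sending it. The JSON field name `audio`, the base URL, and the port are assumptions made for the example, not documented values; check the actual request schema before relying on them.

```python
import base64
import json
import urllib.request


def build_transcribe_request(
    audio: bytes, base_url: str = "http://localhost:3000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/voice/transcribe."""
    # ASSUMPTION: the endpoint accepts JSON with the audio under an "audio" key.
    body = json.dumps({"audio": base64.b64encode(audio).decode("ascii")})
    return urllib.request.Request(
        f"{base_url}/api/voice/transcribe",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the response shape is not specified here, so parsing is left out.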

| Text-to-speech engine | Type | Description |
|---|---|---|
| Piper | Local | Neural TTS: fast, high quality, fully offline |
| Edge TTS | Cloud | Microsoft Edge TTS: 200+ neural voices |
| Coqui | Local | Neural TTS: multi-language support |

| Wake-word engine | Type | Description |
|---|---|---|
| Sherpa-ONNX | Local | ONNX-based keyword spotting: fast, private |
| Picovoice/Porcupine | Cloud | High accuracy; requires API key |
| VAD Fallback | Local | Voice Activity Detection: triggers on any speech |

Make and receive actual phone calls through telephony providers. Conversations are powered by the same LLM models you use for chat — with a fast, direct path optimized for voice latency.

| Provider | Protocol | What You Need |
|---|---|---|
| Twilio | Programmable Voice + Media Streams | Account SID + Auth Token |
| Telnyx | Call Control v2 | API Key + Connection ID |
| Plivo | Voice API + XML | Auth ID + Auth Token |

All three providers support outbound calls, inbound calls, and webhook-based conversation flow. Phone numbers are auto-detected from your Twilio account; for Telnyx and Plivo, the provider comparison lists manual configuration.

1. Store credentials in the vault

Add your provider’s API credentials in Settings > Vault:

  • Twilio: twilio_account_sid and twilio_auth_token
  • Telnyx: telnyx_api_key and telnyx_connection_id
  • Plivo: plivo_auth_id and plivo_auth_token

2. Configure the provider

Set two settings (via Settings page or API):

| Setting | Value | Example |
|---|---|---|
| voice.telephonyProvider | Your provider name | twilio |
| voice.publicUrl | Your public webhook URL | https://abc123.ngrok.io |

3. Assign a model to the “voice” topic

In the Models page, assign a fast model to the voice topic. Local Ollama models give the lowest latency.

That’s it — no other configuration needed. The expert prompt is loaded from the Communicator expert automatically.

  1. Create a Twilio account at twilio.com
  2. Get credentials from Twilio Console → Account → API keys & tokens:
    • Account SID: starts with AC + 32 hex chars (34 total)
    • Auth Token: 32 hex chars
  3. Store in vault: Add twilio_account_sid and twilio_auth_token in Settings → Vault
  4. Phone number: Auto-detected from your Twilio account — no manual config needed. Buy a number in the Twilio Console if you don’t have one.
  5. Set webhook URL: Must be publicly accessible (use ngrok or Cloudflare Tunnel)
  6. Configure Twilio webhook: Point your Twilio phone number’s webhook to https://your-url/api/voice/webhook/twilio
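
Before storing Twilio credentials in the vault, a quick shape check can catch copy-paste mistakes. The sketch below only encodes the formats stated in the steps above (AC plus 32 hex characters, and 32 hex characters); the function name is hypothetical, and passing the check does not prove the credentials are valid.

```python
import re

# Formats taken from the setup steps: "AC" + 32 hex chars, and 32 hex chars.
ACCOUNT_SID_RE = re.compile(r"AC[0-9a-fA-F]{32}")
AUTH_TOKEN_RE = re.compile(r"[0-9a-fA-F]{32}")


def looks_like_twilio_credentials(account_sid: str, auth_token: str) -> bool:
    """Return True when both values have the expected shape (not validity)."""
    return bool(
        ACCOUNT_SID_RE.fullmatch(account_sid) and AUTH_TOKEN_RE.fullmatch(auth_token)
    )
```

A value that fails this check is a likely cause of the HTTP 401 or 404 errors when Twilio is contacted.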

| Problem | Cause | Fix |
|---|---|---|
| HTTP 401 | Auth Token mismatch | Regenerate in Twilio Console → API keys & tokens |
| HTTP 403 | Account suspended, old token, or sub-account mismatch | Verify in Twilio Console |
| HTTP 404 | Invalid Account SID | Check vault secret twilio_account_sid |
| No phone number detected | Account has no numbers | Buy one in Twilio Console |
| Webhook timeout | Network/firewall | Check public URL accessibility |

| Feature | Twilio | Telnyx | Plivo |
|---|---|---|---|
| Auth | Basic (SID:Token) | Bearer token | Basic (ID:Token) |
| Call Control | TwiML (XML) | Call Control v2 (JSON) | XML |
| Speech Gather | <Gather> | API command | <GetInput> |
| Webhook Verify | HMAC-SHA1 | Ed25519 | HMAC-SHA256 |
| End Call | POST Status=completed | POST hangup action | DELETE |
| Phone Detection | Auto from account | Manual config | Manual config |
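
To make the Auth row concrete, here is a minimal sketch of how the Authorization header differs per provider. The `secrets` dictionary stands in for the vault and uses the secret names listed earlier; the function itself is hypothetical and not the product's actual code.

```python
import base64


def auth_header(provider: str, secrets: dict[str, str]) -> dict[str, str]:
    """Build the Authorization header for a telephony provider."""
    if provider == "telnyx":
        # Telnyx: the API key is sent as a bearer token.
        return {"Authorization": f"Bearer {secrets['telnyx_api_key']}"}
    if provider == "twilio":
        pair = f"{secrets['twilio_account_sid']}:{secrets['twilio_auth_token']}"
    elif provider == "plivo":
        pair = f"{secrets['plivo_auth_id']}:{secrets['plivo_auth_token']}"
    else:
        raise ValueError(f"unsupported provider: {provider}")
    # Twilio and Plivo: HTTP Basic auth, i.e. base64 of "id:token".
    token = base64.b64encode(pair.encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```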

In notify mode, the agent speaks a message and hangs up. This suits alerts, reminders, and status updates.

User: "Call +1234567890 and tell them the server is back up"
→ Agent uses initiate_call tool (notify mode)
→ Provider dials the number
→ Person answers → TTS speaks message → Hangup

In conversation mode, the agent holds an interactive voice exchange with the other party: it listens, thinks, and responds, like a phone conversation.

User: "Call the client and discuss the project timeline"
→ Agent uses initiate_call tool (conversation mode)
→ Provider dials → Person answers → TTS speaks greeting
→ Person speaks → STT transcribes → LLM responds → TTS speaks back
→ Repeat until either side hangs up
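
The conversation flow above can be sketched as a simple loop. All names here (`run_conversation`, `listen`, `respond`, `speak`, `max_turns`) are hypothetical stand-ins for the provider's STT, the LLM call, and the provider's TTS; this is an illustration of the turn structure, not the actual implementation.

```python
from typing import Callable, Optional

Transcript = list[tuple[str, str]]


def run_conversation(
    greeting: str,
    listen: Callable[[], Optional[str]],   # returns None when the call ends
    respond: Callable[[Transcript], str],  # LLM reply given the transcript so far
    speak: Callable[[str], None],          # TTS playback to the other party
    max_turns: int = 20,                   # assumed safety cap, not a documented limit
) -> Transcript:
    """Greet, then alternate listen -> respond -> speak until hangup."""
    speak(greeting)
    transcript: Transcript = [("agent", greeting)]
    for _ in range(max_turns):
        heard = listen()
        if heard is None:  # either side hung up
            break
        transcript.append(("caller", heard))
        reply = respond(transcript)
        speak(reply)
        transcript.append(("agent", reply))
    return transcript
```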

Voice calls bypass the orchestrator entirely for low latency:

Caller speaks → Provider STT (~1s) → Direct LLM call (~1-3s) → Provider TTS (~0.5s)

Each turn takes roughly 2-5 seconds because this path involves no classification, no worker spawning, and no tool execution. The model assigned to the voice topic is called directly with the Communicator expert’s system prompt plus voice-specific instructions.
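
The per-turn figure follows from adding up the stage estimates in the flow above. The short sketch below does that arithmetic; the stage names are labels chosen for the example, and the numbers are the approximate values quoted above rather than measurements.

```python
# (best case, worst case) in seconds, taken from the approximate figures above.
STAGES = {
    "provider_stt": (1.0, 1.0),
    "direct_llm_call": (1.0, 3.0),
    "provider_tts": (0.5, 0.5),
}


def turn_latency(stages: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the best-case and worst-case latency over all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


best, worst = turn_latency(STAGES)
print(f"{best}-{worst} s per turn")  # prints "2.5-4.5 s per turn"
```

The 2.5 to 4.5 second range sits inside the quoted ~2-5 seconds, with the LLM call as the dominant and most variable stage.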

Agents interact with phone calls through 5 tool actions:

| Action | Description | Permission |
|---|---|---|
| initiate_call | Start a call (notify or conversation mode) | Requires approval |
| continue_call | Send next message in active conversation | Auto-approved |
| end_call | Hang up an active call | Auto-approved |
| get_status | Check call state | Auto-approved |
| list_calls | List all active calls | Auto-approved |

The voice_call tool is assigned to the Communication role and available to all communication-focused agents.
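
The permission column can be read as a small lookup. The sketch below mirrors the table of actions; the dictionary and function names are hypothetical, and how approval is actually requested from the user is not shown.

```python
# True means the action needs user approval, per the table of tool actions.
APPROVAL_REQUIRED = {
    "initiate_call": True,
    "continue_call": False,
    "end_call": False,
    "get_status": False,
    "list_calls": False,
}


def requires_approval(action: str) -> bool:
    """Return whether a voice_call action must be approved by the user."""
    try:
        return APPROVAL_REQUIRED[action]
    except KeyError:
        raise ValueError(f"unknown voice_call action: {action}") from None
```

Rejecting unknown actions outright, rather than defaulting to auto-approval, keeps the gate fail-closed.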

Inbound calls are disabled by default for security. To enable:

| Setting | Value | Description |
|---|---|---|
| voice.inboundPolicy | allowlist | Only accept calls from listed numbers |
| voice.inboundAllowFrom | ["+1234567890"] | Allowed caller phone numbers (E.164) |

Set voice.inboundPolicy to open to accept calls from any number (not recommended for production).
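
A minimal sketch of how the two inbound settings might combine into an accept/reject decision follows. The policy values allowlist and open come from the text above; the value "disabled" for the default state and the function name are assumptions made for the example.

```python
from typing import Iterable


def accept_inbound(
    caller: str,
    policy: str = "disabled",          # ASSUMPTION: name of the default policy
    allow_from: Iterable[str] = (),    # voice.inboundAllowFrom, E.164 numbers
) -> bool:
    """Decide whether an inbound call from `caller` should be answered."""
    if policy == "open":
        return True  # any number; not recommended for production
    if policy == "allowlist":
        return caller in set(allow_from)
    return False  # disabled or unrecognized policy: reject
```

Treating any unrecognized policy as a rejection matches the secure-by-default behavior described above.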

Configure your provider to send webhooks to:

https://your-public-url/api/voice/webhook/twilio

(Replace twilio with telnyx or plivo as appropriate.)
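
The webhook URL is simply the value of voice.publicUrl plus the provider-specific path. A small helper along these lines (the function name is hypothetical) avoids double slashes and typos in the provider segment:

```python
PROVIDERS = ("twilio", "telnyx", "plivo")


def webhook_url(public_url: str, provider: str) -> str:
    """Join the public base URL with /api/voice/webhook/<provider>."""
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider}")
    return f"{public_url.rstrip('/')}/api/voice/webhook/{provider}"


print(webhook_url("https://abc123.ngrok.io", "telnyx"))
# prints "https://abc123.ngrok.io/api/voice/webhook/telnyx"
```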

| Method | Path | Description |
|---|---|---|
| POST | /api/voice/transcribe | Transcribe audio (local STT) |
| GET | /api/voice/status | Voice subsystem status |
| POST | /api/voice/webhook/:provider | Telephony webhook |
| GET | /api/voice/calls | List active calls |
| GET | /api/voice/telephony/health | Provider health check |

  • Webhook signature verification for all providers (Twilio HMAC-SHA1, Telnyx Ed25519, Plivo HMAC-SHA256)
  • Permission gating — initiating calls requires user approval
  • Inbound filtering — allowlist-based caller ID filtering
  • Conversation isolation — each call has its own context, no cross-call leakage
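
As an illustration of the Twilio HMAC-SHA1 check listed above, the sketch below follows Twilio's published scheme for form-encoded POST webhooks: append each parameter name and value, sorted by name, to the full request URL, sign with the Auth Token, and base64-encode the digest. It is a simplified example rather than the product's implementation, and the function names are hypothetical; Telnyx (Ed25519) and Plivo (HMAC-SHA256) need their own checks.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the expected X-Twilio-Signature for a form-encoded POST."""
    # URL followed by key+value pairs, sorted by key, with no separators.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(
    auth_token: str, url: str, params: dict[str, str], header_signature: str
) -> bool:
    """Constant-time comparison against the signature header value."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

The URL passed in must be the public URL Twilio actually called, including scheme and path, which is one reason voice.publicUrl has to match the tunnel address exactly.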