Voice & Phone Calls

Octipus includes voice capabilities at two levels: local voice I/O for hands-free interaction, and phone calls for making and receiving actual calls through telephony providers.

Local Voice I/O

Speech-to-Text (STT)

Engine	Type	Description
Whisper.cpp	Local	C++ Whisper — fast, private, fully offline
Faster-Whisper	Local	Python + CTranslate2 — optimized for speed
OpenAI Whisper	Cloud	API fallback when local is unavailable

API: POST /api/voice/transcribe — send base64-encoded audio, receive transcribed text.

Text-to-Speech (TTS)

Engine	Type	Description
Piper	Local	Neural TTS — fast, high quality, fully offline
Edge TTS	Cloud	Microsoft Edge TTS — 200+ neural voices
Coqui	Local	Neural TTS — multi-language support

Wake Word Detection

Engine	Type	Description
Sherpa-ONNX	Local	ONNX-based keyword spotting — fast, private
Picovoice/Porcupine	Cloud	High accuracy — requires API key
VAD Fallback	Local	Voice Activity Detection — triggers on any speech

Phone Calls

Make and receive actual phone calls through telephony providers. Conversations are powered by the same LLM models you use for chat — with a fast, direct path optimized for voice latency.

Supported Providers

Provider	Protocol	What You Need
Twilio	Programmable Voice + Media Streams	Account SID + Auth Token
Telnyx	Call Control v2	API Key + Connection ID
Plivo	Voice API + XML	Auth ID + Auth Token

All three providers support outbound calls, inbound calls, and webhook-based conversation flow. Phone numbers are auto-detected from your provider account.

Quick Setup

1. Store credentials in the vault

Add your provider’s API credentials in Settings > Vault:

For Twilio: twilio_account_sid and twilio_auth_token For Telnyx: telnyx_api_key and telnyx_connection_id For Plivo: plivo_auth_id and plivo_auth_token

2. Configure the provider

Set two settings (via Settings page or API):

Setting	Value	Example
`voice.telephonyProvider`	Your provider name	`twilio`
`voice.publicUrl`	Your public webhook URL	`https://abc123.ngrok.io`

3. Assign a model to the “voice” topic

In the Models page, assign a fast model to the voice topic. Local Ollama models give the lowest latency.

That’s it — no other configuration needed. The expert prompt is loaded from the Communicator expert automatically.

Twilio Setup Guide

Create a Twilio account at twilio.com
Get credentials from Twilio Console → Account → API keys & tokens:
- Account SID: starts with AC + 32 hex chars (34 total)
- Auth Token: 32 hex chars
Store in vault: Add twilio_account_sid and twilio_auth_token in Settings → Vault
Phone number: Auto-detected from your Twilio account — no manual config needed. Buy a number in the Twilio Console if you don’t have one.
Set webhook URL: Must be publicly accessible (use ngrok or Cloudflare Tunnel)
Configure Twilio webhook: Point your Twilio phone number’s webhook to https://your-url/api/voice/webhook/twilio

Common Issues

Problem	Cause	Fix
HTTP 401	Auth Token mismatch	Regenerate in Twilio Console → API keys & tokens
HTTP 403	Account suspended, old token, or sub-account mismatch	Verify in Twilio Console
HTTP 404	Invalid Account SID	Check vault secret `twilio_account_sid`
No phone number detected	Account has no numbers	Buy one in Twilio Console
Webhook timeout	Network/firewall	Check public URL accessibility

Provider Comparison

Feature	Twilio	Telnyx	Plivo
Auth	Basic (SID:Token)	Bearer token	Basic (ID:Token)
Call Control	TwiML (XML)	Call Control v2 (JSON)	XML
Speech Gather	`<Gather>`	API command	`<GetInput>`
Webhook Verify	HMAC-SHA1	Ed25519	HMAC-SHA256
End Call	POST Status=completed	POST hangup action	DELETE
Phone Detection	Auto from account	Manual config	Manual config

How It Works

Outbound Calls — Notify Mode

The agent speaks a message and hangs up. Good for alerts, reminders, and status updates.

User: "Call +1234567890 and tell them the server is back up"
→ Agent uses initiate_call tool (notify mode)
→ Provider dials the number
→ Person answers → TTS speaks message → Hangup

Outbound Calls — Conversation Mode

Interactive voice exchange with the caller. The assistant listens, thinks, and responds — like a phone conversation.

User: "Call the client and discuss the project timeline"
→ Agent uses initiate_call tool (conversation mode)
→ Provider dials → Person answers → TTS speaks greeting
→ Person speaks → STT transcribes → LLM responds → TTS speaks back
→ Repeat until either side hangs up

Fast Conversation Path

Voice calls bypass the orchestrator entirely for low latency:

Caller speaks → Provider STT (~1s) → Direct LLM call (~1-3s) → Provider TTS (~0.5s)

~2-5 seconds per turn — no classification, no worker spawning, no tool execution. The model assigned to the voice topic is called directly with the Communicator expert’s system prompt plus voice-specific instructions.

Tool Actions

Agents interact with phone calls through 5 tool actions:

Action	Description	Permission
`initiate_call`	Start a call (notify or conversation mode)	Requires approval
`continue_call`	Send next message in active conversation	Auto-approved
`end_call`	Hang up an active call	Auto-approved
`get_status`	Check call state	Auto-approved
`list_calls`	List all active calls	Auto-approved

The voice_call tool is assigned to the Communication role and available to all communication-focused agents.

Inbound Calls

Inbound calls are disabled by default for security. To enable:

Setting	Value	Description
`voice.inboundPolicy`	`allowlist`	Only accept calls from listed numbers
`voice.inboundAllowFrom`	`["+1234567890"]`	Allowed caller phone numbers (E.164)

Set voice.inboundPolicy to open to accept calls from any number (not recommended for production).

Configure your provider to send webhooks to:

https://your-public-url/api/voice/webhook/twilio

(Replace twilio with telnyx or plivo as appropriate.)

API Endpoints

Method	Path	Description
`POST`	`/api/voice/transcribe`	Transcribe audio (local STT)
`GET`	`/api/voice/status`	Voice subsystem status
`POST`	`/api/voice/webhook/:provider`	Telephony webhook
`GET`	`/api/voice/calls`	List active calls
`GET`	`/api/voice/telephony/health`	Provider health check

Security

Webhook signature verification for all providers (Twilio HMAC-SHA1, Telnyx Ed25519, Plivo HMAC-SHA256)
Permission gating — initiating calls requires user approval
Inbound filtering — allowlist-based caller ID filtering
Conversation isolation — each call has its own context, no cross-call leakage