Testing
Octipus has two test layers: unit tests for individual modules and end-to-end (E2E) tests for the full API.
Unit Tests
Unit tests use Bun’s built-in test runner.
Running Tests

    bun test
    bun test src/utils/crypto.test.ts
    bun test --coverage

Test Location
Unit test files are co-located with their source files using the .test.ts suffix:

    src/
    ├── utils/
    │   ├── crypto.ts
    │   └── crypto.test.ts
    ├── security/
    │   ├── permissions.ts
    │   └── permissions.test.ts
    └── ...

End-to-End Tests
The E2E test suite runs 112 tests against the running backend, covering all major API endpoints.
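To make "against the running backend" concrete: each E2E test sends real HTTP requests and inspects the responses. The sketch below is one way such a request helper could look, assuming it returns the `{ status, data }` shape used by the example test in the Custom Test Runner section. `ApiClientSketch`, `FetchLike`, `ApiResponse`, and the `post` method are illustrative names chosen here, not the actual contents of scripts/e2e/client.ts.

```typescript
// Hypothetical sketch: none of these names are taken from scripts/e2e/client.ts.

// Minimal subset of the fetch API that the client relies on.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ status: number; text(): Promise<string> }>;

// Response shape assumed from the example test: `res.status` and `res.data`.
interface ApiResponse {
  status: number;
  data: any;
}

class ApiClientSketch {
  private readonly baseUrl: string;
  private readonly fetchImpl: FetchLike;

  constructor(baseUrl: string, fetchImpl: FetchLike) {
    this.baseUrl = baseUrl;
    this.fetchImpl = fetchImpl;
  }

  get(path: string): Promise<ApiResponse> {
    return this.request("GET", path);
  }

  post(path: string, body: unknown): Promise<ApiResponse> {
    return this.request("POST", path, body);
  }

  private async request(method: string, path: string, body?: unknown): Promise<ApiResponse> {
    const init: { method: string; headers: Record<string, string>; body?: string } = {
      method,
      headers: { "content-type": "application/json" },
    };
    if (body !== undefined) init.body = JSON.stringify(body);

    // Call through a local variable so the function is not invoked with
    // the client instance as `this`.
    const doFetch = this.fetchImpl;
    const res = await doFetch(this.baseUrl + path, init);

    // Tolerate empty or non-JSON bodies instead of throwing.
    const raw = await res.text();
    let data: any = null;
    if (raw.length > 0) {
      try {
        data = JSON.parse(raw);
      } catch {
        data = raw;
      }
    }
    return { status: res.status, data };
  }
}
```

Taking the fetch function as a constructor argument keeps the sketch testable without a server. With Bun or a recent Node, the global `fetch` should satisfy `FetchLike`, so `new ApiClientSketch(baseUrl, fetch)` would target a running backend.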
Prerequisites
The backend server must be running before executing E2E tests:

    # Start the backend first
    bun run dev

    # In another terminal, run E2E tests
    bun run test:e2e

Test Organization
The E2E suite is located in scripts/e2e/ and organized into test modules:

    scripts/e2e/
    ├── runner.ts      # TestRunner, assert helpers
    ├── client.ts      # APIClient wrapper
    ├── fixtures.ts    # Shared test state
    ├── index.ts       # Test orchestrator
    └── tests/         # 22 test modules
        ├── health.ts
        ├── auth.ts
        ├── models.ts
        ├── vault.ts
        ├── agents.ts
        ├── sessions.ts
        ├── hooks.ts
        ├── chat.ts
        ├── documents.ts
        ├── browser-ext.ts
        ├── messaging.ts
        ├── knowledge.ts
        ├── channels.ts
        ├── experts.ts
        ├── recurring-tasks.ts
        ├── skills.ts
        └── ...

Test Coverage Areas
| Module | What It Tests |
|---|---|
| health | Database, Redis connectivity, health probes |
| auth | Registration, login, session management |
| models | Model CRUD, health, CLI status, usage tracking |
| vault | Credential storage, update, rotation, deletion |
| skills | Skill CRUD, system skill listing |
| skill execution | MCP bridge, tool execution via API |
| agents | Spawn, events, routing, stop, status |
| sessions | CRUD, messages, pagination |
| hooks | Hook creation, enable/disable, event types |
| chat | Message sending, session continuity |
| documents | Document upload, listing, filtering, detail retrieval |
| browser-ext | Browser extension v2 tool registration (24 tools) |
| messaging | Cross-channel messaging tool, list channels |
| knowledge | Hybrid search modes (hybrid, fts, vector), tool registration |
| channels | WhatsApp webhook verification, Teams webhook handling |
| experts | Expert CRUD, system expert listing, expert-routed chat |
| recurring-tasks | Recurring task CRUD, scheduling |
Custom Test Runner
The E2E suite uses a custom TestRunner with assert helpers:

    // Example test using the TestRunner
    test('GET /health returns ok', async () => {
      const res = await client.get('/health');
      assert.equal(res.status, 200);
      assert.equal(res.data.status, 'ok');
    });

Agent Evaluation Harness
The evaluation harness (bun run eval) is a YAML-based test runner for systematically measuring agent quality. It runs 88 test cases across 8 suites with 15 assertion types, covering routing accuracy, response quality, tool selection, multi-agent orchestration, and safety.
For full documentation on the eval harness, red-team testing, and the eval UI, see the dedicated Evaluation page.
Quick Start

    # Run all evaluations (unit mode: classifier only, fast)
    bun run eval

    # Run a specific eval suite
    bun run eval -- --suite routing

    # Filter by tag
    bun run eval -- --tag openclaw-parity

    # Run against live backend (integration mode)
    bun run eval -- --integration

    # Use a specific grader model for LLM-judged assertions
    bun run eval -- --grader qwen3:14b

    # Detailed output showing all assertions
    bun run eval -- --detailed

Assertion Types
The harness supports 15 assertion types across five categories:
- Classification: `classification`, `confidence_above`, `routes_to_role`
- String matching: `contains`, `not_contains`, `matches_regex`
- Tool & behavior: `uses_tool`, `not_uses_tool`, `defense_held`
- Performance: `latency_under`, `token_count_under`
- LLM-graded: `response_quality`, `no_hallucination`, `follows_format`
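As a rough illustration of how the deterministic assertion types might be evaluated, the sketch below checks a handful of them against one agent result. The semantics are inferred from the assertion names only. The `AgentResult` and `Assertion` shapes, their field names, and both functions are assumptions made for this example; the real harness reads its cases from YAML and may structure them differently.

```typescript
// Hypothetical shapes: the real harness reads its cases from YAML and may
// name these fields differently.
interface AgentResult {
  text: string;        // final response text
  toolsUsed: string[]; // names of the tools the agent invoked
  latencyMs: number;   // wall-clock latency of the run
}

type Assertion =
  | { type: "contains"; value: string }
  | { type: "not_contains"; value: string }
  | { type: "matches_regex"; pattern: string }
  | { type: "uses_tool"; tool: string }
  | { type: "not_uses_tool"; tool: string }
  | { type: "latency_under"; ms: number };

// Returns true when the result satisfies the assertion.
function checkAssertion(a: Assertion, r: AgentResult): boolean {
  switch (a.type) {
    case "contains":
      return r.text.includes(a.value);
    case "not_contains":
      return !r.text.includes(a.value);
    case "matches_regex":
      return new RegExp(a.pattern).test(r.text);
    case "uses_tool":
      return r.toolsUsed.includes(a.tool);
    case "not_uses_tool":
      return !r.toolsUsed.includes(a.tool);
    case "latency_under":
      return r.latencyMs < a.ms;
  }
}

// Collects the assertions that do not hold; an empty array means the case passed.
function failedAssertions(assertions: Assertion[], r: AgentResult): Assertion[] {
  return assertions.filter((a) => !checkAssertion(a, r));
}
```

Modelling the assertions as a discriminated union lets the compiler flag any assertion type that the `switch` forgets to handle. The LLM-graded types are left out because they need a grader model rather than a pure function.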
Test Suites
| Suite | Tests | Category |
|---|---|---|
| routing | 12 | Core classifier routing |
| quality | 5 | Response quality |
| red-team | 5 | Adversarial safety |
| capability-routing | 19 | OpenClaw parity routing |
| capability-tools | 13 | Tool selection |
| capability-orchestration | 12 | Multi-agent behavior |
| capability-quality | 10 | Response format & hallucination |
| capability-channels | 12 | Channels, documents, knowledge |
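The per-suite counts in the table can be cross-checked against the totals quoted for the harness (88 test cases across 8 suites). A small sketch, with the numbers copied by hand from the table and variable names chosen here for illustration:

```typescript
// Suite names and test counts copied from the table above.
const suiteCounts: Record<string, number> = {
  "routing": 12,
  "quality": 5,
  "red-team": 5,
  "capability-routing": 19,
  "capability-tools": 13,
  "capability-orchestration": 12,
  "capability-quality": 10,
  "capability-channels": 12,
};

const suiteTotal = Object.keys(suiteCounts).length;
const caseTotal = Object.values(suiteCounts).reduce((sum, n) => sum + n, 0);

console.log(`${suiteTotal} suites, ${caseTotal} test cases`); // prints "8 suites, 88 test cases"
```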
Red-Team Testing
The red-team test suite validates agent safety with 5 attack plugins (injection, confusion, misuse, leakage, drift). See the Evaluation page for details.
Commands Reference
| Command | Description |
|---|---|
| `bun test` | Run all unit tests |
| `bun test --coverage` | Run tests with coverage report |
| `bun test <path>` | Run a specific test file |
| `bun run test:e2e` | Run E2E API test suite |
| `bun run eval` | Run agent evaluation harness |
| `bun run eval -- --mode unit` | Run unit-mode evaluations only |
| `bun run eval -- --suite <name>` | Run a specific eval suite |
| `bun run typecheck` | Type check without emitting |