Testing

Octipus has two test layers: unit tests for individual modules and end-to-end (E2E) tests for the full API.

Unit tests use Bun’s built-in test runner.

bun test

Unit test files are co-located with their source files using the .test.ts suffix:

src/
├── utils/
│   ├── crypto.ts
│   └── crypto.test.ts
├── security/
│   ├── permissions.ts
│   └── permissions.test.ts
└── ...

The E2E test suite runs 112 tests against the running backend, covering all major API endpoints.

The backend server must be running before executing E2E tests:

# Start the backend first
bun run dev
# In another terminal, run E2E tests
bun run test:e2e
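
Because the E2E suite assumes a live backend, a wrapper script could poll the server before starting the tests. The following is a minimal sketch under stated assumptions: `waitForBackend`, `httpHealthProbe`, the retry parameters, and the base URL `http://localhost:3000` are all hypothetical and not part of Octipus; only the `/health` path is taken from the E2E example on this page.

```typescript
// Hypothetical helper: poll a probe until it succeeds or attempts run out.
// The probe is injected so the retry logic can be exercised without a server.
type Probe = () => Promise<boolean>;

async function waitForBackend(
  probe: Probe,
  attempts = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await probe()) return true;
    } catch {
      // Connection refused or similar: the server is not up yet.
    }
    // Sleep between attempts, but not after the last one.
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Assumed base URL; adjust to wherever `bun run dev` actually listens.
function httpHealthProbe(baseUrl = "http://localhost:3000"): Probe {
  return async () => (await fetch(`${baseUrl}/health`)).ok;
}
```

A script could call `await waitForBackend(httpHealthProbe())` and exit with a non-zero code when it returns `false`, which gives a clearer failure than a batch of connection errors from the test modules.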

The E2E suite is located in scripts/e2e/ and organized into test modules:

scripts/e2e/
├── runner.ts      # TestRunner, assert helpers
├── client.ts      # APIClient wrapper
├── fixtures.ts    # Shared test state
├── index.ts       # Test orchestrator
└── tests/         # 22 test modules
    ├── health.ts
    ├── auth.ts
    ├── models.ts
    ├── vault.ts
    ├── agents.ts
    ├── sessions.ts
    ├── hooks.ts
    ├── chat.ts
    ├── documents.ts
    ├── browser-ext.ts
    ├── messaging.ts
    ├── knowledge.ts
    ├── channels.ts
    ├── experts.ts
    ├── recurring-tasks.ts
    ├── skills.ts
    └── ...

| Module | What It Tests |
| --- | --- |
| health | Database, Redis connectivity, health probes |
| auth | Registration, login, session management |
| models | Model CRUD, health, CLI status, usage tracking |
| vault | Credential storage, update, rotation, deletion |
| skills | Skill CRUD, system skill listing |
| skill execution | MCP bridge, tool execution via API |
| agents | Spawn, events, routing, stop, status |
| sessions | CRUD, messages, pagination |
| hooks | Hook creation, enable/disable, event types |
| chat | Message sending, session continuity |
| documents | Document upload, listing, filtering, detail retrieval |
| browser-ext | Browser extension v2 tool registration (24 tools) |
| messaging | Cross-channel messaging tool, list channels |
| knowledge | Hybrid search modes (hybrid, fts, vector), tool registration |
| channels | WhatsApp webhook verification, Teams webhook handling |
| experts | Expert CRUD, system expert listing, expert-routed chat |
| recurring-tasks | Recurring task CRUD, scheduling |

The E2E suite uses a custom TestRunner with assert helpers:

// Example test using the TestRunner
test('GET /health returns ok', async () => {
  const res = await client.get('/health');
  assert.equal(res.status, 200);
  assert.equal(res.data.status, 'ok');
});
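
The runner itself lives in `scripts/e2e/runner.ts` and is not reproduced on this page. As a rough illustration of how a runner with registered tests and an `equal` helper might be structured, here is a minimal sketch. The class shape, the `run()` method, the `TestResult` type, and the `assertions` object (named that way here to avoid clashing with Node's built-in `assert` module) are assumptions, not the actual Octipus implementation.

```typescript
// Illustrative only: a tiny sequential test runner. The real TestRunner in
// scripts/e2e/runner.ts may differ in every detail.
type TestFn = () => void | Promise<void>;

interface TestResult {
  name: string;
  passed: boolean;
  error?: string;
}

const assertions = {
  // Throws when the two values are not strictly equal.
  equal<T>(actual: T, expected: T, message?: string): void {
    if (actual !== expected) {
      throw new Error(
        message ?? `expected ${String(expected)}, got ${String(actual)}`,
      );
    }
  },
};

class TestRunner {
  private tests: { name: string; fn: TestFn }[] = [];

  // Register a test; nothing executes until run() is called.
  test(name: string, fn: TestFn): void {
    this.tests.push({ name, fn });
  }

  // Run tests one after another so shared state stays predictable.
  async run(): Promise<TestResult[]> {
    const results: TestResult[] = [];
    for (const { name, fn } of this.tests) {
      try {
        await fn();
        results.push({ name, passed: true });
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        results.push({ name, passed: false, error });
      }
    }
    return results;
  }
}
```

Running tests sequentially rather than in parallel is a deliberate choice in this sketch: E2E modules that share fixtures (for example, a session created by an earlier test) depend on a stable execution order.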

The evaluation harness (bun run eval) is a YAML-based test runner for systematically measuring agent quality. It runs 88 test cases across 8 suites with 15 assertion types, covering routing accuracy, response quality, tool selection, multi-agent orchestration, and safety.

For full documentation on the eval harness, red-team testing, and the eval UI, see the dedicated Evaluation page.

# Run all evaluations (unit mode — classifier only, fast)
bun run eval
# Run a specific eval suite
bun run eval -- --suite routing
# Filter by tag
bun run eval -- --tag openclaw-parity
# Run against live backend (integration mode)
bun run eval -- --integration
# Use a specific grader model for LLM-judged assertions
bun run eval -- --grader qwen3:14b
# Detailed output showing all assertions
bun run eval -- --detailed

The harness supports 15 assertion types across five categories:

  • Classification: classification, confidence_above, routes_to_role
  • String matching: contains, not_contains, matches_regex
  • Tool & behavior: uses_tool, not_uses_tool, defense_held
  • Performance: latency_under, token_count_under
  • LLM-graded: response_quality, no_hallucination, follows_format
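
To make the deterministic assertion types more concrete, the sketch below shows one way a few of them could be evaluated against an agent's output. Only the `type` names come from the list above; the `AgentOutput` shape, the `value` field, and the `check` function are hypothetical, and the real YAML schema is described on the Evaluation page.

```typescript
// Hypothetical shapes: only the `type` names are taken from the harness docs.
interface AgentOutput {
  text: string;
  latencyMs: number;
  toolsUsed: string[];
}

type Assertion =
  | { type: "contains" | "not_contains" | "matches_regex"; value: string }
  | { type: "uses_tool" | "not_uses_tool"; value: string }
  | { type: "latency_under"; value: number };

// Returns true when the output satisfies the assertion.
function check(a: Assertion, out: AgentOutput): boolean {
  switch (a.type) {
    case "contains":
      return out.text.includes(a.value);
    case "not_contains":
      return !out.text.includes(a.value);
    case "matches_regex":
      return new RegExp(a.value).test(out.text);
    case "uses_tool":
      return out.toolsUsed.includes(a.value);
    case "not_uses_tool":
      return !out.toolsUsed.includes(a.value);
    case "latency_under":
      return out.latencyMs < a.value; // value is a millisecond budget (assumed unit)
  }
}
```

LLM-graded types such as `response_quality` are left out of this sketch because they require a grader model rather than a pure function.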
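
The 88 test cases are distributed across the 8 suites as follows:

| Suite | Tests | Category |
| --- | --- | --- |
| routing | 12 | Core classifier routing |
| quality | 5 | Response quality |
| red-team | 5 | Adversarial safety |
| capability-routing | 19 | OpenClaw parity routing |
| capability-tools | 13 | Tool selection |
| capability-orchestration | 12 | Multi-agent behavior |
| capability-quality | 10 | Response format & hallucination |
| capability-channels | 12 | Channels, documents, knowledge |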

The red-team test suite validates agent safety with 5 attack plugins (injection, confusion, misuse, leakage, drift). See the Evaluation page for details.

| Command | Description |
| --- | --- |
| bun test | Run all unit tests |
| bun test --coverage | Run tests with coverage report |
| bun test <path> | Run a specific test file |
| bun run test:e2e | Run E2E API test suite |
| bun run eval | Run agent evaluation harness |
| bun run eval -- --mode unit | Run unit-mode evaluations only |
| bun run eval -- --suite <name> | Run a specific eval suite |
| bun run typecheck | Type check without emitting |