Testing

Octipus has two test layers: unit tests for individual modules and end-to-end (E2E) tests for the full API. A separate evaluation harness measures agent quality.

Unit Tests

Unit tests use Bun’s built-in test runner.

Running Tests

# Run all unit tests
bun test

# Run a specific test file
bun test src/utils/crypto.test.ts

# Run tests with a coverage report
bun test --coverage

Test Location

Unit test files are co-located with their source files using the .test.ts suffix:

src/
├── utils/
│   ├── crypto.ts
│   └── crypto.test.ts
├── security/
│   ├── permissions.ts
│   └── permissions.test.ts
└── ...

End-to-End Tests

The E2E test suite runs 112 tests against the running backend, covering all major API endpoints.

Prerequisites

Start the backend server before running the E2E tests:

# Start the backend first
bun run dev

# In another terminal, run E2E tests
bun run test:e2e
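
Because the suite depends on a backend that is already accepting requests, a small readiness check before the run can save a confusing first failure. The following is a minimal sketch under stated assumptions: `waitForBackend`, the `Probe` type, and the retry defaults are hypothetical and not part of the Octipus codebase. The probe is injected so the helper does not hard-code any URL.

```typescript
// Hypothetical readiness helper. Names and defaults are assumptions,
// not taken from the Octipus codebase.
type Probe = () => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls `probe` until it reports success or the attempts are used up.
// Resolves with the number of attempts taken; rejects if the backend
// never answered.
async function waitForBackend(
  probe: Probe,
  maxAttempts = 10,
  delayMs = 500,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // A rejected probe (for example, connection refused) counts as "not ready".
    const ready = await probe().catch(() => false);
    if (ready) return attempt;
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  throw new Error(`Backend not reachable after ${maxAttempts} attempts`);
}
```

A probe could be as simple as `async () => (await fetch(url)).ok`, where `url` points at whatever address and health route your backend actually exposes; that address is not specified here.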

Test Organization

The E2E suite lives in scripts/e2e/ and is organized into test modules:

scripts/e2e/
├── runner.ts       # TestRunner, assert helpers
├── client.ts       # APIClient wrapper
├── fixtures.ts     # Shared test state
├── index.ts        # Test orchestrator
└── tests/          # 22 test modules
    ├── health.ts
    ├── auth.ts
    ├── models.ts
    ├── vault.ts
    ├── agents.ts
    ├── sessions.ts
    ├── hooks.ts
    ├── chat.ts
    ├── documents.ts
    ├── browser-ext.ts
    ├── messaging.ts
    ├── knowledge.ts
    ├── channels.ts
    ├── experts.ts
    ├── recurring-tasks.ts
    ├── skills.ts
    └── ...
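
To make the role of the client wrapper more concrete, here is a minimal sketch of what a thin API client could look like. It is an illustration only: `ApiClientSketch`, `FetchLike`, and the response shape are assumptions, and the real scripts/e2e/client.ts may differ. The fetch implementation is passed in so the class can be exercised without a network.

```typescript
// Hypothetical sketch of a thin API client wrapper. The real
// scripts/e2e/client.ts is not shown here and may look quite different.
interface ApiResponse {
  status: number;
  data: any; // parsed JSON when possible, otherwise the raw text
}

// Minimal structural types so the sketch does not depend on DOM typings.
interface MinimalResponse {
  status: number;
  text(): Promise<string>;
}
interface MinimalInit {
  method: string;
  headers: Record<string, string>;
  body?: string;
}
type FetchLike = (url: string, init: MinimalInit) => Promise<MinimalResponse>;

class ApiClientSketch {
  private baseUrl: string;
  private fetchImpl: FetchLike;

  constructor(baseUrl: string, fetchImpl: FetchLike) {
    this.baseUrl = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
    this.fetchImpl = fetchImpl;
  }

  get(path: string): Promise<ApiResponse> {
    return this.request("GET", path);
  }

  post(path: string, body: unknown): Promise<ApiResponse> {
    return this.request("POST", path, body);
  }

  private async request(
    method: string,
    path: string,
    body?: unknown,
  ): Promise<ApiResponse> {
    const init: MinimalInit = {
      method,
      headers: { "Content-Type": "application/json" },
    };
    if (body !== undefined) init.body = JSON.stringify(body);

    const res = await this.fetchImpl(this.baseUrl + path, init);
    const text = await res.text();

    // Tests usually want to inspect error bodies too, so non-2xx responses
    // are returned rather than thrown.
    let data: any = text;
    try {
      data = JSON.parse(text);
    } catch {
      // Not JSON: keep the raw text.
    }
    return { status: res.status, data };
  }
}
```

Returning `{ status, data }` for every response, instead of throwing on errors, lets a test assert directly on status codes such as 401 or 404.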

Test Coverage Areas

Module	What It Tests
health	Database, Redis connectivity, health probes
auth	Registration, login, session management
models	Model CRUD, health, CLI status, usage tracking
vault	Credential storage, update, rotation, deletion
skills	Skill CRUD, system skill listing
skill execution	MCP bridge, tool execution via API
agents	Spawn, events, routing, stop, status
sessions	CRUD, messages, pagination
hooks	Hook creation, enable/disable, event types
chat	Message sending, session continuity
documents	Document upload, listing, filtering, detail retrieval
browser-ext	Browser extension v2 tool registration (24 tools)
messaging	Cross-channel messaging tool, list channels
knowledge	Hybrid search modes (hybrid, fts, vector), tool registration
channels	WhatsApp webhook verification, Teams webhook handling
experts	Expert CRUD, system expert listing, expert-routed chat
recurring-tasks	Recurring task CRUD, scheduling

Custom Test Runner

The E2E suite uses a custom TestRunner with assert helpers:

// Example test using the TestRunner
test('GET /health returns ok', async () => {
  const res = await client.get('/health');
  assert.equal(res.status, 200);
  assert.equal(res.data.status, 'ok');
});
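
For readers curious how `test` and `assert.equal` might fit together, the following is a minimal sketch of a runner of this kind. It is not the actual scripts/e2e/runner.ts: `createRunner`, `TestResult`, and the `ok` helper are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch of a minimal test runner. The real
// scripts/e2e/runner.ts is not shown in this document and may differ.
type TestFn = () => Promise<void> | void;

interface TestResult {
  name: string;
  passed: boolean;
  error?: string;
}

function createRunner() {
  const registered: { name: string; fn: TestFn }[] = [];

  // Register a test; nothing executes until run() is called.
  const test = (name: string, fn: TestFn): void => {
    registered.push({ name, fn });
  };

  // Assert helpers throw on failure; the runner turns the throw into a result.
  const assert = {
    equal(actual: unknown, expected: unknown, message?: string): void {
      if (actual !== expected) {
        throw new Error(
          message ??
            `expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
        );
      }
    },
    ok(value: unknown, message = "expected a truthy value"): void {
      if (!value) throw new Error(message);
    },
  };

  // Run tests one at a time, in registration order, so the outcome is
  // deterministic even when tests share state.
  const run = async (): Promise<TestResult[]> => {
    const results: TestResult[] = [];
    for (const { name, fn } of registered) {
      try {
        await fn();
        results.push({ name, passed: true });
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        results.push({ name, passed: false, error });
      }
    }
    return results;
  };

  return { test, assert, run };
}
```

With this shape, a failing assertion marks only its own test as failed and the remaining tests still run, which is usually what you want from an API suite of this size.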

Agent Evaluation Harness

The evaluation harness (bun run eval) is a YAML-based test runner for systematically measuring agent quality. It runs 88 test cases across 8 suites with 15 assertion types, covering routing accuracy, response quality, tool selection, multi-agent orchestration, and safety.

For full documentation on the eval harness, red-team testing, and the eval UI, see the dedicated Evaluation page.

Quick Start

# Run all evaluations (unit mode — classifier only, fast)
bun run eval

# Run a specific eval suite
bun run eval -- --suite routing

# Filter by tag
bun run eval -- --tag openclaw-parity

# Run against live backend (integration mode)
bun run eval -- --integration

# Use a specific grader model for LLM-judged assertions
bun run eval -- --grader qwen3:14b

# Detailed output showing all assertions
bun run eval -- --detailed

Assertion Types

The harness supports 15 assertion types across five categories:

Classification: classification, confidence_above, routes_to_role
String matching: contains, not_contains, matches_regex
Tool & behavior: uses_tool, not_uses_tool, defense_held
Performance: latency_under, token_count_under
LLM-graded: response_quality, no_hallucination, follows_format
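
The deterministic assertion types in the list above can be evaluated without a grader model. The sketch below shows one way such checks could be expressed. The assertion names come from the list, but the `Assertion` and `AgentOutput` shapes and the `checkAssertion` function are assumptions made for illustration, not the harness's actual schema.

```typescript
// Hypothetical sketch. Assertion names match the list above; the field
// names and data shapes are assumptions, not the harness's real schema.
interface AgentOutput {
  text: string;
  toolsUsed: string[];
  latencyMs: number;
}

type Assertion =
  | { type: "contains"; value: string }
  | { type: "not_contains"; value: string }
  | { type: "matches_regex"; pattern: string }
  | { type: "uses_tool"; tool: string }
  | { type: "not_uses_tool"; tool: string }
  | { type: "latency_under"; ms: number };

// Returns true when the output satisfies the assertion.
function checkAssertion(a: Assertion, out: AgentOutput): boolean {
  switch (a.type) {
    case "contains":
      return out.text.includes(a.value);
    case "not_contains":
      return !out.text.includes(a.value);
    case "matches_regex":
      return new RegExp(a.pattern).test(out.text);
    case "uses_tool":
      return out.toolsUsed.includes(a.tool);
    case "not_uses_tool":
      return !out.toolsUsed.includes(a.tool);
    case "latency_under":
      return out.latencyMs < a.ms;
  }
}

// A test case passes only if every one of its assertions holds.
function checkAll(assertions: Assertion[], out: AgentOutput): boolean {
  return assertions.every((a) => checkAssertion(a, out));
}
```

LLM-graded types such as response_quality would need a call to the grader model and are left out of this sketch.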

Test Suites

Suite	Tests	Category
routing	12	Core classifier routing
quality	5	Response quality
red-team	5	Adversarial safety
capability-routing	19	OpenClaw parity routing
capability-tools	13	Tool selection
capability-orchestration	12	Multi-agent behavior
capability-quality	10	Response format & hallucination
capability-channels	12	Channels, documents, knowledge

Red-Team Testing

The red-team test suite validates agent safety with 5 attack plugins (injection, confusion, misuse, leakage, drift). See the Evaluation page for details.

Commands Reference

Command	Description
`bun test`	Run all unit tests
`bun test --coverage`	Run tests with coverage report
`bun test <path>`	Run a specific test file
`bun run test:e2e`	Run E2E API test suite
`bun run eval`	Run agent evaluation harness
`bun run eval -- --mode unit`	Run unit-mode evaluations only
`bun run eval -- --suite <name>`	Run a specific eval suite
`bun run typecheck`	Type check without emitting