Test Suite

Continuous confidence for every bot.

Design test cases in plain English. Run them live against your real bot. Get pass/fail verdicts, track regressions, and debug every failure with a full trace — all from one workbench.

Ship with proof

Every prompt change, guardrail tweak, or KB update gets validated against your full test suite before it reaches users.

Catch regressions early

A passing test that suddenly fails is your earliest warning. Fix it before users ever see the broken behavior.

Debug with context

Every failed test links to its full trace. See stage-by-stage decisions, not just an error message.

How it works

Design. Categorize. Execute. Analyze. Debug.

Five stages run in sequence. Every test moves from idea to insight — with a full trace ready when something goes wrong.

The testing pipeline


01 · stage

Design

Write test cases in plain English

Create test cases using natural language. Define user messages, expected outputs, must-contain phrases, guardrail expectations, and more, with no code required. A test can cover a single message or a full multi-turn conversation flow.

How it works

  • Single-message or multi-turn conversation flows
  • Expected output, must-contain, and must-not-contain assertions
  • Priority levels: Low, Medium, High
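No code is needed to write a test in the workbench, but it can help to picture what a test case holds. The sketch below is an illustration only: the class name `BotTestCase`, its field names, and the case-insensitive substring matching are assumptions, not the product's actual schema or matching rules.

```python
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass
class BotTestCase:
    """Hypothetical shape of a test case; field names are illustrative."""
    name: str
    user_messages: list[str]  # one entry = single message, several = multi-turn
    expected_output: str = ""
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    priority: Priority = Priority.MEDIUM


def check_assertions(case: BotTestCase, reply: str) -> dict[str, bool]:
    """Evaluate the phrase assertions against a reply (assumed case-insensitive)."""
    text = reply.lower()
    return {
        "must_contain": all(p.lower() in text for p in case.must_contain),
        "must_not_contain": not any(p.lower() in text for p in case.must_not_contain),
    }


refund_case = BotTestCase(
    name="Refund request",
    user_messages=["I want a refund for order #4821"],
    must_contain=["refund"],
    must_not_contain=["cannot help"],
    priority=Priority.HIGH,
)
result = check_assertions(refund_case, "Your refund of $129.00 has been approved.")
```

Here `result` reports each assertion group separately, which mirrors the idea of a per-test verdict built from individual assertions.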

02 · stage

Categorize

Organize by intent, not by guesswork

Every test case gets a category and priority. Filter by Happy Path, Edge Case, Guardrail Test, Hallucination Probe, and 6 more. Know exactly what you're testing and why.

How it works

  • 10 built-in categories with color-coded badges
  • Custom tags for domain-specific groupings
  • Enable/disable tests without deleting them
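As a rough mental model of the points above, the ten built-in categories can be treated as a fixed enumeration, with custom tags and an enabled flag stored on each test. The names `CatalogEntry` and `select` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    HAPPY_PATH = "Happy Path"
    EDGE_CASE = "Edge Case"
    GUARDRAIL_TEST = "Guardrail Test"
    ESCALATION_TEST = "Escalation Test"
    TONE_CHECK = "Tone Check"
    POLICY_COMPLIANCE = "Policy Compliance"
    HALLUCINATION_PROBE = "Hallucination Probe"
    PII_CHECK = "PII Check"
    MULTI_TURN = "Multi-Turn"
    CUSTOM = "Custom"


@dataclass
class CatalogEntry:
    """Hypothetical record for one test in the suite."""
    name: str
    category: Category
    tags: set[str] = field(default_factory=set)
    enabled: bool = True  # disabled tests stay in the suite but are not selected


def select(entries, category=None, tag=None):
    """Return enabled entries matching the optional category and tag filters."""
    return [
        e for e in entries
        if e.enabled
        and (category is None or e.category is category)
        and (tag is None or tag in e.tags)
    ]


entries = [
    CatalogEntry("Refund happy path", Category.HAPPY_PATH, {"billing"}),
    CatalogEntry("Refund over limit", Category.GUARDRAIL_TEST, {"billing"}),
    CatalogEntry("Old promo code", Category.EDGE_CASE, {"promotions"}, enabled=False),
]
billing = select(entries, tag="billing")
guardrails = select(entries, category=Category.GUARDRAIL_TEST)
```

Keeping `enabled` as a flag rather than deleting the entry is what lets a test be switched back on later without rewriting it.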

03 · stage

Execute

Run live against your real bot

Hit Run and watch your bot respond to every test case in real time. The workbench streams each turn through the actual production pipeline — no mocks, no stubs.

How it works

  • Real-time streaming of every user/assistant turn
  • Abort any run mid-flight with one click
  • Runs through the same pipeline as live traffic
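The streaming and abort behavior can be pictured as a generator that yields one turn at a time and checks an abort flag between messages. This is a minimal sketch, not the workbench's implementation: `stream_run` is a hypothetical name, and `echo_bot` is a stand-in used only so the example runs on its own, whereas the real product sends each message through the production pipeline.

```python
import threading
from typing import Callable, Iterator


def stream_run(
    messages: list[str],
    bot: Callable[[str], str],
    abort: threading.Event,
) -> Iterator[tuple[str, str]]:
    """Yield (role, text) turns one by one; stop before the next message if aborted."""
    for message in messages:
        if abort.is_set():
            return
        yield ("user", message)
        yield ("assistant", bot(message))


def echo_bot(message: str) -> str:
    # Stand-in for illustration only; not how the real bot is called.
    return f"You said: {message}"


abort = threading.Event()
turns = []
for role, text in stream_run(["hi", "refund please", "thanks"], echo_bot, abort):
    turns.append((role, text))
    if text == "You said: refund please":
        abort.set()  # simulate clicking Abort after the second reply
```

Because the flag is checked before each message, the third message is never sent once the run is aborted.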

04 · stage

Analyze

Scores, trends, and regressions

Each run produces a pass rate, average score, and per-test verdict. Track trends over time, spot regressions after prompt changes, and measure the impact of every improvement.

How it works

  • Pass rate + average score per run
  • Full chronological run history
  • Category-level breakdowns for targeted fixes
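The run metrics listed above reduce to simple arithmetic over per-test results. The sketch below assumes a score between 0.0 and 1.0 and counts every result in the denominator; the page does not state the score range or how skipped tests are counted, so both are assumptions, as are the names `Result` and `summarize`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical per-test result record."""
    test_name: str
    category: str
    verdict: str  # "pass", "fail", ...
    score: float  # assumed to lie between 0.0 and 1.0


def summarize(results: list[Result]) -> dict:
    """Compute pass rate, average score, and a per-category pass rate."""
    total = len(results)
    passed = sum(r.verdict == "pass" for r in results)
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        counts[r.category][1] += 1
        if r.verdict == "pass":
            counts[r.category][0] += 1
    return {
        "pass_rate": passed / total if total else 0.0,
        "average_score": sum(r.score for r in results) / total if total else 0.0,
        "by_category": {c: p / t for c, (p, t) in counts.items()},
    }


summary = summarize([
    Result("greeting", "Happy Path", "pass", 1.0),
    Result("order status", "Happy Path", "pass", 0.5),
    Result("refund over limit", "Guardrail Test", "fail", 0.0),
    Result("blocked topic", "Guardrail Test", "pass", 0.5),
])
```

In this example the overall pass rate is 0.75, while the Guardrail Test category sits at 0.5, which is the kind of gap a category-level breakdown is meant to expose.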

05 · stage

Debug

One click to the full trace

Every failed test links straight to its trace. See stage detection, action selection, verification results, and guardrail decisions — end to end, replayable.

How it works

  • Deep-link from any result to its full trace
  • Stage-by-stage replay of the decision pipeline
  • Comment and annotate failures for your team
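One way to think about a trace is as an ordered list of steps, one per pipeline stage, where debugging starts at the first step that did not succeed. The structure below is a sketch under that assumption; `TraceStep`, its fields, and the stage labels are hypothetical and do not describe the product's actual trace format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceStep:
    """Hypothetical record of one decision in the pipeline."""
    stage: str     # e.g. "stage_detection", "action_selection"
    decision: str  # what the pipeline decided at this step
    ok: bool       # whether the step succeeded


def first_failure(steps: list[TraceStep]) -> Optional[TraceStep]:
    """Return the earliest step that did not succeed, or None if all succeeded."""
    return next((s for s in steps if not s.ok), None)


trace = [
    TraceStep("stage_detection", "refund_request", True),
    TraceStep("action_selection", "lookup_order", True),
    TraceStep("verification", "order_exists", False),
    TraceStep("guardrail", "refund_policy", True),
]
failed = first_failure(trace)
```

Walking the steps in order is the programmatic equivalent of a stage-by-stage replay: the first failing step is usually the most useful place to start reading.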

Test category wheel

10 categories

Happy Path

Standard successful interaction flows. Verify your bot handles the most common scenarios flawlessly.

Edge Case

Unusual or boundary-condition inputs. Catch the weird questions, typos, and unexpected phrasings.

Guardrail Test

Verify guardrails trigger correctly. Ensure blocks, warnings, and escalations fire at the right moments.

Escalation Test

Verify escalation conditions work. Confirm your bot knows when to hand off to a human operator.

Tone Check

Validate response tone matches brand. Keep your bot friendly, professional, empathetic — on brand.

Policy Compliance

Ensure policy adherence. Test that your bot respects refund policies, privacy rules, and legal constraints.

Hallucination Probe

Test if your bot makes up facts. Catch confabulated product specs, fake policies, and invented answers.

PII Check

Ensure PII is not leaked. Verify your bot never echoes back emails, phone numbers, or sensitive data.

Multi-Turn

Multi-message conversation flows. Test context retention across 3, 5, 10 turns and beyond.

Custom

User-defined test categories. Build your own taxonomy for domain-specific validation scenarios.

Verdict scoring

Five outcomes, one clear signal.

Pass

Met all assertions

Fail

Missed at least one assertion

Partial

Some assertions passed

Error

Runtime or network failure

Skipped

Test disabled or filtered out
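The five outcomes can be sketched as a small decision function. The page does not spell out how Fail and Partial are told apart, so this example assumes Partial means some but not all assertions passed and Fail means none did. That rule, and the name `decide`, are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    PARTIAL = "Partial"
    ERROR = "Error"
    SKIPPED = "Skipped"


def decide(
    assertions: list[bool],
    enabled: bool = True,
    error: Optional[Exception] = None,
) -> Verdict:
    """Map assertion results to a verdict under the assumed rules."""
    if not enabled:
        return Verdict.SKIPPED  # test disabled or filtered out
    if error is not None:
        return Verdict.ERROR    # runtime or network failure
    if all(assertions):
        return Verdict.PASS     # note: an empty list also counts as a pass here
    if any(assertions):
        return Verdict.PARTIAL  # assumed: some, but not all, assertions passed
    return Verdict.FAIL         # assumed: no assertion passed


verdicts = [
    decide([True, True]),
    decide([True, False]),
    decide([False, False]),
    decide([True], error=TimeoutError("bot did not respond")),
    decide([True], enabled=False),
]
```

Checking `enabled` and `error` before the assertions means a disabled or crashed test is never reported as a pass or a fail.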

Execution

Run live. No mocks. No stubs.

Every test case hits your real bot through the production pipeline. You see every turn as it happens.

Live execution stream

Every turn visible as it happens.

User: I want a refund for order #4821
Bot: I can help with that. Let me pull up order #4821...
Guardrail: refund_policy check → PASS
Verification: order_exists → PASS
Bot: Your refund of $129.00 has been approved.
Run progress: 5/5 turns

Answers

What teams usually ask

Do I need to write code to create tests?

No. Test cases are written in plain English. You describe the user message, what the bot should say, what it must include, and what it must avoid. The workbench handles the rest.

Can I test multi-turn conversations?

Yes. The Multi-Turn category is built for this, and every test case supports a full message thread — not just a single prompt and response.

How does a test link to a trace?

Every test run generates a trace ID. Clicking Open Trace from any result jumps straight to the full decision pipeline — stage detection, action selection, verification, and guardrails — all replayable.

What happens when I change a prompt?

Re-run your test suite after any prompt, guardrail, or KB change. The pass rate comparison shows exactly what improved and what regressed.
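Conceptually, the comparison between two runs is a diff of per-test verdicts. The sketch below assumes each run can be reduced to a mapping from test name to a verdict string; that representation and the name `compare_runs` are hypothetical.

```python
def compare_runs(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """List tests that stopped passing and tests that started passing."""
    regressed = sorted(
        name for name, verdict in before.items()
        if verdict == "pass" and after.get(name) not in (None, "pass")
    )
    improved = sorted(
        name for name, verdict in before.items()
        if verdict != "pass" and after.get(name) == "pass"
    )
    return {"regressed": regressed, "improved": improved}


run_before_change = {"greeting": "pass", "refund": "pass", "pii leak": "fail"}
run_after_change = {"greeting": "pass", "refund": "fail", "pii leak": "pass"}
diff = compare_runs(run_before_change, run_after_change)
```

Tests missing from the later run are ignored here rather than counted as regressions, since a test may simply have been disabled between runs.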

Start testing in minutes

Build confidence with every release.

Create your first test case in under a minute. No setup, no scripting — just describe what you want to test and hit Run.