Continuous confidence
for every bot.
Design test cases in plain English. Run them live against your real bot. Get pass/fail verdicts, track regressions, and debug every failure with a full trace — all from one workbench.
Ship with proof
Every prompt change, guardrail tweak, or KB update gets validated against your full test suite before it reaches users.
Catch regressions early
A passing test that suddenly fails is your earliest warning. Fix it before users ever see the broken behavior.
Debug with context
Every failed test links to its full trace. See stage-by-stage decisions, not just an error message.
How it works
Design. Categorize. Execute. Analyze. Debug.
Five stages run in sequence. Every test moves from idea to insight — with a full trace ready when something goes wrong.
The testing pipeline
Design
Write test cases in plain English
Categorize
Organize by intent, not by guesswork
Execute
Run live against your real bot
Analyze
Scores, trends, and regressions
Debug
One click to the full trace
Design
Write test cases in plain English
Create test cases using natural language. Define user messages, expected outputs, must-contain phrases, guardrail expectations, and more — no code required. Supports single-message and full multi-turn conversation flows.
How it works
- Single-message or multi-turn conversation flows
- Expected output, must-contain, and must-not-contain assertions
- Priority levels: Low, Medium, High
Categorize
Organize by intent, not by guesswork
Every test case gets a category and priority. Filter by Happy Path, Edge Case, Guardrail, Hallucination Probe, and 6 more. Know exactly what you're testing and why.
How it works
- 10 built-in categories with color-coded badges
- Custom tags for domain-specific groupings
- Enable/disable tests without deleting them
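Filtering a suite by category while skipping disabled tests is simple to picture. A sketch assuming a hypothetical list-of-dicts data model — the category names are the built-in ten, but the data shapes are illustrative:

```python
# The ten built-in categories; the surrounding data model is hypothetical.
CATEGORIES = ["Happy Path", "Edge Case", "Guardrail Test", "Escalation Test",
              "Tone Check", "Policy Compliance", "Hallucination Probe",
              "PII Check", "Multi-Turn", "Custom"]

tests = [
    {"name": "refund quote",   "category": "Policy Compliance",  "enabled": True},
    {"name": "made-up specs",  "category": "Hallucination Probe", "enabled": True},
    {"name": "old tone check", "category": "Tone Check",          "enabled": False},
]

def runnable(tests, category=None):
    """Disabled tests stay in the suite but are excluded from runs."""
    return [t for t in tests
            if t["enabled"] and (category is None or t["category"] == category)]
```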
Execute
Run live against your real bot
Hit Run and watch your bot respond to every test case in real time. The workbench streams each turn through the actual production pipeline — no mocks, no stubs.
How it works
- Real-time streaming of every user/assistant turn
- Abort any run mid-flight with one click
- Runs through the same pipeline as live traffic
Analyze
Scores, trends, and regressions
Each run produces a pass rate, average score, and per-test verdict. Track trends over time, spot regressions after prompt changes, and measure the impact of every improvement.
How it works
- Pass rate + average score per run
- Full chronological run history
- Category-level breakdowns for targeted fixes
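The headline numbers are easy to define precisely. A sketch of pass-rate and regression detection, assuming a hypothetical mapping from test name to verdict string — not the workbench's actual API:

```python
# Hypothetical helpers -- the verdict strings and data shapes are illustrative.
def pass_rate(verdicts: list[str]) -> float:
    """Percentage of non-skipped tests that passed."""
    scored = [v for v in verdicts if v != "Skipped"]
    return 100 * sum(v == "Pass" for v in scored) / len(scored) if scored else 0.0

def regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Tests that passed in the previous run but no longer do."""
    return [name for name, v in after.items()
            if before.get(name) == "Pass" and v != "Pass"]

before = {"refund": "Pass", "tone": "Pass", "pii": "Fail"}
after  = {"refund": "Pass", "tone": "Fail", "pii": "Pass"}
print(regressions(before, after))  # ['tone'] -- the change broke the tone test
```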
Debug
One click to the full trace
Every failed test links straight to its trace. See stage detection, action selection, verification results, and guardrail decisions — end to end, replayable.
How it works
- Deep-link from any result to its full trace
- Stage-by-stage replay of the decision pipeline
- Comment and annotate failures for your team
Test category wheel
10 categories
Happy Path
Standard successful interaction flows. Verify your bot handles the most common scenarios flawlessly.
Edge Case
Unusual or boundary-condition inputs. Catch the weird questions, typos, and unexpected phrasings.
Guardrail Test
Verify guardrails trigger correctly. Ensure blocks, warnings, and escalations fire at the right moments.
Escalation Test
Verify escalation conditions work. Confirm your bot knows when to hand off to a human operator.
Tone Check
Validate response tone matches brand. Keep your bot friendly, professional, empathetic — on brand.
Policy Compliance
Ensure policy adherence. Test that your bot respects refund policies, privacy rules, and legal constraints.
Hallucination Probe
Test if your bot makes up facts. Catch confabulated product specs, fake policies, and invented answers.
PII Check
Ensure PII is not leaked. Verify your bot never echoes back emails, phone numbers, or sensitive data.
Multi-Turn
Multi-message conversation flows. Test context retention across 3, 5, 10 turns and beyond.
Custom
User-defined test categories. Build your own taxonomy for domain-specific validation scenarios.
Verdict scoring
Five outcomes, one clear signal.
Pass
Met all assertions
Fail
Missed at least one assertion
Partial
Some assertions passed
Error
Runtime or network failure
Skipped
Test disabled or filtered out
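The mapping from assertion results to a verdict can be sketched in a few lines. This is one plausible reading — Fail when no assertion passed, Partial when results are mixed — and illustrative logic, not the workbench's actual scoring code:

```python
# Illustrative verdict logic -- not the workbench's actual implementation.
def verdict(assertion_results: list[bool],
            enabled: bool = True,
            runtime_error: bool = False) -> str:
    if not enabled:
        return "Skipped"    # test disabled or filtered out
    if runtime_error:
        return "Error"      # runtime or network failure
    if all(assertion_results):
        return "Pass"       # met all assertions
    if any(assertion_results):
        return "Partial"    # some assertions passed
    return "Fail"           # no assertion passed
```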
Execution
Run live. No mocks. No stubs.
Every test case hits your real bot through the production pipeline. You see every turn as it happens.
Live execution stream
Every turn visible as it happens.
Answers
What teams usually ask
Do I need to write code to create tests?
No. Test cases are written in plain English. You describe the user message, what the bot should say, what it must include, and what it must avoid. The workbench handles the rest.
Can I test multi-turn conversations?
Yes. The Multi-Turn category is built for this, and every test case supports a full message thread — not just a single prompt and response.
How does a test link to a trace?
Every test run generates a trace ID. Clicking Open Trace from any result jumps straight to the full decision pipeline — stage detection, action selection, verification, and guardrails — all replayable.
What happens when I change a prompt?
Re-run your test suite after any prompt, guardrail, or KB change. The pass rate comparison shows exactly what improved and what regressed.
Start testing in minutes
Build confidence with every release.
Create your first test case in under a minute. No setup, no scripting — just describe what you want to test and hit Run.
