Test Suite
Test every bot. Everytime.
Create your first test case in under a minute. No setup, no scripting — just describe what should happen and let the test runner validate it end-to-end.
Ship with proof
Every prompt change, guardrail tweak, or KB update gets validated against your full test suite before it reaches users.
Catch regressions early
A passing test that suddenly fails is your earliest warning. Fix it before users ever see the broken behavior.
Debug with context
Every failed test links to its full trace. See stage-by-stage decisions, not just an error message.
How it works
Design. Categorize. Execute. Analyze. Debug.
Five stages run in sequence. Every test moves from idea to insight — with a full trace ready when something goes wrong.
Live regression matrix
scenarios × runs · pass / fail / driftPass
deterministic + LLM judge
Fail
with full transcript
Drift
regression vs last green
p95
latency budget tracked
Design
Write test cases in plain English
Create test cases using natural language. Define user messages, expected outputs, must-contain phrases, guardrail expectations, and more — no code required. Support single-message or full multi-turn conversation flows.
How it works
- Single-message or multi-turn conversation flows
- Expected output, must-contain, and must-not-contain assertions
- Priority levels: Low, Medium, High
Categorize
Organize by intent, not by guesswork
Every test case gets a category and priority. Filter by Happy Path, Edge Case, Guardrail, Hallucination Probe, and 6 more. Know exactly what you're testing and why.
How it works
- 10 built-in categories with color-coded badges
- Custom tags for domain-specific groupings
- Enable/disable tests without deleting them
Execute
Run live against your real bot
Hit Run and watch your bot respond to every test case in real time. The workbench streams each turn through the actual production pipeline — no mocks, no stubs.
How it works
- Real-time streaming of every user/assistant turn
- Abort any run mid-flight with one click
- Runs through the same pipeline as live traffic
Analyze
Scores, trends, and regressions
Each run produces a pass rate, average score, and per-test verdict. Track trends over time, spot regressions after prompt changes, and measure the impact of every improvement.
How it works
- Pass rate + average score per run
- Full chronological run history
- Category-level breakdowns for targeted fixes
Debug
One click to the full trace
Every failed test links straight to its trace. See stage detection, action selection, verification results, and guardrail decisions — end to end, replayable.
How it works
- Deep-link from any result to its full trace
- Stage-by-stage replay of the decision pipeline
- Comment and annotate failures for your team
Test category wheel
10 categoriesTest Core
Happy Path
Standard successful interaction flows. Verify your bot handles the most common scenarios flawlessly.
Edge Case
Unusual or boundary-condition inputs. Catch the weird questions, typos, and unexpected phrasings.
Guardrail Test
Verify guardrails trigger correctly. Ensure blocks, warnings, and escalations fire at the right moments.
Escalation Test
Verify escalation conditions work. Confirm your bot knows when to hand off to a human operator.
Tone Check
Validate response tone matches brand. Keep your bot friendly, professional, empathetic — on brand.
Policy Compliance
Ensure policy adherence. Test that your bot respects refund policies, privacy rules, and legal constraints.
Hallucination Probe
Test if your bot makes up facts. Catch confabulated product specs, fake policies, and invented answers.
PII Check
Ensure PII is not leaked. Verify your bot never echoes back emails, phone numbers, or sensitive data.
Multi-Turn
Multi-message conversation flows. Test context retention across 3, 5, 10 turns and beyond.
Custom
User-defined test categories. Build your own taxonomy for domain-specific validation scenarios.
Verdict scoring
Five outcomes, one clear signal.
Pass
Met all assertions
Fail
Missed at least one assertion
Partial
Some assertions passed
Error
Runtime or network failure
Skipped
Test disabled or filtered out
Execution
Run live. No mocks. No stubs.
Every test case hits your real bot through the production pipeline. You see every turn as it happens.
Live execution stream
Every turn visible as it happens.
Answers
What teams usually ask
Do I need to write code to create tests?
No. Test cases are written in plain English. You describe the user message, what the bot should say, what it must include, and what it must avoid. The workbench handles the rest.
Can I test multi-turn conversations?
Yes. The Multi-Turn category is built for this, and every test case supports a full message thread — not just a single prompt and response.
How does a test link to a trace?
Every test run generates a trace ID. Clicking Open Trace from any result jumps straight to the full decision pipeline — stage detection, action selection, verification, and guardrails — all replayable.
What happens when I change a prompt?
Re-run your test suite after any prompt, guardrail, or KB change. The pass rate comparison shows exactly what improved and what regressed.
Deep dive
What evaluation should look like for AI agents
Scenarios are tests, not vibes
A scenario in Rylvo is a structured test: starting message, expected behaviors, allowed tools, forbidden actions, and pass criteria. Verdicts come from a mix of deterministic checks and an LLM judge — not from a human eyeballing a transcript every release. That's how regressions get caught at PR time, not in production.
Every failure is reproducible
When a test fails, you don't get a summary — you get the full transcript, the prompt that produced it, the tools called, the KB chunks retrieved, and the judge's reasoning. One click promotes a real production incident into a permanent regression test so the same bug never ships twice.
Test on every change, not on demand
Prompt edits, KB updates, model upgrades, connector changes — they all trigger the suite automatically. The diff view shows exactly what improved, what regressed, and which scenarios sit on the edge so you can merge or roll back with real evidence instead of optimism.
Adversarial coverage, not just happy paths
The mutator generates jailbreak variants, paraphrases, multilingual versions, and prompt-injection probes for every scenario you write — so a small handful of human-authored tests becomes a thick safety net of hundreds of adversarial runs that catch real-world regressions.
Start testing in minutes
Build confidence with every release.
Create your first test case in under a minute. No setup, no scripting — just describe what you want to test and hit Run.
