Rylvo, Inc.

Test Suite

Continuous confidence
for every bot.

Q: What test categories does Rylvo support?

Rylvo includes 10 built-in test categories: Happy Path, Edge Case, Guardrail Test, Escalation Test, Tone Check, Policy Compliance, Hallucination Probe, PII Check, Multi-Turn, and Custom.

Q: Can I run tests in real time?

Yes. The Test Suite Workbench streams every user/assistant turn in real time using live chat simulation. You watch the bot respond to each test case exactly as a real user would experience it.

Q: How are test results scored?

Each test run produces a verdict (Pass, Fail, Partial, Error, or Skipped) and a numerical score. Results are tracked over time so you can spot regressions and measure improvement.

Design test cases in plain English. Run them live against your real bot. Get pass/fail verdicts, track regressions, and debug every failure with a full trace — all from one workbench.


Ship with proof

Every prompt change, guardrail tweak, or KB update gets validated against your full test suite before it reaches users.

Catch regressions early

A passing test that suddenly fails is your earliest warning. Fix it before users ever see the broken behavior.

Debug with context

Every failed test links to its full trace. See stage-by-stage decisions, not just an error message.

How it works

Design. Categorize. Execute. Analyze. Debug.

Five stages run in sequence. Every test moves from idea to insight — with a full trace ready when something goes wrong.

The testing pipeline

live · real-time

01

Design

Write test cases in plain English

02

Categorize

Organize by intent, not by guesswork

03

Execute

Run live against your real bot

04

Analyze

Scores, trends, and regressions

05

Debug

One click to the full trace

01 · stage

Design

Write test cases in plain English

Create test cases using natural language. Define user messages, expected outputs, must-contain phrases, guardrail expectations, and more, with no code required. Test cases support single-message or full multi-turn conversation flows.

How it works

Single-message or multi-turn conversation flows
Expected output, must-contain, and must-not-contain assertions
Priority levels: Low, Medium, High
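
As a rough illustration of the assertion types listed above, the sketch below models a test case and checks one bot reply against must-contain and must-not-contain phrases. The `BotTestCase` class, its field names, and the `check_reply` helper are hypothetical and are not Rylvo's actual schema or API. Case-insensitive substring matching is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class BotTestCase:
    """Hypothetical shape of a test case; not Rylvo's real schema."""
    name: str
    user_messages: list[str]  # one entry = single message, several = multi-turn
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    priority: str = "Medium"  # Low, Medium, or High


def check_reply(case: BotTestCase, reply: str) -> dict[str, bool]:
    """Return one pass/fail flag per assertion (case-insensitive substring match)."""
    text = reply.lower()
    results = {}
    for phrase in case.must_contain:
        results[f"must_contain:{phrase}"] = phrase.lower() in text
    for phrase in case.must_not_contain:
        results[f"must_not_contain:{phrase}"] = phrase.lower() not in text
    return results


case = BotTestCase(
    name="refund request",
    user_messages=["I want a refund for order #4821"],
    must_contain=["refund"],
    must_not_contain=["cannot help"],
)
flags = check_reply(case, "Your refund of $129.00 has been approved.")
```

Here both assertions hold for the sample reply, so every flag in `flags` is true.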

02 · stage

Categorize

Organize by intent, not by guesswork

Every test case gets a category and priority. Filter by Happy Path, Edge Case, Guardrail Test, Hallucination Probe, and 6 more. Know exactly what you're testing and why.

How it works

10 built-in categories with color-coded badges
Custom tags for domain-specific groupings
Enable/disable tests without deleting them

03 · stage

Execute

Run live against your real bot

Hit Run and watch your bot respond to every test case in real time. The workbench streams each turn through the actual production pipeline — no mocks, no stubs.

How it works

Real-time streaming of every user/assistant turn
Abort any run mid-flight with one click
Runs through the same pipeline as live traffic

04 · stage

Analyze

Scores, trends, and regressions

Each run produces a pass rate, average score, and per-test verdict. Track trends over time, spot regressions after prompt changes, and measure the impact of every improvement.

How it works

Pass rate + average score per run
Full chronological run history
Category-level breakdowns for targeted fixes
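
To make the run metrics above concrete, here is a minimal sketch of how a pass rate, an average score, and a category-level breakdown could be computed from per-test results. The tuple layout, the 0 to 1 score scale, and the choice to exclude Skipped tests from the denominators are all assumptions for illustration. The page does not specify how Rylvo calculates these figures.

```python
from collections import defaultdict

# Hypothetical run results: (category, verdict, score).
results = [
    ("Happy Path", "Pass", 1.0),
    ("Happy Path", "Fail", 0.25),
    ("PII Check", "Pass", 1.0),
    ("Edge Case", "Skipped", None),
]


def summarize(results):
    """Pass rate and average score over tests that ran (Skipped excluded)."""
    ran = [r for r in results if r[1] != "Skipped"]
    if not ran:
        return {"pass_rate": 0.0, "average_score": 0.0}
    passed = sum(1 for r in ran if r[1] == "Pass")
    scores = [r[2] for r in ran if r[2] is not None]
    average = sum(scores) / len(scores) if scores else 0.0
    return {"pass_rate": passed / len(ran), "average_score": average}


def pass_rate_by_category(results):
    """Pass rate per category, again ignoring Skipped tests."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, ran]
    for category, verdict, _score in results:
        if verdict == "Skipped":
            continue
        totals[category][1] += 1
        if verdict == "Pass":
            totals[category][0] += 1
    return {cat: passed / ran for cat, (passed, ran) in totals.items()}


summary = summarize(results)
by_category = pass_rate_by_category(results)
```

With this sample data, two of the three executed tests pass, and the Happy Path category shows the lower pass rate, which is the kind of signal a category breakdown is meant to surface.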

05 · stage

Debug

One click to the full trace

Every failed test links straight to its trace. See stage detection, action selection, verification results, and guardrail decisions — end to end, replayable.

How it works

Deep-link from any result to its full trace
Stage-by-stage replay of the decision pipeline
Comment and annotate failures for your team

Test category wheel

10 categories

Test Core

Happy Path

Standard successful interaction flows. Verify your bot handles the most common scenarios flawlessly.

Edge Case

Unusual or boundary-condition inputs. Catch the weird questions, typos, and unexpected phrasings.

Guardrail Test

Verify guardrails trigger correctly. Ensure blocks, warnings, and escalations fire at the right moments.

Escalation Test

Verify escalation conditions work. Confirm your bot knows when to hand off to a human operator.

Tone Check

Validate response tone matches brand. Keep your bot friendly, professional, empathetic — on brand.

Policy Compliance

Ensure policy adherence. Test that your bot respects refund policies, privacy rules, and legal constraints.

Hallucination Probe

Test if your bot makes up facts. Catch confabulated product specs, fake policies, and invented answers.

PII Check

Ensure PII is not leaked. Verify your bot never echoes back emails, phone numbers, or sensitive data.

Multi-Turn

Multi-message conversation flows. Test context retention across 3, 5, or 10 turns and beyond.

Custom

User-defined test categories. Build your own taxonomy for domain-specific validation scenarios.

Verdict scoring

Five outcomes, one clear signal.

Pass

Met all assertions

Fail

Missed at least one assertion

Partial

Some assertions passed

Error

Runtime or network failure

Skipped

Test disabled or filtered out
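
The five outcomes can be sketched as a small decision function. This is one possible reading, not Rylvo's documented logic: the one-line definitions of Fail and Partial above overlap, so the sketch assumes Partial means some but not all assertions passed and Fail means none passed. The `Verdict` enum and `derive_verdict` function are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    PARTIAL = "Partial"
    ERROR = "Error"
    SKIPPED = "Skipped"


def derive_verdict(assertions: list[bool], enabled: bool = True, errored: bool = False) -> Verdict:
    """Map assertion outcomes to a verdict under the assumptions stated above."""
    if not enabled:
        return Verdict.SKIPPED   # test disabled or filtered out
    if errored:
        return Verdict.ERROR     # runtime or network failure
    if all(assertions):
        return Verdict.PASS      # note: an empty assertion list also passes here
    if any(assertions):
        return Verdict.PARTIAL   # assumed: some, but not all, assertions passed
    return Verdict.FAIL          # assumed: no assertion passed
```

Skipped and Error are checked first because neither depends on assertion results.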

Execution

Run live. No mocks. No stubs.

Every test case hits your real bot through the production pipeline. You see every turn as it happens.

Live execution stream

Every turn visible as it happens.

User: I want a refund for order #4821

A: I can help with that. Let me pull up order #4821...

Guardrail: refund_policy check → PASS

Verification: order_exists → PASS

A: Your refund of $129.00 has been approved.

Run progress: 5/5 turns

Answers

What teams usually ask

Do I need to write code to create tests?

No. Test cases are written in plain English. You describe the user message, what the bot should say, what it must include, and what it must avoid. The workbench handles the rest.

Can I test multi-turn conversations?

Yes. The Multi-Turn category is built for this, and every test case supports a full message thread — not just a single prompt and response.

How does a test link to a trace?

Every test run generates a trace ID. Clicking Open Trace from any result jumps straight to the full decision pipeline — stage detection, action selection, verification, and guardrails — all replayable.

What happens when I change a prompt?

Re-run your test suite after any prompt, guardrail, or KB change. The pass rate comparison shows exactly what improved and what regressed.
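
The comparison described above can be pictured as a diff between two runs. The sketch below is illustrative only: it assumes each run is available as a mapping from test name to verdict string, which is a hypothetical format, and `compare_runs` is not a Rylvo function.

```python
def compare_runs(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """List tests that stopped passing (regressed) or started passing (improved)."""
    regressed, improved = [], []
    # Only tests present in both runs can be compared.
    for name in sorted(before.keys() & after.keys()):
        was_pass = before[name] == "Pass"
        is_pass = after[name] == "Pass"
        if was_pass and not is_pass:
            regressed.append(name)
        elif not was_pass and is_pass:
            improved.append(name)
    return {"regressed": regressed, "improved": improved}


# Hypothetical verdicts before and after a prompt change.
before = {"refund flow": "Pass", "pii echo": "Fail", "tone": "Pass"}
after = {"refund flow": "Fail", "pii echo": "Pass", "tone": "Pass"}
diff = compare_runs(before, after)
```

In this example the prompt change fixes "pii echo" but breaks "refund flow", so the overall pass rate is unchanged even though behavior shifted. A per-test diff catches what a single pass-rate number would hide.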

Start testing in minutes

Build confidence with every release.

Create your first test case in under a minute. No setup, no scripting — just describe what you want to test and hit Run.

