RylvoRylvo

Bot Test Suite: Production-Grade QA for Your AI Agents

Rylvo TeamMay 17, 2026300 min read

Bot Test Suite: Production-Grade QA for Your AI Agents

Deploying an AI bot to production without testing it first is like releasing a mobile app without ever opening it on a real phone. You might have written beautiful code, designed a sleek interface, and chosen the best model — but until you actually watch it respond to real questions, you have no idea if it works the way you expect.

The challenge with AI agents is that traditional software testing does not translate well. You cannot write a simple unit test that says assertEquals(bot.respond(), "Hello!") because the bot's response changes based on the conversation history, the user's tone, the current prompt version, whether a guardrail triggered, and even the temperature setting of the LLM. Every conversation is slightly different, which makes deterministic testing feel impossible.

Rylvo Bot Test Suite solves this. It is a built-in quality-assurance system that treats your AI agent like production software — because that is exactly what it is. Instead of guessing whether your bot behaves correctly, you define conversation scenarios, run them against your actual bot with its real prompts and guardrails, and get scored pass or fail results plus AI-generated suggestions for improvement.

What makes the Bot Test Suite special is that it does not approximate. It does not use a mock LLM or a simplified version of your bot. It runs inside the exact same production runtime that handles your live customer conversations. When a test passes, it is a genuine guarantee that your bot will respond the same way in production.

In this comprehensive guide, we will walk through everything you need to know about testing your AI agents with confidence. We will explain how the Test Suite works, why it is fundamentally different from external testing tools, how to build effective test cases, how scoring works, how AI-generated fix suggestions help you improve faster, and how to integrate testing into your daily development workflow.


Why Traditional QA Tools Fail for AI Agents

Before we dive into what the Bot Test Suite does, let us understand why most existing testing approaches do not work well for conversational AI.

Think about how you would test a traditional web application. You have a login form, so you write a test that fills in the username and password, clicks the submit button, and checks that the dashboard page loads. The test is deterministic — same input always produces the same output. If the test passes, you are confident the login works.

Now think about testing a customer support bot. A user asks, "I need help with my order." The bot might respond with a greeting, ask for the order number, offer to look up the account, or redirect to a human agent — depending on the prompt, the guardrails, the knowledge base, and the conversation history. The "correct" answer is not a single fixed string. It is a response that is helpful, accurate, safe, and on-brand.

This is where traditional testing tools fall apart. They were designed for software with fixed inputs and outputs, not for fluid, context-dependent conversations. Here are the specific problems:

Most external testing tools mock the LLM. They replace your actual language model with a simplified fake that returns pre-written responses. This is fast and cheap, but it tells you nothing about how your real bot will behave when a real customer asks a real question. It is like testing a car by sitting in a cardboard box painted to look like a car — you are not actually driving.

External tools cannot see your guardrails. Guardrails are safety rules that prevent your bot from saying things it should not say — like promising refunds it cannot deliver, sharing private information, or giving medical advice. These rules live inside your bot's runtime, not in the external testing tool. So the tool tests your bot's responses without knowing whether a guardrail should have blocked them. It is like a referee watching a football game without knowing the rules — they cannot tell you if a play was legal.

Most tools only test single-turn conversations. A single-turn test sends one message and checks one response. But real conversations are multi-turn. The user asks a question, the bot responds, the user follows up, the bot responds again, and the context builds with every exchange. Testing only the first turn ignores everything that happens afterward. It is like reading the first page of a book and declaring the whole story good.

External tools have no access to your traces. When a test fails, you need to know exactly what happened inside the bot — what prompt was used, what guardrails were evaluated, what the LLM actually returned, how long it took. External testing tools operate in isolation, so you get a failure message with no way to investigate. It is like a doctor telling you that you are sick but refusing to explain why.

External tools use sandbox models, not your production model. If you are using GPT-4 for your bot but the testing tool uses a cheaper model, the test results are not representative. Different models have different strengths, weaknesses, and response patterns. Testing with a different model is like rehearsing a play with a different actor — the performance will not match.

The Rylvo Bot Test Suite was built specifically to solve all of these problems. It runs inside your production runtime, exercises real LLM calls with your actual model, evaluates guardrails natively, supports full multi-turn conversations, links every result to detailed traces, and uses the same BYOK (Bring Your Own Key) configuration your bot uses live.


How the Bot Test Suite Works: A Step-by-Step Overview

The Bot Test Suite is designed to feel natural and intuitive, even if you have never written a test before. Here is how the full workflow looks from start to finish.

Step 1: Define Your Conversation Scenarios

A test case is simply a conversation you want your bot to have. You write it the same way you would describe a conversation to a colleague — a series of messages back and forth between a user and the bot. In the Test Suite interface, you toggle each step between "User" and "Assistant." User steps are the questions or statements your test will send to the bot. Assistant steps are the responses you expect the bot to give.

For example, a simple test case for a refund bot might look like this:

User: "I want to return my order." Assistant: "I'd be happy to help with your return. Could you please provide your order number?"

This is a single-turn test — one user message, one expected bot response. You can add more turns to make it a multi-turn conversation:

User: "I want to return my order." Assistant: "I'd be happy to help with your return. Could you please provide your order number?" User: "It is ORD-12345." Assistant: "Thank you. I have found your order. Our return policy allows returns within 30 days. Your order was placed 15 days ago, so you are eligible for a full refund."

When you run this test, the Test Suite sends the first user message to your bot, captures the actual response, compares it to the expected response, then sends the second user message with the full conversation history included, captures the second response, and compares it again. This ensures that every turn is tested in the authentic context of the real conversation.

Step 2: Add Assertions to Lock Down Behavior

Writing the conversation is the foundation, but assertions are where testing becomes powerful. Assertions are hard rules that the bot must follow, regardless of how creative or varied its response might be.

There are three types of assertions in the Bot Test Suite.

Must Contain assertions specify phrases that must appear in the bot's response. For the refund example above, you might add a mustContain assertion for "order number" because the bot should always ask for the order number when a user wants to return something. If the bot responds with "Sure, I can help" but forgets to ask for the order number, the test fails immediately. This is useful for ensuring that critical information is never omitted — like mentioning a return policy, providing a contact link, or including a disclaimer.

Must Not Contain assertions specify phrases that must never appear in the bot's response. For the refund bot, you might add a mustNotContain assertion for "guaranteed refund" because your company's policy does not guarantee refunds — it depends on the timing and condition of the return. If the bot accidentally promises a guaranteed refund to make the customer happy, the test catches it. This is essential for preventing liability issues, policy violations, and off-brand messaging.

Expected Guardrail Action assertions specify whether a guardrail should trigger for a particular message. Guardrails are safety rules that watch for dangerous inputs and outputs. For example, if you have a guardrail that blocks requests for personal information, you might create a test case where the user asks, "What is my social security number?" and set the expectedGuardrailAction to "block." The test passes only if the guardrail actually blocks the request. If the guardrail fails to trigger, the test fails, alerting you to a safety gap.

Assertions are non-negotiable. Even if every other scoring dimension is perfect, a single assertion failure results in a fail verdict. This is by design — some behaviors are too important to allow any leeway.

Step 3: Categorize and Prioritize

Every test case gets a category and a priority. These are not just labels — they shape how your test suite behaves and how you interpret the results.

Categories group your tests by the type of scenario they cover. The six categories are color-coded with icons so you can see your coverage at a glance:

  • Happy Path (green, check-circle icon) covers the standard, successful interactions that represent the bulk of your bot's workload. These are the conversations that go exactly as planned — the user asks a normal question, and the bot gives a helpful, accurate answer. Happy path tests are your baseline. If these fail, something is fundamentally broken.

  • Edge Case (amber, alert-triangle icon) covers unusual or boundary-condition inputs. These are the weird questions, misspellings, ambiguous phrasing, off-topic requests, and partially complete sentences that real users actually type. Edge case tests ensure your bot does not fall apart when confronted with something unexpected. A user might type "wher is my oder" instead of "where is my order" — your bot should still understand and respond helpfully.

  • Guardrail Test (purple, shield icon) verifies that your safety rules trigger correctly. These tests intentionally push the boundaries of what your bot should and should not do. A guardrail test might send a message asking the bot to generate hate speech, and expect the guardrail to block it. Or it might send a safe message and expect the guardrail to do nothing. Guardrail tests are critical for compliance, safety, and brand protection.

  • Error Handling (red, x-circle icon) covers how your bot responds when things go wrong. What happens when a third-party API is down? What happens when the knowledge base connection fails? What happens when the user provides invalid data? Error handling tests ensure your bot degrades gracefully instead of crashing or confusing the user.

  • Multi-Turn (blue, message-square icon) covers complex conversations that span multiple back-and-forth exchanges. Most real-world interactions are not single questions and answers. A customer support conversation might involve five or more turns as the bot gathers information, checks records, and explains options. Multi-turn tests validate that your bot maintains context, remembers previous exchanges, and builds the conversation logically.

  • Custom (slate, tag icon) is a catch-all category for scenarios that do not fit the standard types. You might use this for integration-specific tests, seasonal campaigns, or experimental behaviors.

Priorities determine how important each test is. The four priority levels are weighted so that critical failures weigh more heavily in your overall quality assessment:

  • Critical (4x weight) is reserved for business-critical flows where failure has serious consequences. If your bot processes payments, provides medical information, handles legal queries, or manages safety-critical operations, those tests should be critical. A single critical failure should block deployment.

  • High (3x weight) covers important but not catastrophic flows. Core product features, brand voice consistency, and primary customer journeys fall here. These are the tests that define whether your bot is ready for customers.

  • Medium (2x weight) covers standard coverage for common questions and typical workflows. These tests ensure broad coverage without demanding perfection.

  • Low (1x weight) covers nice-to-have scenarios where graceful degradation is acceptable. Edge cases with minimal business impact, secondary features, and exploratory behaviors belong here.

Critical tests always run first, so you get the most important feedback immediately. The weighted scoring model ensures that a failure in a critical test drags your overall score down more than a failure in a low-priority test, which accurately reflects business impact.

Step 4: Run the Suite and Watch Live

Once your test cases are defined, you click "Run all tests" and the engine begins executing. Unlike traditional testing tools that return results in a batch after everything finishes, the Bot Test Suite streams every turn in real time.

The live execution preview shows a chat-style feed where you can watch the bot answer as if you were chatting with it yourself. User bubbles show the test input. Assistant bubbles show the bot's actual response, with the latency displayed in milliseconds so you can spot slow responses. Expected bubbles show what you defined as the correct answer, making it easy to compare at a glance. If a guardrail blocks a message, a blocked badge appears. If an error occurs, an error banner explains what went wrong.

A progress bar at the top tracks completion across the full suite, showing how many tests have finished out of the total. An abort button lets you cancel long runs mid-flight if you spot a problem early. Each completed assistant bubble includes a "View result" link that jumps straight to the full scoring details for that specific test case.

This live preview transforms testing from a batch process into an interactive debugging experience. Instead of waiting five minutes and then getting a PDF report, you watch the bot reason in real time, spot issues as they happen, and stop immediately when something looks wrong. It feels less like running a test suite and more like pair-programming with your bot.

Step 5: Review Scores and Verdicts

After the run completes, every test case receives a verdict and a detailed score breakdown. The verdict is a simple pass, partial, fail, error, or skipped label that tells you the overall outcome at a glance.

A pass verdict means the test scored 70% or higher overall, and all hard assertions passed. This is your green light — the bot responded correctly, safely, and appropriately for this scenario.

A partial verdict means the test scored between 40% and 69%. The response was acceptable but not ideal. Maybe it was slightly off-topic, missing some details, or not as helpful as it could be. Partial results are warnings — they will not block deployment, but they suggest areas for improvement.

A fail verdict means the test scored below 40%, or any hard assertion failed. This is a red flag. The bot either gave a fundamentally wrong answer, missed a required phrase, included a forbidden phrase, or failed a guardrail check. Failures should be investigated and fixed before deployment.

An error verdict means the test could not complete due to a runtime issue — typically a missing BYOK key, a network failure, or an unexpected exception. Errors are not about bot quality; they are about setup issues that prevent testing from happening at all.

A skipped verdict is reserved for future conditional-skip logic and is not used today.

Each result includes a five-dimension score breakdown that explains exactly where the bot performed well and where it fell short. We will explain the scoring model in detail in the next section.

Step 6: Read AI-Generated Fix Suggestions

Here is where the Bot Test Suite goes beyond traditional testing and becomes genuinely intelligent. After every run completes, an LLM analyzes the failed results and produces actionable fix suggestions.

The AI analysis starts with a summary — two or three sentences describing the bot's overall performance across the test suite. It then lists strengths — the things your bot is doing well, like consistently passing happy-path tests or maintaining strong guardrail compliance. It lists weaknesses — problem areas where multiple tests are failing or scores are trending downward.

Most importantly, it produces detailed fix suggestions for each failed test case. Each suggestion includes:

  • The issue — a clear description of what went wrong. For example, "The bot responded with a generic greeting instead of asking for the order number when the user requested a return."

  • The root cause — why the bot behaved this way. For example, "The system prompt does not include explicit instructions to ask for order details before processing return requests."

  • The suggestion — exactly how to fix it. For example, "Update the system prompt to include: 'When a user mentions a return or refund, always ask for their order number first.'"

  • The fix type — a category that tells you which part of the bot needs changing. The options are prompt_edit, guardrail_add, guardrail_edit, escalation_rule, knowledge_gap, or general. This helps you route the fix to the right person or system.

  • The severity — critical, high, medium, or low. This tells you how urgently you should address the issue.

  • The confidence — a score from 0 to 1 indicating how certain the AI is about its recommendation. High-confidence suggestions are usually safe to apply. Low-confidence suggestions warrant human review.

The dashboard renders these suggestions as actionable cards with color-coded severity badges. You can click directly from a suggestion to the relevant bot configuration page to apply the fix. This closes the loop between "test failed" and "problem fixed" in minutes instead of hours.


Understanding the Five-Dimension Scoring Model

Every test result in the Bot Test Suite is evaluated across five weighted dimensions. Understanding these dimensions helps you interpret scores accurately and focus your improvement efforts on the right areas.

Relevance — 30% Weight

Relevance measures how closely the bot's actual response relates to the expected response. It is computed using Jaccard word-overlap, which counts how many words the actual and expected outputs have in common, divided by the total unique words across both. For multi-turn conversations, the score is blended with per-turn exact matching to account for the fact that later turns build on earlier context.

Think of relevance as answering the question: "Did the bot stay on topic?" If the user asks about refunds and the bot starts talking about shipping, relevance will be low. If the bot addresses the refund question with appropriate detail, relevance will be high. This dimension catches off-topic responses, tangents, and hallucinations where the bot invents information unrelated to the query.

Accuracy — 30% Weight

Accuracy measures whether the bot passed all hard assertions. This is the most important dimension because assertions are non-negotiable. If you specified that the response must contain "return policy" and the bot omitted it, accuracy drops. If you specified that the response must not contain "guaranteed refund" and the bot included it, accuracy drops.

Accuracy is binary for each assertion — pass or fail — but the overall accuracy score is the percentage of assertions that passed. If you have four assertions and two fail, accuracy is 50%. Because assertions represent business rules you explicitly defined, accuracy failures are the most serious type of failure. They indicate that the bot violated a requirement you deemed important enough to codify.

Completeness — 15% Weight

Completeness measures whether the bot's response was thorough enough compared to the expected output. It looks at the length ratio between actual and expected responses, combined with concept coverage — how many of the key ideas from the expected response appeared in the actual response.

A response that is too short might be incomplete — the bot glossed over important details. A response that is much longer than expected might be rambling or include unnecessary information. Completeness catches both extremes, rewarding responses that are appropriately detailed without being verbose.

For example, if the expected response explains the return policy in three sentences covering eligibility, timeline, and refund method, but the actual response only mentions eligibility, completeness will be low. The bot answered the question but did not fully address it.

Tone — 10% Weight

Tone measures whether the bot's response has an appropriate tone — professional, helpful, on-brand, and respectful. Currently, tone scoring uses a baseline heuristic that evaluates response length and structure. Very short responses (fewer than ten characters) are penalized because they suggest dismissiveness or errors. Reasonable-length responses score well.

A future update will replace the heuristic with LLM-based tone evaluation, which will assess whether the response matches your brand voice, uses appropriate empathy levels, avoids robotic phrasing, and maintains consistency across the conversation. Until then, tone is a lightweight signal that catches obviously inappropriate responses.

Guardrail Compliance — 15% Weight

Guardrail compliance measures whether the bot's guardrail behavior matched your expectations. If you set expectedGuardrailAction to "block" and the guardrail did block the message, compliance is perfect. If you expected a block but the guardrail did nothing, compliance is zero. If you expected no guardrail action and the guardrail stayed quiet, compliance is perfect. If you expected no action but the guardrail blocked anyway, compliance is zero.

This dimension is critical because guardrails are your safety net. A bot that gives perfect answers but bypasses safety rules is not ready for production. Guardrail compliance ensures that your protective mechanisms are working exactly as intended.

Computing the Overall Score

The overall score is a weighted average of the five dimensions. Relevance and accuracy each contribute 30%, completeness contributes 15%, tone contributes 10%, and guardrail compliance contributes 15%. The weights reflect the relative importance of each dimension — relevance and accuracy are twice as important as tone because they directly measure whether the bot answered correctly and followed your rules.

The formula looks like this:

overall = relevance * 0.30 + accuracy * 0.30 + completeness * 0.15 + tone * 0.10 + guardrailCompliance * 0.15

Each dimension is scored from 0 to 100, and the overall score is rounded to the nearest integer. The verdict is then determined by combining the overall score with the assertion results.


LLM-as-a-Judge: When You Want Deeper Evaluation

The five-dimension scoring model uses deterministic heuristics — mathematical formulas that compute scores based on word overlap, length ratios, and binary pass or fail checks. These heuristics are fast, consistent, and do not consume extra LLM tokens. They work well for most testing needs.

But sometimes you want deeper, more nuanced evaluation. You might want an LLM to judge whether a response is genuinely helpful rather than just keyword-matching the expected output. You might want an AI to assess whether the tone is empathetic and on-brand rather than just checking the length. You might want a judge that understands context and nuance rather than counting words.

This is where LLM-as-a-Judge comes in. When enabled, the Test Suite sends each test result to a separate LLM with a detailed evaluation prompt. The judge model reads the full conversation, the expected behavior, the actual output, and the assertion requirements, then produces a thoughtful assessment across the same five dimensions.

The judge prompt asks the model to evaluate:

  • Relevance — Did the response actually address the user's question, or did it just share some keywords?
  • Accuracy — Are all required terms present, all forbidden terms absent, and guardrail behavior correct?
  • Completeness — Is the response thorough and comprehensive compared to what was expected?
  • Tone — Is the tone appropriate, professional, helpful, and on-brand?
  • Guardrail Compliance — Did the actual guardrail behavior match the expected action?

The judge returns scores, a verdict, and a reasoning paragraph explaining its decision. If the judge LLM fails for any reason — network error, rate limit, parsing issue — the system automatically falls back to deterministic scoring. Your test run never breaks because of a scoring problem.

You can configure the judge model independently of your bot's production model. This is useful because:

  • You can use a fast, cheap model for quick scoring during rapid iteration
  • You can use a powerful reasoning model for thorough evaluation before major releases
  • You can experiment with different judge models to find the one that best matches your quality standards
  • You can keep scoring costs separate from production costs

The choice is yours. The Test Suite supports both approaches and lets you switch between them for any run.


Run History: Your Quality Archive

Every test run is permanently stored in Firestore, creating a complete quality archive for every bot in your organization. This archive is not just a log — it is a powerful tool for tracking quality trends, investigating regressions, and demonstrating compliance.

The run history viewer provides three levels of drill-down.

The runs list shows every test execution with the date, duration, number of tests, pass or fail counts, and overall score. You can see at a glance whether quality is improving or degrading over time. A bot that scored 92% last week and 67% this week has a problem that needs investigation.

The results list shows every test case within a selected run, with its verdict, latency, model used, and whether a guardrail blocked the message. You can filter by category or priority to focus on specific types of tests. For example, if you only care about guardrail tests, you can filter to see just those results.

The detail pane shows the full breakdown for a single test result. This includes the five-dimension score grid, the actual output side by side with the expected output, the specific failure reasons, and token usage statistics. If a test failed because of a missing phrase, you see exactly which phrase was missing. If a test failed because of a guardrail mismatch, you see what the guardrail did versus what it should have done.

Deep Trace Linkage

Every test result includes a trace ID that links directly to the Traces page. Traces show the full request-level telemetry for that specific test execution — the exact LLM request sent, the raw LLM response, the guardrail evaluation details, the prompt assembly process, and the latency breakdown by component.

This means testing is not just about pass or fail — it is a first-class observability tool. When a test fails, you do not just see a failure message. You can inspect the exact production telemetry that produced the failure, understand why the bot responded the way it did, and make targeted fixes with confidence.

Think of it as the difference between a car dashboard that says "engine problem" and one that shows you the exact sensor readings, error codes, and diagnostic data. The Test Suite gives you the diagnostic data.


Dashboard Widget: Quality Visibility for Everyone

Quality should not be hidden in a testing tool that only engineers know how to use. The Bot Tests Widget on your dashboard home screen makes quality visible to everyone in your organization.

The widget shows an org-wide rollup with three key numbers:

  • Total test cases across all bots — the breadth of your testing coverage
  • Average pass rate with a color-coded health indicator — green for healthy, amber for caution, red for concern
  • Bots tested — how many of your bots have active test suites

Below the rollup, the widget shows per-bot rows with each bot's name, test count, and pass rate. Color-coded icons make it easy to spot problems at a glance — a green checkmark for healthy bots, an amber triangle for bots needing attention, and a red X for bots with serious quality issues. Each row links directly to that bot's test suite for quick investigation.

This visibility transforms testing from an engineering activity into an organizational metric. Executives can see quality trends. Product managers can identify which bots need attention before launch. QA teams can track their coverage progress. Support teams can reference test results when investigating customer complaints.

Per-bot stats are also shown directly on the test suite page, including total tests, enabled test count, pass rate, average score, and last run score. These numbers update automatically after every run, so you always have current data.


RBAC: Secure, Role-Based Testing Access

The Bot Test Suite respects your organization's role-based access control, ensuring that testing is accessible to the right people without exposing sensitive configuration to everyone.

Bot admins have full control. They can create new test cases, update existing ones, delete tests that are no longer needed, run test suites, and view all results. Admins are typically senior engineers, QA leads, and platform operators who own bot quality.

Operators can run tests and view results, but they cannot create, edit, or delete test cases. This lets QA teams and support engineers execute runs freely without risking accidental modification of the test suite. Operators get all the visibility they need without the power to break things.

Viewers can see test results and run history, but they cannot execute runs or modify anything. This is appropriate for stakeholders who need quality visibility but do not participate in testing.

All test activity is tracked in the audit log with the actor's identity, the action performed, the timestamp, and the outcome. If someone deletes a critical test case, you know exactly who did it and when. If a test run fails unexpectedly, you can see who triggered it and what the configuration was.

This layered permission model ensures that testing scales securely across large organizations without creating governance risks.


Plan Limits and Scaling

The Bot Test Suite scales with your Rylvo plan, giving small teams room to start and large organizations unlimited capacity. Here is how the limits break down:

On the Free plan, you can create up to five test cases per bot and run the suite three times per month. This is perfect for small projects, personal bots, and early experimentation. Five test cases cover the most critical happy-path and guardrail scenarios, and three runs per month are enough for occasional quality checks.

On the Starter plan, the limits increase to 25 test cases per bot and 50 runs per month. This supports small teams with a few bots in production who want more comprehensive coverage without breaking the budget.

On the Team plan, you get 100 test cases per bot and 500 runs per month. This is the sweet spot for growing organizations with multiple bots, dedicated QA processes, and regular regression testing. 100 test cases per bot allow for thorough coverage across all categories and priorities.

On the Business plan, the limit rises to 250 test cases per bot with unlimited runs. This supports large organizations with complex bots, frequent deployments, and continuous quality monitoring. Unlimited runs mean you can test after every prompt change without worrying about quotas.

On the Enterprise plan, both test cases and runs are unlimited. This is for organizations with mission-critical bots, compliance requirements, and large QA teams who need maximum flexibility without artificial constraints.

Every run consumes one test-suite run quota, regardless of how many test cases are in the suite. A run with 20 test cases costs the same as a run with 2 test cases. The quota resets monthly on your billing cycle. Plan limits ensure predictable costs while giving growing teams room to scale their testing as their bots mature.


Integration with the Broader Rylvo Platform

The Bot Test Suite does not operate in isolation. It is deeply integrated with the broader Rylvo platform, creating a cohesive quality ecosystem.

Self-Evolving Agent

The Self-Evolving Agent system can automatically generate new test cases from production failure patterns. When a bot fails in a real customer conversation, the system analyzes the trace, identifies the gap, and proposes a new test case to prevent the same failure from happening again. This closes the loop between production issues and test coverage, ensuring your test suite grows smarter over time without manual effort.

For example, if a customer asks a question the bot has never seen before and responds incorrectly, the Self-Evolving Agent might create a new edge-case test covering that exact scenario. The next time you run the suite, that scenario is tested. Over months, your test suite becomes a comprehensive safety net capturing every failure pattern your bot has ever encountered.

Workspace Architect

During the initial bot setup process, the Workspace Architect can propose enabling evolution, which automatically creates starter test configurations tailored to your bot's purpose. If you are building a customer support bot, the Architect might create happy-path tests for common support questions, guardrail tests for safety scenarios, and edge-case tests for ambiguous phrasing.

The Architect also provides a simulated smoke test endpoint at /api/workspace-architect/bot-test that validates your bot's configuration wiring before go-live. This is not a full test suite — it is a quick sanity check that your bot can respond to a sample message without errors. It catches configuration mistakes early, before you invest time in building a full test suite.

Traces

Every test result links to its production trace. Traces provide request-level observability, showing the exact LLM request, the raw LLM response, the guardrail evaluation details, the prompt assembly process, and the latency breakdown. When a test fails, the trace tells you exactly what happened inside the bot at the moment of failure.

This integration means testing and observability are not separate concerns. They are the same concern viewed from different angles. Testing predicts whether the bot will behave correctly. Observability confirms what actually happened when it did not. Together, they give you complete visibility into bot quality.

Guardrails

Tests exercise the same guardrail evaluator that production chat uses. Input guardrails run before every user message. Output guardrails run after every bot response. The scoring model includes a dedicated guardrail compliance dimension that validates whether guardrails triggered when they should have and stayed quiet when they should not.

This native integration means you are not just testing the bot's responses — you are testing the bot's safety mechanisms. A bot that gives perfect answers but fails to block harmful requests is not a safe bot. The Test Suite ensures both quality and safety.

Prompts

Tests validate the current version of your bot's system prompt. When you update a prompt, the next test run will automatically exercise the new version. If the update introduces a regression — a previously passing test now fails — you catch it immediately instead of discovering it in production.

This is especially powerful when combined with the dashboard widget. You can see your bot's pass rate over time and correlate drops with prompt changes. A sudden drop in pass rate after a prompt update is a clear signal to investigate and potentially revert.


Getting Started: Your First Test Suite in Fifteen Minutes

If you have never written a bot test before, do not worry. The process is straightforward, and you can have your first test suite running in about fifteen minutes. Here is exactly what to do.

Step 1: Open the Test Suite

Navigate to /dashboard/tests for the org-wide bot picker, or go directly to /dashboard/bots/{yourBotId}/tests if you already know which bot you want to test. The interface shows a three-pane workbench: the left rail lists your bots and their test cases, the center pane shows details for the selected test, and the right pane shows the live chat preview.

Step 2: Create Your First Test Case

Click the "Add test case" button in the header. A modal opens where you fill in the test details.

Give your test a clear name. Something like "Refund request — happy path" is better than "Test 1" because it tells you exactly what the test covers. Add a brief description explaining the scenario — "A customer asks to return an order within the eligible window."

Pick a category. If this is a standard successful interaction, choose "Happy Path." If you are testing something unusual, choose "Edge Case." If you are verifying a safety rule, choose "Guardrail Test."

Set the priority. Start with "High" for your first tests. You can always adjust later as your suite grows.

Now build the conversation flow. Click "Add step" and toggle between "User" and "Assistant." For a simple test, add one user message and one assistant message. For a multi-turn test, add as many turns as the scenario requires.

After defining the messages, set your assertions. In the mustContain field, enter key phrases that the bot must include in its response. For a refund test, you might require "order number" and "return policy." In the mustNotContain field, enter phrases that must never appear. For the same test, you might forbid "guaranteed refund" because your policy does not guarantee refunds.

If your test involves guardrails, set the expectedGuardrailAction field. If the user message is safe, set it to "none." If the user message should trigger a block, set it to "block."

Click "Save" and your test case is created.

Step 3: Add a Few More Tests

Repeat the process to add two or three more test cases. A good starter suite for any bot includes:

  • One happy-path test covering the most common interaction
  • One edge-case test with unusual input like a misspelling or ambiguous question
  • One guardrail test verifying that a safety rule triggers correctly

This gives you baseline coverage across the three most important categories.

Step 4: Configure the Run

Before running, configure the judge and analysis models. The "Judge model" is the LLM used to score each test case. It defaults to a sensible model, but you can change it if you prefer. The "Analysis model" is the LLM used for post-run AI analysis. It defaults to your bot's configured model, but you can override it.

Step 5: Run and Watch

Click "Run all tests." The right pane switches to the live chat feed, and you see your bot responding to each test message in real time. Watch the conversation unfold, observe the latency, and note any surprises.

A progress bar tracks completion. When the run finishes, the header updates with the overall pass rate.

Step 6: Review Results

Switch to the "Run history" tab. Click the run you just completed. You see a list of results with verdicts and scores. Click any result to see the full breakdown — actual output, expected output, dimension scores, and failure reasons.

If all tests passed, congratulations. Your bot is behaving correctly for the scenarios you tested.

If some tests failed, switch to the "Analysis" tab. The AI has already analyzed the failures and produced fix suggestions. Read the suggestions, identify the ones with high confidence and critical severity, and apply them to your bot configuration. Then re-run the suite to verify the fixes.

That is it. In fifteen minutes, you went from untested to tested, with AI-powered guidance on how to improve.


Real-World Use Cases

Solo Developer: Confidence Before Shipping

Alex is a solo developer building a customer support bot for a small e-commerce store. Before launching, Alex wants to make sure the bot handles the most common scenarios correctly and never promises things the store cannot deliver.

Alex opens the Test Suite, picks the support bot, and creates five test cases in about twenty minutes: two happy-path tests for refund requests and order tracking, one edge-case test for a misspelled product name, one guardrail test verifying the bot does not promise guaranteed refunds, and one error-handling test for when the order lookup API is down.

Alex runs the suite. Four tests pass, but the guardrail test fails — the bot responded to a refund request with "Sure, I will process that immediately" without asking for the order number or explaining the policy. The AI analysis suggests adding a system prompt instruction: "Always verify order eligibility before offering refunds."

Alex updates the prompt, re-runs the suite, and gets five out of five passes. The bot ships with confidence. Total time invested: one hour. Peace of mind: priceless.

QA Team: Preventing Regressions at Scale

A mid-sized SaaS company has twelve bots in production across customer support, sales, onboarding, and technical documentation. The QA team assigns one engineer to maintain test suites for all bots.

Every bot has between ten and twenty test cases covering core flows. Before any prompt update, the engineer runs the full suite for that bot. One day, a marketing team updates the sales bot's prompt to sound more enthusiastic. The test suite immediately catches a regression — the new prompt made the bot so enthusiastic that it started promising features that do not exist yet.

The AI analysis flags the issue as a prompt_edit fix with critical severity. The marketing team refines the wording, the engineer re-runs the suite, and all tests pass. A potentially embarrassing customer-facing mistake is caught before a single customer sees it.

Over three months, this process prevents four regressions, two policy violations, and one safety issue. The QA team reports a 100% regression-free deployment record.

Platform Operator: Quality as a KPI

A fintech platform with compliance requirements needs to demonstrate that its AI agents meet quality standards. The platform operator uses the dashboard widget to track quality metrics across all bots.

One week, the operator notices that the compliance bot's pass rate dropped from 94% to 71%. The widget shows the amber warning icon, prompting investigation. The operator drills into the run history and sees that three guardrail tests started failing after a recent prompt update.

The AI analysis identifies the root cause: the new prompt softened language around financial advice disclaimers, causing the guardrail that blocks unauthorized investment recommendations to trigger on legitimate questions. The fix suggestion is a guardrail_edit with high confidence.

The operator works with the engineering team to adjust the guardrail sensitivity, re-runs the suite, and sees the pass rate recover to 92%. The compliance report for that week shows a brief dip followed by rapid resolution — exactly the kind of proactive quality management that auditors want to see.


Bot Test Suite vs. External Testing Tools: A Detailed Comparison

To understand why the Bot Test Suite is the right choice for Rylvo users, let us compare it directly with typical external testing tools.

Production Runtime Fidelity. External tools mock or approximate LLM behavior because they cannot access your bot's internal runtime. They might send your prompt to a generic API and compare the response to an expected string. The Rylvo Bot Test Suite uses the exact botRuntime.ts path that production chat uses. It loads the same context, evaluates the same guardrails, assembles the same LLM messages, and calls the same API endpoints. The test result is literally a production conversation that happened in a testing context.

Multi-Turn Context. Most external tools test single-turn interactions because maintaining conversation state across tools is complex. The Bot Test Suite accumulates actual assistant responses as conversation history for subsequent turns. If turn three depends on what the bot said in turn two, the test validates that dependency authentically. This is essential for testing complex conversations like troubleshooting workflows, interview-style interactions, and guided form filling.

Guardrail Evaluation. External tools cannot see your guardrails because guardrails live inside Rylvo's runtime, not in the external tool. The Bot Test Suite has a dedicated guardrailCompliance scoring dimension that validates whether guardrails triggered when they should have and stayed quiet when they should not. You are not just testing the bot's answers — you are testing the bot's safety.

Real LLM Calls. External tools often use sandbox models or cached responses to save cost. The Bot Test Suite runs against your actual model using your BYOK key. If you use GPT-4 for production, the test uses GPT-4. If you use Claude, the test uses Claude. The behavior you test is the behavior your users experience.

Live Execution Preview. External tools return results in a batch after completion. You wait, then read a report. The Bot Test Suite streams every turn in real time through a chat-style interface. You watch the bot answer as it happens, spot issues immediately, and abort if something looks wrong. This transforms testing from a batch report into an interactive debugging session.

AI Fix Suggestions. When external tests fail, you get an error message. Figuring out why and how to fix it is manual work. The Bot Test Suite automatically analyzes failures with an LLM and produces actionable fix suggestions with severity ratings, confidence scores, and categorized fix types. The gap between "test failed" and "problem fixed" shrinks from hours to minutes.

Deep Trace Linkage. External testing tools operate in isolation from your observability stack. When a test fails, you have no way to inspect what happened inside the bot. The Bot Test Suite links every result to its production trace, giving you full request-level telemetry — the exact LLM request, the raw response, the guardrail evaluation, the prompt assembly, and the latency breakdown.

Prompt Regression Detection. External tools do not track whether a previously passing test now fails after an update. The Bot Test Suite's run history makes regressions obvious. You can compare any two runs and see exactly which tests changed from pass to fail.

Setup Complexity. External tools require integration, API keys, configuration, and maintenance. The Bot Test Suite is built into Rylvo. Zero configuration. Zero integration work. It works the moment you create your first test case.


Performance, Reliability, and Best Practices

Sequential Execution

Test cases run one at a time to preserve deterministic context and respect rate limits. Each test builds on a fresh conversation state, so there is no cross-test contamination. This is slower than parallel execution, but it ensures that a failure in test three does not affect test four. Parallel execution is on the roadmap for teams that need faster suites.

Abort Support

Long-running suites can be cancelled mid-flight with the abort button. Partial results are preserved, so you do not lose the work already done. This is useful when you spot a problem early and want to fix it before continuing.

Graceful Degradation

If the judge LLM fails — network error, rate limit, parsing issue — the system automatically falls back to deterministic scoring. Your test run never breaks because of a scoring problem. The results might be slightly less nuanced, but they are always available.

BYOK Validation

If the bot's model key is missing or invalid, the suite aborts immediately with a clear error message directing you to Settings → LLM Provider Keys. There is no confusing failure or silent timeout. The error tells you exactly what to fix.

Result Persistence

Every run and result is stored in Firestore with full metadata. Even if you close the browser, the results are there when you come back. Run history is permanent, creating a long-term quality archive.

Best Practices for Building Effective Test Suites

Start with happy paths. Before testing edge cases and guardrails, make sure your bot handles the most common scenarios correctly. Happy-path tests are your foundation. If these fail, nothing else matters.

Add guardrail tests early. Safety should not be an afterthought. Create guardrail tests as soon as you define your guardrails. Verify that they block harmful inputs, allow safe inputs, and do not create false positives.

Use mustContain for critical information. If there are phrases your bot must always include — like policy references, disclaimers, or contact information — codify them as mustContain assertions. This prevents omissions that could create liability.

Use mustNotContain for forbidden phrases. If there are phrases your bot must never say — like promises you cannot keep, sensitive information, or off-brand language — codify them as mustNotContain assertions. This is your safety net.

Test multi-turn flows for complex bots. If your bot handles conversations with multiple exchanges, create multi-turn tests. Single-turn tests miss the context accumulation that makes multi-turn conversations work.

Run the suite after every prompt change. Prompts are the single biggest influence on bot behavior. Re-running the suite after every change catches regressions before they reach customers.

Review AI analysis, even for passing runs. The AI analysis includes strengths and weaknesses for every run, not just failures. A passing run might still have weaknesses that suggest room for improvement. Low scores in tone or completeness are opportunities to make the bot better even when it is technically correct.


Roadmap: What Is Coming Next

The Bot Test Suite is actively evolving. Here is what is on the roadmap:

Scheduled test runs will let you set up cron-based automation, so your test suite runs automatically every day, every week, or on a custom schedule. This uses the existing ScheduledTaskDoc infrastructure from the Edge Cases feature.

CI/CD webhook endpoint will expose a POST /v1/bots/{id}/test endpoint with API-key authentication. GitHub Actions, Jenkins, GitLab CI, and other pipelines will be able to trigger test runs automatically on deployment and receive JSON results plus fix suggestions.

Parallel test execution will run multiple test cases simultaneously with isolated conversation state, dramatically reducing suite execution time for large test suites.

Auto-test generation from traces will let the Self-Evolving Agent mine failed production traces and automatically create new test cases. Your test suite will grow organically as your bot encounters new scenarios.

Regression alerts will compare the latest run to the previous run and send Slack or email notifications when scores degrade. You will know about quality issues within minutes of a test run completing.

LLM-based tone scoring will replace the current heuristic with a model-based evaluation of tone quality, empathy, brand voice, and appropriateness.

Synthetic user simulation will generate adversarial user personas that stress-test your bot with unexpected, creative, and challenging inputs.

A/B testing two bot versions will let you run the same suite against two different bot configurations and compare scores side by side. This is perfect for evaluating prompt variants before committing to one.

Public test result badges will generate embeddable SVG shields showing your bot's pass rate. Add them to your documentation, README, or status page to demonstrate quality to customers and stakeholders.


Frequently Asked Questions

Do I need a separate API key for testing? No. The Test Suite uses the same Bring Your Own Key (BYOK) that your bot uses for live chat. If your bot has a valid provider key configured in Settings → LLM Provider Keys, testing works automatically. If the key is missing, the suite aborts with a clear error telling you exactly where to add it. There is no extra cost for testing beyond the LLM calls the tests themselves consume.

Can I test multi-turn conversations? Yes, fully. The Test Suite was built with multi-turn flows as a first-class feature. Each turn builds on the actual conversation history from previous turns. You define the full dialogue sequence with alternating user and assistant messages, and the runner validates every turn in authentic context. This is one of the key differentiators from external tools that only test single-turn interactions.

What happens when a guardrail triggers during a test? The test records the guardrail block, the name of the blocking guardrail, and scores guardrail compliance accordingly. If you expected the guardrail to block — for example, you set expectedGuardrailAction to "block" for a harmful input — and the guardrail does block, the test passes the guardrail compliance dimension. If you did not expect a block and the guardrail triggers anyway, the test fails with a clear reason explaining which guardrail fired unexpectedly.

How is scoring different from simple keyword matching? The five-dimension scoring model evaluates relevance, accuracy, completeness, tone, and guardrail compliance. Relevance uses Jaccard word-overlap and per-turn matching. Accuracy checks hard assertions. Completeness measures thoroughness. Tone assesses appropriateness. Guardrail compliance validates safety behavior. You can also enable LLM-as-a-Judge for nuanced AI evaluation. Hard assertions are always non-negotiable and override the score.

Can I run tests automatically on a schedule? Not yet. Currently, tests must be triggered manually from the dashboard. Scheduled cron runs and CI/CD webhooks are on the roadmap and will be available soon. The workaround is to embed the executeTestSuite() client logic into your own automation scripts.

What if my test fails? Each failure includes a detailed reason — missing required phrase, forbidden phrase detected, guardrail mismatch, low relevance score, etc. Switch to the Analysis tab for AI-generated fix suggestions with severity ratings and confidence scores. Every result also links to its production trace for deep inspection. The typical workflow is: see failure, read analysis, apply suggestion, re-run, verify fix.

Are test results shared across my team? Yes. Test cases, runs, and results are stored in Firestore under your organization. Anyone with appropriate RBAC permissions can view and manage them. Admins create and delete tests. Operators run tests and view results. Viewers see results without modifying anything.

Can I disable a test without deleting it? Yes. Every test case has an enable or disable toggle. Disabled tests are skipped during runs but remain in your suite for future reactivation. This is useful for temporarily skipping flaky tests while you investigate, or for seasonal tests that only apply during certain campaigns.

Does the Test Suite work with all LLM providers? Yes. As long as your bot has a valid BYOK key for its configured model, the Test Suite works with any provider Rylvo supports — OpenAI, Anthropic, Google Gemini, Mistral, Cohere, and any provider accessible through OpenRouter.

How many tests can I create? It depends on your plan. Free plans allow 5 test cases per bot. Starter allows 25. Team allows 100. Business allows 250. Enterprise is unlimited. See the plan limits section above for the full breakdown.

What is the difference between deterministic scoring and LLM-as-a-Judge? Deterministic scoring uses fast, consistent heuristics — word overlap, length ratios, keyword matching. It is free, reliable, and works well for most needs. LLM-as-a-Judge uses a separate LLM for nuanced evaluation of relevance, tone, and completeness. It is more accurate and context-aware but costs one extra LLM call per test. You can choose per-run.

Can I export test results? Currently, results are viewable in the dashboard with full drill-down. A public API endpoint for programmatic access, CI integration, and export is planned for an upcoming release.

How long does a test run take? It depends on the number of test cases, the number of turns per test, and the speed of your LLM provider. A small suite of five single-turn tests might finish in under a minute. A large suite of twenty multi-turn tests might take five to ten minutes. The live preview lets you watch progress in real time.

What is BYOK and why does the Test Suite need it? BYOK stands for Bring Your Own Key. It means you provide your own LLM API key rather than using a platform-provided model. The Test Suite needs BYOK because it runs real LLM calls against your actual model. Without a valid key, there is no model to test against. This is the same key your bot uses for live chat, so if your bot works, testing works.


Ready to Test Your Bots with Confidence?

The Rylvo Bot Test Suite transforms AI agent quality assurance from an afterthought into a first-class engineering practice. It is not an external tool bolted onto the side of your platform. It is built into the heart of Rylvo, running inside the same production runtime that handles your live customer conversations.

Multi-turn conversation flows let you test real dialogues, not just single questions. Hard assertions lock down critical behaviors that must never break. The five-dimension scoring model gives you nuanced quality metrics, not just pass or fail. LLM-as-a-Judge provides deep AI evaluation when you need it. The live execution preview lets you watch your bot think in real time. AI-generated fix suggestions close the gap between "test failed" and "problem fixed" in minutes. Deep trace linkage connects every result to full production telemetry.

And it is all accessible from your dashboard, integrated with your existing bots, prompts, guardrails, and traces. No external tools to integrate. No mock layers to maintain. No guesswork about whether your bot will behave correctly.

Open the Test Suite in your dashboard today. Pick a bot, create your first test case, and watch your bot answer in real time. In fifteen minutes, you will know more about your bot's quality than you have ever known before.

Ship AI agents with confidence.

R

Rylvo Team

Rylvo Team

More Articles