Rylvo

Agent Evolution

Improve bots continuously.

Turn on Agent Evolution for any bot and let the system find improvements, run them through review, and deploy the winners — with full audit trails.

No silent regressions

Every rule carries applied / success / failure counters. If a rule hurts, it rolls back - automatically.

Humans decide what ships

Auto-promotion is a setting, not a default. Reviewers and operators stay in the loop for regulated bots.

Audit-grade history

Every rule change is versioned with diffs, author, and timestamp. Roll back to any point in seconds.

Safe by construction

Agents that learn, without learning recklessly

Self-improving systems are only useful if you can trust the improvements. Our evolution engine is built around model-agnostic evaluation, recognized governance frameworks, and guarantees that make every rule promotion reversible.

Model-agnostic evaluation — bring any judge

OpenAI
Anthropic Claude
Google Gemini
Mistral
Meta Llama
DeepSeek
+ any LLM-as-judge

You choose the evaluator. Run frontier models, open-source models, or a panel of both.

Aligned to AI governance frameworks

NIST AI RMF

Govern, Map, Measure, Manage primitives reflected in our control plane

EU AI Act readiness

Audit trails, human oversight gates, and risk tiering by use-case

ISO/IEC 42001

AI Management System practices — engagement on roadmap

OWASP LLM Top 10

Prompt-injection, output-handling, and supply-chain controls

MITRE ATLAS

Adversarial-ML threat modeling baked into reviewer panels

Safety guarantees — with honest status

We tag every commitment with its real production status. No vague badges, no implied attestations.

Shipping

Canary + auto-rollback

Every promoted rule rolls out to a small canary slice first. If win-rate, latency, or safety metrics regress, the rule rolls back without a human in the loop.
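To make the canary check concrete, here is a minimal sketch that compares a canary slice against the baseline on the three metric families named above. The field names, the tolerance values, and the `should_roll_back` helper are assumptions made for illustration, not Rylvo's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceMetrics:
    """Aggregated metrics for one traffic slice (hypothetical shape)."""
    win_rate: float               # fraction of runs judged successful, 0..1
    p95_latency_ms: float         # 95th-percentile latency in milliseconds
    safety_violation_rate: float  # fraction of runs with a guardrail block, 0..1


def should_roll_back(
    baseline: SliceMetrics,
    canary: SliceMetrics,
    max_win_rate_drop: float = 0.02,     # assumed tolerance
    max_latency_increase: float = 0.10,  # assumed tolerance (10%)
) -> list[str]:
    """Return the names of regressed metrics; an empty list means keep the rule."""
    regressions = []
    if baseline.win_rate - canary.win_rate > max_win_rate_drop:
        regressions.append("win_rate")
    if canary.p95_latency_ms > baseline.p95_latency_ms * (1 + max_latency_increase):
        regressions.append("latency")
    if canary.safety_violation_rate > baseline.safety_violation_rate:
        regressions.append("safety")
    return regressions


baseline = SliceMetrics(win_rate=0.94, p95_latency_ms=800.0, safety_violation_rate=0.0)
canary = SliceMetrics(win_rate=0.90, p95_latency_ms=820.0, safety_violation_rate=0.0)
regressed = should_roll_back(baseline, canary)  # non-empty, so the rule would roll back
```

Returning the list of regressed metrics, rather than a bare boolean, gives the audit record a reason for each rollback.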

Shipping

Reviewer veto by default

Rules touching high-risk capabilities (writes, refunds, deletions) require an explicit human reviewer approval before promotion. The reviewer panel is configurable per-org.

Shipping

Your data stays your data

Reviewer feedback, evaluation traces, and learned rules are scoped to your org. We do not use your data to train or fine-tune any base foundation model.

Shipping

Tenant-isolated evals

Eval harnesses, golden datasets, and traces are encrypted at rest with AES-256-GCM and isolated by org_id at every layer of the stack.

Shipping

Append-only audit

Every rule promotion, reviewer decision, and rollback is recorded with actor, timestamp, and reasoning — queryable for compliance and post-mortems.

Designed for

Right-to-delete

Evaluation traces and learned rules can be exported or purged on request — designed to align with GDPR data-subject rights.

Framework alignment, not certification. Listing NIST AI RMF, EU AI Act, ISO/IEC 42001, OWASP LLM Top 10, and MITRE ATLAS reflects architectural alignment with these published frameworks — it is not a third-party attestation or certification. Model and brand names shown belong to their respective owners and indicate technical compatibility as evaluator/judge models, not endorsement or partnership.

How it works

Run. Log. Detect. Review. Learn. Improve.

The same loop runs continuously in the background of every bot. Nothing is hidden, nothing is hand-coded, and nothing ships without a trail.

Generations arena · fitness over time

mutate · judge · promote
bot=atlas · 10 generations · v10 champion · 94% pass · +32% vs v1 · evolution=ON

Champion

atlas · v10

promoted 4m ago · signed

policy v3
tone v5
kb v2

Evolution log

  • g10 promote · v10 champion +1.6%
  • g10 mutate · tone: tighten legal
  • g9 judge · cite-first rule
  • g9 mutate · policy cap $200
  • g8 promote · v8 champion +1.4%
  • g8 kill · c-8b regressed
  • g7 mutate · kb threshold 0.62

[Chart: success rate per generation]

Next gen

  • v11.a · 93% · policy tighten
  • v11.b · 95% · tone +cite
  • v11.c · 91% · kb top-K=6
  • v11.d · 92% · reranker swap

auto-mutate · operator-judged · only promoted with signed evidence

  • Mutate: tone · policy · kb · tools
  • Judge: test suite + operator
  • Promote: signed · revertible
  • Compound: fitness rises per gen

Stage by stage

Six stages. One continuous loop.

Each stage of the evolution loop is observable, auditable, and independently configurable.

01 · stage

Run

Every conversation is a learning signal

Each bot run executes with the current set of active rules injected into its system prompt, tool filters, and output checks. Nothing is hidden - every decision is observable from the moment the bot replies.

How it works

  • Active rules auto-injected via the agentHooks runtime
  • Per-bot rule catalog keeps tenants isolated
  • No manual prompt engineering to ship a fix
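
A minimal sketch of what rule injection into the system prompt could look like. The `Rule` shape and the `build_system_prompt` helper are hypothetical, and the real agentHooks runtime is not shown here; tool filters and output checks are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One entry in a per-bot rule catalog (hypothetical shape)."""
    rule_id: str
    bot_id: str
    text: str
    active: bool


def build_system_prompt(base_prompt: str, catalog: list[Rule], bot_id: str) -> str:
    """Append only this bot's active rules to its base system prompt."""
    rules = [r for r in catalog if r.bot_id == bot_id and r.active]
    if not rules:
        return base_prompt
    lines = [f"- {r.text}" for r in rules]
    return base_prompt + "\n\nActive rules:\n" + "\n".join(lines)


catalog = [
    Rule("r1", "atlas", "Cite the policy before quoting a refund amount.", True),
    Rule("r2", "atlas", "Old tone rule.", False),           # disabled, never injected
    Rule("r3", "other-bot", "Never discuss pricing.", True),  # belongs to another bot
]
prompt = build_system_prompt("You are a support agent.", catalog, "atlas")
```

Filtering on `bot_id` before anything else is what keeps one bot's rules out of another bot's prompt in this sketch.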

02 · stage

Log

Full-fidelity trace, not a sampled metric

Every run is recorded end-to-end: messages, tool calls, guardrail results, latency, cost, and outcome. The same trace powers debugging, compliance, and the reviewer pipeline.

How it works

  • Stored per-bot with bot_id, trace_id, and org scope
  • Messages + tool_calls + guardrail checks preserved
  • Success, failure, and timeout classification
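
As an illustration, a trace record and its outcome classification might be modeled as below. Only `bot_id`, `trace_id`, the org scope, and the three outcome classes come from the text above; every other field name and the classification logic are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


@dataclass
class Trace:
    """End-to-end record of one bot run (hypothetical shape)."""
    trace_id: str
    bot_id: str
    org_id: str
    messages: list[dict] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    guardrail_checks: list[dict] = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    timed_out: bool = False
    resolved: bool = False


def classify(trace: Trace) -> Outcome:
    """Classify a run; a timeout takes precedence over any other signal."""
    if trace.timed_out:
        return Outcome.TIMEOUT
    blocked = any(check.get("blocked") for check in trace.guardrail_checks)
    if blocked or not trace.resolved:
        return Outcome.FAILURE
    return Outcome.SUCCESS


ok = Trace("t1", "atlas", "org-1", resolved=True)
blocked = Trace("t2", "atlas", "org-1",
                guardrail_checks=[{"name": "pii", "blocked": True}], resolved=True)
slow = Trace("t3", "atlas", "org-1", timed_out=True)
```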

03 · stage

Detect

Failure clusters surface without a human watching

A detection pass continually scans recent runs for failure signals - timeouts, guardrail blocks, repeated escalations, sentiment drops, and behavioral loops. Matches are clustered into failure patterns with affected-run counts.

How it works

  • Timeouts, guardrail triggers, repeat escalations
  • Frustration / urgency sentiment keywords
  • Loop detection and repetition patterns
  • Automatically grouped into failure clusters
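
The detection pass can be pictured roughly as follows. This is a simplified sketch: the run dictionary keys, the keyword list, and the escalation cutoff are all assumptions, and clustering is reduced to grouping runs by signal type.

```python
from collections import defaultdict

# Assumed keyword list; a real detector would likely be far richer.
FRUSTRATION_KEYWORDS = ("useless", "frustrated", "urgent", "asap")


def detect_signals(run: dict) -> set[str]:
    """Return the failure signals present in one run."""
    signals = set()
    if run.get("timed_out"):
        signals.add("timeout")
    if run.get("guardrail_blocked"):
        signals.add("guardrail_block")
    if run.get("escalations", 0) >= 2:  # assumed cutoff for "repeat"
        signals.add("repeat_escalation")
    user_text = " ".join(run.get("user_messages", [])).lower()
    if any(keyword in user_text for keyword in FRUSTRATION_KEYWORDS):
        signals.add("frustration")
    replies = run.get("bot_replies", [])
    if any(a == b for a, b in zip(replies, replies[1:])):
        signals.add("loop")  # the bot repeated itself verbatim
    return signals


def cluster_failures(runs: list[dict]) -> dict[str, list[str]]:
    """Group trace ids by signal, so each cluster carries its affected runs."""
    clusters = defaultdict(list)
    for run in runs:
        for signal in sorted(detect_signals(run)):
            clusters[signal].append(run["trace_id"])
    return dict(clusters)


runs = [
    {"trace_id": "t1", "timed_out": True},
    {"trace_id": "t2", "user_messages": ["This is useless"],
     "bot_replies": ["Try again", "Try again"]},
    {"trace_id": "t3", "timed_out": True, "escalations": 2},
    {"trace_id": "t4"},  # clean run, no signals
]
clusters = cluster_failures(runs)
```

The length of each list is the affected-run count for that cluster.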

04 · stage

Review

Four specialized reviewers - not one opinion

When a failure cluster crosses its trigger threshold, four LLM reviewers independently analyze it. Each focuses on a single axis - trust, outcome, safety, efficiency - producing insights with confidence scores.

How it works

  • Trust - did the bot follow policy and stay in scope?
  • Outcome - did the user actually get resolution?
  • Safety - any guardrail near-miss or risk exposure?
  • Efficiency - wasted tool calls or bloated context?
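
A minimal sketch of the panel structure, assuming each reviewer is a callable that receives only the cluster summary. The `Insight` shape, `run_panel`, and the stub reviewers are hypothetical; in practice each reviewer would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable

AXES = ("trust", "outcome", "safety", "efficiency")


@dataclass(frozen=True)
class Insight:
    axis: str
    text: str
    confidence: float  # 0..1


# A reviewer takes a cluster summary and returns insights for its own axis.
Reviewer = Callable[[str], list[Insight]]


def run_panel(cluster_summary: str, reviewers: dict[str, Reviewer]) -> list[Insight]:
    """Run all four reviewers independently and collect their insights."""
    missing = [axis for axis in AXES if axis not in reviewers]
    if missing:
        raise ValueError(f"missing reviewers: {missing}")
    insights = []
    for axis in AXES:
        # Each reviewer sees only the cluster, never another reviewer's output.
        insights.extend(reviewers[axis](cluster_summary))
    return insights


def stub_reviewer(axis: str, confidence: float) -> Reviewer:
    """Stand-in for an LLM call, used only to make this sketch runnable."""
    def review(summary: str) -> list[Insight]:
        return [Insight(axis, f"{axis} finding for: {summary}", confidence)]
    return review


reviewers = {axis: stub_reviewer(axis, 0.8) for axis in AXES}
insights = run_panel("refund loop, 12 runs", reviewers)
```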

05 · stage

Learn

Insights become versioned rules

The rule generator deduplicates insights across reviewers (Jaccard similarity), scores confidence with a weighted formula, and proposes candidate rules. High-confidence candidates can auto-promote; everything else waits for a human reviewer.

How it works

  • Jaccard de-duplication across reviewer insights
  • Confidence = weighted sum of insight confidence, reviewer agreement, and pattern frequency
  • Auto-promote only when above org threshold
  • Max-rule limit evicts lowest-effectiveness active rule
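
The steps above can be sketched as follows. Word-level tokenization, the 0.6 similarity threshold, the weights, and all function names are assumptions for illustration; the text names Jaccard similarity and a weighted formula but does not give the actual parameters.

```python
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two texts over their lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def deduplicate(insights: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedily group insights that are similar to a group's first member."""
    groups: list[list[str]] = []
    for text in insights:
        for group in groups:
            if jaccard(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


def confidence(insight_conf: float, reviewer_agreement: float,
               pattern_frequency: float,
               weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three inputs, each expected in 0..1 (assumed weights)."""
    w_insight, w_agree, w_freq = weights
    return (w_insight * insight_conf + w_agree * reviewer_agreement
            + w_freq * pattern_frequency)


def evict_if_full(active: dict[str, float], max_rules: int) -> Optional[str]:
    """At the rule limit, remove and return the lowest-effectiveness rule id."""
    if len(active) < max_rules:
        return None
    worst = min(active, key=active.get)
    del active[worst]
    return worst


groups = deduplicate([
    "always cite the refund policy",
    "always cite refund policy first",       # 4 shared words of 6, so grouped
    "escalate after two failed tool calls",  # unrelated, so its own group
])
score = confidence(insight_conf=0.8, reviewer_agreement=0.75, pattern_frequency=0.5)
active = {"r1": 0.9, "r2": 0.4, "r3": 0.7}
evicted = evict_if_full(active, max_rules=3)
```

A candidate would then auto-promote only if `score` exceeds the org's configured threshold; otherwise it waits for a human reviewer.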

06 · stage

Improve

Measured, rollback-safe deployment

Active rules ship into the next run automatically. Every application is tracked for success and failure, effectiveness is recomputed continually, and underperforming rules are auto-rolled-back with full version history preserved.

How it works

  • Applied / success / failure counters per rule
  • Auto-rollback when failureRate > threshold (after ≥5 applications)
  • Version history keeps every rule edit + diff
  • Roll back to any prior version in one click
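
A minimal sketch of per-rule counters with an auto-rollback check. The five-application minimum comes from the text; the `RuleStats` class and the 0.4 threshold are assumptions standing in for the org-level setting.

```python
from dataclasses import dataclass

MIN_APPLICATIONS = 5  # rollback is only considered after five or more applications


@dataclass
class RuleStats:
    """Applied / success / failure counters for one rule (hypothetical shape)."""
    rule_id: str
    applied: int = 0
    success: int = 0
    failure: int = 0
    active: bool = True

    @property
    def failure_rate(self) -> float:
        return self.failure / self.applied if self.applied else 0.0

    def record(self, succeeded: bool, threshold: float = 0.4) -> None:
        """Count one application and disable the rule if it is underperforming."""
        self.applied += 1
        if succeeded:
            self.success += 1
        else:
            self.failure += 1
        if self.applied >= MIN_APPLICATIONS and self.failure_rate > threshold:
            self.active = False  # auto-rollback; version history is kept elsewhere


stats = RuleStats("r-17")
for outcome in (True, False, False, False):
    stats.record(outcome)
still_active_after_four = stats.active  # too few applications to judge yet
stats.record(False)                     # fifth application pushes it over
```

The minimum-application guard keeps a rule from being disabled on the strength of one or two unlucky runs.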

Under the hood

Four reviewers. One decision.

The reviewer panel

Four specialists, one clustered failure.

Trust Reviewer

Asks: Did the bot respect policy, tone, and scope constraints for this org and workflow?

Flags: Policy violations, tone drift, out-of-scope commitments.

Outcome Reviewer

Asks: Did the user actually get resolution, or did the run stall, escalate, or confuse?

Flags: Unresolved intents, missed handoffs, incomplete actions.

Safety Reviewer

Asks: Did the run come close to tripping a guardrail, or mishandle a sensitive topic?

Flags: Guardrail near-misses, PII leakage, crisis-language gaps.

Efficiency Reviewer

Asks: Were tool calls, context, and latency used economically for this outcome?

Flags: Redundant tool calls, context bloat, unnecessary retries.

Rule lifecycle

Every rule has state, history, and a rollback path.

01

Proposed

Produced by reviewers, not yet scored.

02

Candidate

Deduplicated, scored, awaiting promotion.

03

Active

Injected into runs, effectiveness tracked.

04

Underperforming

Effectiveness below threshold - rollback queued.

05

Disabled

Rolled back automatically or manually. Full history retained.
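
The five states above can be modeled as a small state machine. The state names come from the lifecycle; the set of allowed transitions is an assumption, since the text gives the order of the states but not every edge between them.

```python
from enum import Enum


class RuleState(Enum):
    PROPOSED = "proposed"
    CANDIDATE = "candidate"
    ACTIVE = "active"
    UNDERPERFORMING = "underperforming"
    DISABLED = "disabled"


# Assumed transition table; DISABLED is treated as terminal in this sketch.
ALLOWED = {
    RuleState.PROPOSED: {RuleState.CANDIDATE},
    RuleState.CANDIDATE: {RuleState.ACTIVE, RuleState.DISABLED},
    RuleState.ACTIVE: {RuleState.UNDERPERFORMING, RuleState.DISABLED},
    RuleState.UNDERPERFORMING: {RuleState.ACTIVE, RuleState.DISABLED},
    RuleState.DISABLED: set(),
}


class TrackedRule:
    """A rule that records every state it has passed through."""

    def __init__(self, rule_id: str) -> None:
        self.rule_id = rule_id
        self.state = RuleState.PROPOSED
        self.history = [RuleState.PROPOSED]

    def transition(self, new_state: RuleState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        self.history.append(new_state)


rule = TrackedRule("r-17")
for step in (RuleState.CANDIDATE, RuleState.ACTIVE,
             RuleState.UNDERPERFORMING, RuleState.DISABLED):
    rule.transition(step)
```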

Answers

What teams usually ask

Can I turn off auto-promotion?

Yes - every bot has its own Self-Evolving config. You can require manual approval before a candidate rule goes live, and even disable rule generation entirely for regulated workloads.

What happens if a rule makes the bot worse?

We track applied, success, and failure counters on every rule. Once a rule has been applied five or more times and its failure rate crosses the org threshold, it's automatically disabled. The full version history stays intact so you can inspect, edit, or revert.

Is the learning tenant-isolated?

Yes. Rules are scoped to the bot and the organization. One tenant's failure patterns never leak into another tenant's prompt.

How does this work with Mission Control?

Mission Control lets operators supervise live conversations in real time. Agent Evolution then turns those operators' decisions - approvals, interventions, overrides - into durable rules your bots apply on the next run.

Deep dive

Why agents should improve themselves, on a leash

Mutation, not magic

Every generation spawns a small set of challengers — tone tweaked, policy tightened, KB threshold adjusted, reranker swapped. The mutations are structured operations on versioned blocks, not free-form prompt rewrites, so improvements are explainable and reversible.
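
To illustrate what a structured operation on a versioned block could look like, the sketch below applies a single-parameter mutation and produces a readable diff. The `Block` and `Mutation` types, the parameter names, and the starting values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    """A versioned configuration block such as policy, tone, or kb."""
    name: str
    version: int
    params: dict


@dataclass(frozen=True)
class Mutation:
    """One structured change: set a single parameter on a single block."""
    block: str
    param: str
    new_value: object


def apply_mutation(config: dict[str, Block], m: Mutation) -> dict[str, Block]:
    """Return a new config with the block bumped one version; the input is untouched."""
    old = config[m.block]
    new = replace(old, version=old.version + 1,
                  params={**old.params, m.param: m.new_value})
    return {**config, m.block: new}


def describe_diff(before: dict[str, Block], after: dict[str, Block]) -> list[str]:
    """List every changed parameter in a human-readable form."""
    lines = []
    for name, new in after.items():
        old = before[name]
        if old == new:
            continue
        for key, value in new.params.items():
            if old.params.get(key) != value:
                lines.append(f"{name} v{old.version} -> v{new.version}: "
                             f"{key} {old.params.get(key)!r} -> {value!r}")
    return lines


champion = {
    "policy": Block("policy", 3, {"refund_cap_usd": 200}),
    "kb": Block("kb", 2, {"threshold": 0.62, "top_k": 4}),
}
challenger = apply_mutation(champion, Mutation("kb", "top_k", 6))
changes = describe_diff(champion, challenger)
```

Because a mutation never edits the champion in place, reversing it amounts to pointing back at the previous config.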

Judged by tests and humans, not vibes

Challengers compete on the same regression suite the rest of Rylvo uses — plus operator feedback from Mission Control captured as rules. A challenger only earns promotion when it beats the current champion on real metrics, signed off and logged.
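
As a rough sketch of that promotion gate, the function below promotes a challenger only if it beats the champion's pass rate and carries an operator sign-off. The pass rates reuse the example figures shown in the arena above; the function name and the sign-off set are assumptions.

```python
from typing import Optional


def pick_promotion(champion_pass_rate: float,
                   challengers: dict[str, float],
                   signed_off: set[str]) -> Optional[str]:
    """Return the best challenger that beats the champion and is signed off."""
    eligible = {
        name: rate for name, rate in challengers.items()
        if rate > champion_pass_rate and name in signed_off
    }
    if not eligible:
        return None  # the current champion stays in place
    return max(eligible, key=eligible.get)


challengers = {"v11.a": 0.93, "v11.b": 0.95, "v11.c": 0.91, "v11.d": 0.92}
winner = pick_promotion(0.94, challengers, signed_off={"v11.b"})
no_signoff = pick_promotion(0.94, challengers, signed_off=set())
```

Only `v11.b` clears the 94% champion, and without a sign-off even that challenger is held back.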

Promotion is a first-class event, not a deploy

Every champion swap is a signed, append-only event with the full diff and the test evidence that earned the win. Roll back in one click; pin a previous champion per environment; audit the entire lineage of how your bot got smarter — without ever touching a CI pipeline.
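
One way to picture a signed, append-only promotion log is a hash-chained list of HMAC-signed events, sketched below with Python's standard library. The signing scheme, the field names, and the `PromotionLog` class are assumptions; the text does not say how Rylvo signs or stores events.

```python
import hashlib
import hmac
import json


class PromotionLog:
    """Append-only log in which each event is signed and chained to the previous one."""

    def __init__(self, signing_key: bytes) -> None:
        self._key = signing_key
        self._events: list[dict] = []

    def _sign(self, body: dict) -> str:
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def append(self, actor: str, timestamp: str, diff: list[str], evidence: dict) -> dict:
        """Record one champion swap with its diff and test evidence."""
        prev = self._events[-1]["signature"] if self._events else ""
        body = {"actor": actor, "timestamp": timestamp, "diff": diff,
                "evidence": evidence, "prev": prev}
        event = {**body, "signature": self._sign(body)}
        self._events.append(event)
        return event

    def verify(self) -> bool:
        """Check every signature and every link in the chain."""
        prev = ""
        for event in self._events:
            body = {k: v for k, v in event.items() if k != "signature"}
            if body["prev"] != prev:
                return False
            if not hmac.compare_digest(self._sign(body), event["signature"]):
                return False
            prev = event["signature"]
        return True


log = PromotionLog(signing_key=b"demo-key-not-for-production")
log.append("operator@example.com", "2025-01-01T12:00:00Z",
           ["kb v2 -> v3: top_k 4 -> 6"], {"pass_rate": 0.95})
log.append("operator@example.com", "2025-01-02T09:30:00Z",
           ["tone v5 -> v6: style 'plain' -> 'cite-first'"], {"pass_rate": 0.96})
valid_before = log.verify()
log._events[0]["diff"] = ["tampered"]  # simulate an after-the-fact edit
valid_after = log.verify()
```

Chaining each event to the previous signature means that editing or dropping an earlier entry invalidates verification for the log.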

Compounding gains, not flat plateaus

Small wins per generation compound. The fitness chart goes up because each generation seeds the next with the best of the last — your bots get measurably better while you sleep, with full transparency into what changed, why it won, and how to reverse it if you change your mind.

Ready when you are

Ship an agent that keeps getting better.

Turn on Agent Evolution for any bot during setup, or flip it on for an existing bot from the Evolution dashboard. Reviewers, versioning, and rollback are on the house.