Agent Evolution
Improve bots continuously.
Turn on Agent Evolution for any bot and let the system find improvements, run them through review, and deploy the winners — with full audit trails.
No silent regressions
Every rule carries applied / success / failure counters. If a rule hurts, it rolls back - automatically.
Humans decide what ships
Auto-promotion is a setting, not a default. Reviewers and operators stay in the loop for regulated bots.
Audit-grade history
Every rule change is versioned with diffs, author, and timestamp. Roll back to any point in seconds.
Safe by construction
Agents that learn, without learning recklessly
Self-improving systems are only useful if you can trust the improvements. Our evolution engine is built around model-agnostic evaluation, recognized governance frameworks, and guarantees that make every rule promotion reversible.
Model-agnostic evaluation — bring any judge
You choose the evaluator. Run frontier models, open-source models, or a panel of both.
Aligned to AI governance frameworks
NIST AI RMF
Govern, Map, Measure, Manage primitives reflected in our control plane
EU AI Act readiness
Audit trails, human oversight gates, and risk tiering by use-case
ISO/IEC 42001
AI Management System practices — engagement on roadmap
OWASP LLM Top 10
Prompt-injection, output-handling, and supply-chain controls
MITRE ATLAS
Adversarial-ML threat modeling baked into reviewer panels
Safety guarantees — with honest status
We tag every commitment with its real production status. No vague badges, no implied attestations.
Canary + auto-rollback
Every promoted rule rolls out to a small canary slice first. If win-rate, latency, or safety metrics regress, the rule rolls back without a human in the loop.
Reviewer veto by default
Rules touching high-risk capabilities (writes, refunds, deletions) require an explicit human reviewer approval before promotion. The reviewer panel is configurable per-org.
Your data stays your data
Reviewer feedback, evaluation traces, and learned rules are scoped to your org. We do not use your data to train or fine-tune any base foundation model.
Tenant-isolated evals
Eval harnesses, golden datasets, and traces are encrypted at rest with AES-256-GCM and isolated by org_id at every layer of the stack.
Append-only audit
Every rule promotion, reviewer decision, and rollback is recorded with actor, timestamp, and reasoning — queryable for compliance and post-mortems.
Right-to-delete
Evaluation traces and learned rules can be exported or purged on request — designed to align with GDPR data-subject rights.
Framework alignment, not certification. Listing NIST AI RMF, EU AI Act, ISO/IEC 42001, OWASP LLM Top 10, and MITRE ATLAS reflects architectural alignment with these published frameworks — it is not a third-party attestation or certification. Model and brand names shown belong to their respective owners and indicate technical compatibility as evaluator/judge models, not endorsement or partnership.
How it works
Run. Log. Detect. Review. Learn. Improve.
The same loop runs continuously in the background of every bot. Nothing is hidden, nothing is hand-coded, and nothing ships without a trail.
[Interactive figure: Generations arena, fitness over time (mutate · judge · promoteChampion). It shows an evolution log of mutate, judge, promote, and kill events for generations g7 to g10, a chart of success rate per generation, and the next-generation challengers v11.a to v11.d with success rates between 91% and 95%. Challengers are auto-mutated and operator-judged, and are promoted only with signed evidence.]
Mutate
tone · policy · kb · tools
Judge
test suite + operator
Promote
signed · revertible
Compound
fitness rises per gen
Stage by stage
Six stages. One continuous loop.
Each stage of the evolution loop is observable, auditable, and independently configurable.
Run
Every conversation is a learning signal
Each bot run executes with the current set of active rules injected into its system prompt, tool filters, and output checks. Nothing is hidden - every decision is observable from the moment the bot replies.
How it works
- Active rules auto-injected via agentHooks runtime
- Per-bot rule catalog keeps tenants isolated
- No manual prompt engineering to ship a fix
Log
Full-fidelity trace, not a sampled metric
Every run is recorded end-to-end: messages, tool calls, guardrail results, latency, cost, and outcome. The same trace powers debugging, compliance, and the reviewer pipeline.
How it works
- Stored per-bot with bot_id, trace_id, and org scope
- Messages + tool_calls + guardrail checks preserved
- Success, failure, and timeout classification
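A trace record along these lines could carry the fields listed above. This is a minimal sketch, not the product's schema: only bot_id, trace_id, tool_calls, and the org scope come from the text, while the class names, the remaining field names, and the outcome rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


@dataclass
class RunTrace:
    # org_id, bot_id, and trace_id scope the record; other names are hypothetical.
    org_id: str
    bot_id: str
    trace_id: str
    messages: list[dict] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    guardrail_checks: list[dict] = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    timed_out: bool = False
    resolved: bool = False

    def outcome(self) -> Outcome:
        """Classify the run; a timeout takes precedence over success or failure."""
        if self.timed_out:
            return Outcome.TIMEOUT
        return Outcome.SUCCESS if self.resolved else Outcome.FAILURE
```

Keeping the classification as a method on the record means debugging, compliance, and review all read the same outcome from the same trace.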
Detect
Failure clusters surface without a human watching
A detection pass continually scans recent runs for failure signals - timeouts, guardrail blocks, repeated escalations, sentiment drops, and behavioral loops. Matches are clustered into failure patterns with affected-run counts.
How it works
- Timeouts, guardrail triggers, repeat escalations
- Frustration / urgency sentiment keywords
- Loop detection and repetition patterns
- Automatically grouped into failure clusters
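The detection pass above can be pictured as two steps: tag each run with the failure signals it shows, then group runs by signal. The sketch below assumes runs are plain dictionaries; the key names, the keyword list, and the cut-offs (two escalations, three repeated replies) are hypothetical and chosen only to make the example concrete.

```python
from collections import defaultdict

# Hypothetical keyword list; a real deployment would tune this per org.
FRUSTRATION_KEYWORDS = {"frustrated", "useless", "urgent", "asap"}


def detect_signals(run: dict) -> set[str]:
    """Return the failure signals present in one run."""
    signals = set()
    if run.get("timed_out"):
        signals.add("timeout")
    if run.get("guardrail_blocked"):
        signals.add("guardrail_block")
    if run.get("escalations", 0) >= 2:
        signals.add("repeat_escalation")
    words = set(" ".join(run.get("user_messages", [])).lower().split())
    if words & FRUSTRATION_KEYWORDS:
        signals.add("frustration")
    replies = run.get("bot_replies", [])
    # Loop detection: the same bot reply appearing three or more times.
    if any(replies.count(r) >= 3 for r in set(replies)):
        signals.add("loop")
    return signals


def cluster_failures(runs: list[dict]) -> dict[str, list[str]]:
    """Group trace ids by signal; the list length is the affected-run count."""
    clusters = defaultdict(list)
    for run in runs:
        for signal in sorted(detect_signals(run)):
            clusters[signal].append(run["trace_id"])
    return dict(clusters)
```

A cluster whose affected-run count crosses a trigger threshold would then be handed to the reviewers.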
Review
Four specialized reviewers - not one opinion
When a failure cluster crosses its trigger threshold, four LLM reviewers independently analyze it. Each focuses on a single axis - trust, outcome, safety, efficiency - producing insights with confidence scores.
How it works
- Trust - did the bot follow policy and stay in scope?
- Outcome - did the user actually get resolution?
- Safety - any guardrail near-miss or risk exposure?
- Efficiency - wasted tool calls or bloated context?
Learn
Insights become versioned rules
The rule generator deduplicates insights across reviewers (Jaccard similarity), scores confidence on a weighted formula, and proposes candidate rules. High-confidence candidates can auto-promote; everything else waits for a human reviewer.
How it works
- Jaccard de-duplication across reviewer insights
- Confidence = insight + reviewer agreement + pattern frequency
- Auto-promote only when above org threshold
- Max-rule limit evicts lowest-effectiveness active rule
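The steps above can be sketched in a few functions. Jaccard similarity is the size of the intersection of two sets divided by the size of their union; here it is computed over lowercase word sets. The similarity cut-off of 0.6, the confidence weights, and every function name are assumptions for illustration, since the text only states that the formula is weighted.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two texts over their lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def deduplicate(insights: list[str], threshold: float = 0.6) -> list[str]:
    """Keep an insight only if it is not too similar to one already kept."""
    kept: list[str] = []
    for text in insights:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def confidence(insight: float, agreement: float, frequency: float,
               weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of three scores in [0, 1]; the weights are assumed."""
    wi, wa, wf = weights
    return wi * insight + wa * agreement + wf * frequency


def should_auto_promote(score: float, org_threshold: float) -> bool:
    """Auto-promote only when the score is above the org threshold."""
    return score > org_threshold


def enforce_max_rules(effectiveness: dict[str, float], max_rules: int) -> dict[str, float]:
    """Evict the lowest-effectiveness rules until at most max_rules remain."""
    kept = dict(effectiveness)
    while len(kept) > max_rules:
        del kept[min(kept, key=kept.get)]
    return kept
```

For example, "cite the policy first" and "cite policy first" share three of four distinct words, so their similarity is 0.75 and the second would be dropped as a duplicate at the assumed cut-off.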
Improve
Measured, rollback-safe deployment
Active rules ship into the next run automatically. Every application is tracked for success and failure, effectiveness is recomputed continually, and underperforming rules are rolled back automatically, with full version history preserved.
How it works
- Applied / success / failure counters per rule
- Auto-rollback when failureRate > threshold (≥5 runs)
- Version history keeps every rule edit + diff
- Roll back to any prior version in one click
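The rollback condition above amounts to a counter update and one comparison. In this minimal sketch the five-run minimum and the failureRate > threshold test come from the text; the class, method, and function names are hypothetical.

```python
from dataclasses import dataclass

MIN_RUNS = 5  # from the text: rollback is only considered after 5 or more runs


@dataclass
class RuleStats:
    applied: int = 0
    success: int = 0
    failure: int = 0

    def record(self, succeeded: bool) -> None:
        """Count one application of the rule and its result."""
        self.applied += 1
        if succeeded:
            self.success += 1
        else:
            self.failure += 1

    @property
    def failure_rate(self) -> float:
        return self.failure / self.applied if self.applied else 0.0


def should_roll_back(stats: RuleStats, threshold: float) -> bool:
    """True once the rule has enough runs and its failure rate exceeds the threshold."""
    return stats.applied >= MIN_RUNS and stats.failure_rate > threshold
```

The minimum-run guard keeps a rule from being disabled on the strength of one or two unlucky conversations.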
Under the hood
Four reviewers. One decision.
The reviewer panel
Four specialists, one clustered failure.
Trust Reviewer
Asks: Did the bot respect policy, tone, and scope constraints for this org and workflow?
Flags: Policy violations, tone drift, out-of-scope commitments.
Outcome Reviewer
Asks: Did the user actually get resolution, or did the run stall, escalate, or confuse?
Flags: Unresolved intents, missed handoffs, incomplete actions.
Safety Reviewer
Asks: Were there any guardrail near-misses, or were sensitive topics mishandled?
Flags: Guardrail near-misses, PII leakage, crisis-language gaps.
Efficiency Reviewer
Asks: Were tool calls, context, and latency used economically for this outcome?
Flags: Redundant tool calls, context bloat, unnecessary retries.
Rule lifecycle
Every rule has state, history, and a rollback path.
Proposed
Produced by reviewers, not yet scored.
Candidate
Deduplicated, scored, awaiting promotion.
Active
Injected into runs, effectiveness tracked.
Underperforming
Effectiveness below threshold - rollback queued.
Disabled
Rolled back automatically or manually. Full history retained.
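The five states above read naturally as a small state machine. The sketch below uses the state names from the text; the transition table is an assumption, in particular the edges that return a rule to Active, since the text describes only the forward path.

```python
from enum import Enum


class RuleState(Enum):
    PROPOSED = "proposed"
    CANDIDATE = "candidate"
    ACTIVE = "active"
    UNDERPERFORMING = "underperforming"
    DISABLED = "disabled"


# Allowed moves; edges beyond the forward path in the text are assumptions.
TRANSITIONS = {
    RuleState.PROPOSED: {RuleState.CANDIDATE},
    RuleState.CANDIDATE: {RuleState.ACTIVE, RuleState.DISABLED},
    RuleState.ACTIVE: {RuleState.UNDERPERFORMING, RuleState.DISABLED},
    RuleState.UNDERPERFORMING: {RuleState.ACTIVE, RuleState.DISABLED},
    RuleState.DISABLED: {RuleState.ACTIVE},
}


def transition(history: list[RuleState], new_state: RuleState) -> list[RuleState]:
    """Return a new history with new_state appended, or raise if the move is not allowed."""
    current = history[-1]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {new_state.value} is not allowed")
    return history + [new_state]
```

Returning a new list instead of mutating the old one keeps every earlier history intact, which matches the idea that a disabled rule retains its full history.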
Answers
What teams usually ask
Can I turn off auto-promotion?
Yes - every bot has its own Self-Evolving config. You can require manual approval before a candidate rule goes live, and even disable rule generation entirely for regulated workloads.
What happens if a rule makes the bot worse?
We track applied, success, and failure counters on every rule. Once a rule has been applied five or more times and its failure rate crosses the org threshold, it's automatically disabled. The full version history stays intact so you can inspect, edit, or revert.
Is the learning tenant-isolated?
Yes. Rules are scoped to the bot and the organization. One tenant's failure patterns never leak into another tenant's prompt.
How does this work with Mission Control?
Mission Control lets operators supervise live conversations in real time. Agent Evolution then turns the operators' decisions - approvals, interventions, overrides - into durable rules your bots apply on the next run.
Deep dive
Why agents should improve themselves, on a leash
Mutation, not magic
Every generation spawns a small set of challengers — tone tweaked, policy tightened, KB threshold adjusted, reranker swapped. The mutations are structured operations on versioned blocks, not free-form prompt rewrites, so improvements are explainable and reversible.
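One way to picture a structured mutation is as a single named field change applied to an immutable, versioned configuration. The sketch below is illustrative only: the BotConfig fields, the Mutation type, and the sample values are assumptions, not the product's actual data model.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class BotConfig:
    # Hypothetical versioned block; field names are illustrative.
    version: str
    tone: str
    policy_cap_usd: int
    kb_threshold: float
    kb_top_k: int


@dataclass(frozen=True)
class Mutation:
    field_name: str
    new_value: object
    label: str


def apply_mutation(config: BotConfig, mutation: Mutation, version: str) -> BotConfig:
    """Return a new challenger config; the champion is never modified."""
    changed = replace(config, **{mutation.field_name: mutation.new_value})
    return replace(changed, version=version)


def diff(old: BotConfig, new: BotConfig) -> dict[str, tuple]:
    """Map each changed field (other than version) to its (old, new) values."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if f.name != "version" and getattr(old, f.name) != getattr(new, f.name)
    }
```

Because each challenger differs from the champion by a known field change, the diff explains the change and applying the inverse value reverses it.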
Judged by tests and humans, not vibes
Challengers compete on the same regression suite the rest of Rylvo uses — plus operator feedback from Mission Control captured as rules. A challenger only earns promotion when it beats the current champion on real metrics, signed off and logged.
Promotion is a first-class event, not a deploy
Every champion swap is a signed, append-only event with the full diff and the test evidence that earned the win. Roll back in one click; pin a previous champion per environment; audit the entire lineage of how your bot got smarter — without ever touching a CI pipeline.
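The text does not say how promotion events are signed. One common way to get a signed, append-only log is an HMAC hash chain, where each event's signature covers both its payload and the previous event's signature. The sketch below shows that technique with Python's standard hmac and hashlib modules; the event layout and function names are hypothetical, and this is not a description of the product's implementation.

```python
import hashlib
import hmac
import json


def sign_event(key: bytes, prev_signature: str, payload: dict) -> dict:
    """Build an event whose signature covers the payload and the previous signature."""
    body = json.dumps(payload, sort_keys=True)
    message = (prev_signature + body).encode("utf-8")
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"prev": prev_signature, "payload": payload, "signature": signature}


def append(log: list[dict], key: bytes, payload: dict) -> None:
    """Append a signed event that is chained to the current last event."""
    prev = log[-1]["signature"] if log else ""
    log.append(sign_event(key, prev, payload))


def verify(log: list[dict], key: bytes) -> bool:
    """Recompute every signature in order; any edit or reordering breaks the chain."""
    prev = ""
    for event in log:
        expected = sign_event(key, prev, event["payload"])["signature"]
        if event["prev"] != prev or not hmac.compare_digest(expected, event["signature"]):
            return False
        prev = event["signature"]
    return True
```

With this layout, changing the diff or evidence recorded for an old promotion invalidates that event's signature and every one after it, so tampering is detectable during an audit.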
Compounding gains, not flat plateaus
Small wins per generation compound. The fitness chart goes up because each generation seeds the next with the best of the last — your bots get measurably better while you sleep, with full transparency into what changed, why it won, and how to reverse it if you change your mind.
Ready when you are
Ship an agent that keeps getting better.
Turn on Agent Evolution for any bot during setup, or flip it on for an existing bot from the Evolution dashboard. Reviewers, versioning, and rollback are on the house.
