Observe, Learn, and Improve Subsystems¶
Agent Baton includes a closed-loop learning pipeline that automatically collects execution data, discovers patterns in that data, and proposes (or auto-applies) improvements to future plans. The three subsystems -- Observe, Learn, and Improve -- form a feedback cycle that makes the orchestration system better with every task it runs.
THE LEARNING LOOP
+-----------+ +----------+ +-----------+
| OBSERVE | ----> | LEARN | ----> | IMPROVE |
| | | | | |
| traces | | patterns | | scores |
| usage | | budget | | evolution |
| retros | | recs | | VCS |
| telemetry | | | | loop |
| profiles | | | | rollback |
+-----------+ +----------+ +-----------+
^ |
| |
+--------- better next execution -------+
execution --> trace --> usage --> retro --> pattern --> score
--> recommendation --> experiment --> (auto-apply | escalate)
--> rollback if degraded --> circuit breaker
1. Observe Subsystem¶
Package: agent_baton.core.observe
The observe layer captures everything that happens during an orchestrated task execution. It produces the raw data that the learn and improve layers consume.
1.1 Trace Recorder¶
Module: agent_baton/core/observe/trace.py
Classes: TraceRecorder, TraceRenderer
Models: agent_baton/models/trace.py -- TaskTrace, TraceEvent
The trace recorder captures a DAG of timestamped events for each orchestrated task. Each event records what happened, which agent was involved, and at which phase/step of the plan.
What Gets Traced¶
Every significant execution event is recorded as a TraceEvent:
| `event_type` | Description |
|---|---|
| `agent_start` | Agent dispatch begins |
| `agent_complete` | Agent finishes its work |
| `gate_check` | Gate check initiated |
| `gate_result` | Gate check outcome (PASS/FAIL) |
| `escalation` | Issue escalated to human |
| `replan` | Plan amended during execution |
| `file_read` | File read by an agent |
| `file_write` | File written by an agent |
| `decision` | Decision point reached |
TraceEvent Model¶
@dataclass
class TraceEvent:
timestamp: str # ISO 8601
event_type: str # see table above
agent_name: str | None # which agent, if applicable
phase: int # plan phase number
step: int # step within phase
details: dict # event-specific metadata
duration_seconds: float | None # wall-clock duration
TaskTrace Model¶
@dataclass
class TaskTrace:
task_id: str
plan_snapshot: dict # the plan as it existed at start
events: list[TraceEvent] # ordered event list
started_at: str # ISO 8601
completed_at: str | None # ISO 8601, set on completion
outcome: str | None # "SHIP", "REVISE", etc.
Storage¶
Traces are persisted as JSON files at:
.claude/team-context/traces/<task_id>.json
Key Methods¶
| Method | Purpose |
|---|---|
| `start_trace(task_id, plan_snapshot)` | Create a new in-memory trace |
| `record_event(trace, event_type, ...)` | Append an event to a trace |
| `complete_trace(trace, outcome)` | Finalize and write the trace to disk |
| `load_trace(task_id)` | Load a trace from disk by task ID |
| `list_traces(count)` | List the N most recent trace files |
| `get_last_trace()` | Load the most recently modified trace |
TraceRenderer¶
The TraceRenderer class formats traces for human consumption:
- `render_timeline(trace)` -- Full timeline grouped by phase, showing time, event type, agent name, duration, and primary detail.
- `render_summary(trace)` -- Compact one-screen summary with task ID, outcome, duration, event count, agents involved, gate results, and event type breakdown.
1.2 Usage Logger¶
Module: agent_baton/core/observe/usage.py
Class: UsageLogger
Models: agent_baton/models/usage.py -- TaskUsageRecord, AgentUsageRecord
The usage logger tracks token consumption, agent utilization, retry counts, gate results, and outcomes for every orchestrated task.
AgentUsageRecord¶
@dataclass
class AgentUsageRecord:
name: str # agent name
model: str = "sonnet" # LLM model used
steps: int = 1 # steps executed
retries: int = 0 # retry count
gate_results: list[str] = field(default_factory=list) # "PASS" / "FAIL" per gate
estimated_tokens: int = 0 # approximate token usage
duration_seconds: float = 0.0 # wall-clock time
TaskUsageRecord¶
@dataclass
class TaskUsageRecord:
task_id: str
timestamp: str # ISO 8601
agents_used: list[AgentUsageRecord]
total_agents: int = 0
risk_level: str = "LOW" # LOW / MEDIUM / HIGH
sequencing_mode: str = "phased_delivery"
gates_passed: int = 0
gates_failed: int = 0
outcome: str = "" # "SHIP", "SHIP WITH NOTES", "REVISE", "BLOCK"
notes: str = ""
Storage¶
Usage records are appended as JSONL (newline-delimited JSON) to:
.claude/team-context/usage-log.jsonl
JSONL was chosen so records can be appended without loading the entire file into memory.
Key Methods¶
| Method | Purpose |
|---|---|
| `log(record)` | Append a `TaskUsageRecord` to the log |
| `read_all()` | Read all records from the log file |
| `read_recent(count)` | Read the N most recent records |
| `summary()` | Aggregate stats: total tasks, tokens, agents, outcomes, risk distribution |
| `agent_stats(agent_name)` | Per-agent stats: times used, retries, gate pass rate, models |
Summary Aggregation¶
summary() returns:
| Field | Description |
|---|---|
| `total_tasks` | Number of tasks recorded |
| `total_agents_used` | Sum of agents across all tasks |
| `total_estimated_tokens` | Total token consumption |
| `avg_agents_per_task` | Mean agents per task |
| `avg_retries_per_task` | Mean retries per task |
| `outcome_counts` | Dict of outcome -> count |
| `risk_level_counts` | Dict of risk level -> count |
| `agent_frequency` | Dict of agent name -> usage count |
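A minimal sketch of logging one task's usage and reading the aggregate back. It assumes `UsageLogger` needs no constructor arguments; the field values are illustrative.

```python
from agent_baton.core.observe.usage import UsageLogger
from agent_baton.models.usage import AgentUsageRecord, TaskUsageRecord

logger = UsageLogger()

# One record per agent that participated in the task.
backend = AgentUsageRecord(name="backend-engineer", model="sonnet",
                           steps=3, retries=1,
                           gate_results=["PASS", "FAIL", "PASS"],
                           estimated_tokens=42_000, duration_seconds=310.0)

record = TaskUsageRecord(task_id="task-2024-001",
                         timestamp="2024-06-01T12:00:00Z",
                         agents_used=[backend],
                         total_agents=1,
                         risk_level="MEDIUM",
                         gates_passed=2, gates_failed=1,
                         outcome="SHIP WITH NOTES")

logger.log(record)                        # appends one JSONL line
print(logger.summary()["total_tasks"])    # aggregate stats come back as a dict
```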
1.3 Dashboard Generator¶
Module: agent_baton/core/observe/dashboard.py
Class: DashboardGenerator
The dashboard generator produces a comprehensive markdown document from
usage logs and telemetry data. It integrates data from both UsageLogger
and AgentTelemetry into a single report.
Dashboard Sections¶
- Overview -- Total tasks, agent uses, estimated tokens, avg agents/task, avg retries/agent, gate pass rate.
- Outcomes -- Outcome distribution table (SHIP, REVISE, BLOCK, etc.).
- Risk Distribution -- Tasks by risk level (LOW, MEDIUM, HIGH, CRITICAL).
- Model Mix -- Which LLM models were used and how often.
- Agent Utilization -- Per-agent usage count and average retry rate.
- Sequencing Modes -- Distribution of sequencing modes across tasks.
- Telemetry (if data exists) -- Event counts by agent and by type, files read/written counts.
Storage¶
Dashboards are written to:
.claude/team-context/usage-dashboard.md
Key Methods¶
| Method | Purpose |
|---|---|
| `generate()` | Return the full dashboard as a markdown string |
| `write(path)` | Write the dashboard to disk (default path above) |
1.4 Retrospective Engine¶
Module: agent_baton/core/observe/retrospective.py
Class: RetrospectiveEngine
Models: agent_baton/models/retrospective.py -- Retrospective,
AgentOutcome, KnowledgeGap, RosterRecommendation, SequencingNote
Models: agent_baton/models/feedback.py -- RetrospectiveFeedback
The retrospective engine generates structured post-task analysis that captures both quantitative metrics and qualitative signals. It is the primary bridge between observation and learning.
Retrospective Model¶
@dataclass
class Retrospective:
task_id: str
task_name: str
timestamp: str
# Quantitative metrics
agent_count: int
retry_count: int
gates_passed: int
gates_failed: int
risk_level: str
estimated_tokens: int
# Qualitative signals
what_worked: list[AgentOutcome] # agents that performed well
what_didnt: list[AgentOutcome] # agents that had issues
knowledge_gaps: list[KnowledgeGapRecord] # missing knowledge identified
roster_recommendations: list[RosterRecommendation]
sequencing_notes: list[SequencingNote]
Supporting Models¶
- `AgentOutcome` -- Per-agent result with `name`, `worked_well`, `issues`, and `root_cause` fields.
- `KnowledgeGapRecord` -- Structured record of missing knowledge with `description`, `gap_type`, `resolution`, `resolution_detail`, `agent_name`, `task_summary`, `task_type`.
- `RosterRecommendation` -- Suggested agent roster change: `action` (create/improve/remove), `target` (agent or pack name), `reason`.
- `SequencingNote` -- Phase-level observation: `phase`, `observation`, `keep` (whether the phase/gate should be retained).
Implicit Knowledge Gap Detection¶
The engine scans narrative text in what_didnt outcomes for phrases that
indicate implicit knowledge gaps. Detected phrases include:
- "lacked context"
- "didn't know about"
- "assumed incorrectly"
- "no knowledge of"
- "unaware of"
- "missing context"
- "lacked/lacking information"
Implicit gaps are merged with explicit gaps (from KNOWLEDGE_GAP signals
during execution), with explicit gaps taking precedence when descriptions
match.
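A minimal sketch of the phrase-scanning idea. The phrase list comes from above; the helper function and plain substring matching are illustrative, not the engine's actual implementation.

```python
# Phrases that suggest an implicit knowledge gap (from the list above).
GAP_PHRASES = [
    "lacked context", "didn't know about", "assumed incorrectly",
    "no knowledge of", "unaware of", "missing context",
    "lacked information", "lacking information",
]

def find_implicit_gaps(narrative: str) -> list[str]:
    """Return the gap phrases found in a 'what didn't work' narrative."""
    text = narrative.lower()
    return [phrase for phrase in GAP_PHRASES if phrase in text]

# Example: a negative outcome narrative for one agent.
issues = "Assumed incorrectly that the API was versioned and lacked context on auth."
print(find_implicit_gaps(issues))   # ['lacked context', 'assumed incorrectly']
```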
Storage¶
Retrospectives are persisted as two files:
.claude/team-context/retrospectives/<task_id>.md # human-readable
.claude/team-context/retrospectives/<task_id>.json # machine-readable sidecar
RetrospectiveFeedback¶
The RetrospectiveFeedback model aggregates actionable signals from
recent retrospectives for consumption by the planner:
@dataclass
class RetrospectiveFeedback:
roster_recommendations: list[RosterRecommendation]
knowledge_gaps: list[KnowledgeGapRecord]
sequencing_notes: list[SequencingNote]
source_count: int
def agents_to_drop(self) -> set[str]: ... # agents recommended for removal
def agents_to_prefer(self) -> set[str]: ... # agents recommended to prefer
def has_feedback(self) -> bool: ... # any actionable feedback exists?
Key Methods¶
| Method | Purpose |
|---|---|
| `generate_from_usage(usage, ...)` | Generate a retrospective from a usage record plus qualitative input |
| `save(retro)` | Write retrospective to disk (markdown + JSON sidecar) |
| `load(task_id)` | Read a retrospective markdown by task ID |
| `list_retrospectives()` | List all retrospective files |
| `list_recent(count)` | Return the N most recent retrospective files |
| `search(keyword)` | Find retrospectives containing a keyword |
| `load_recent_feedback(limit)` | Aggregate structured feedback from recent JSON sidecars |
| `extract_recommendations()` | Extract roster recommendations from all retrospective markdown |
1.5 Context Profiler¶
Module: agent_baton/core/observe/context_profiler.py
Class: ContextProfiler
Models: agent_baton/models/context_profile.py -- TaskContextProfile,
AgentContextProfile
The context profiler analyses trace data to compute per-agent context efficiency metrics. It helps identify agents that are reading too broadly (wasting context window tokens) or not reading enough (missing relevant files).
Efficiency Score¶
The core metric is the efficiency score, computed from per-agent file counts:
efficiency_score = files_written / max(files_read, 1)
An agent that reads 10 files but writes 2 has an efficiency of 0.2. An agent that reads 5 files and writes 5 has an efficiency of 1.0. Agents with scores below 0.3 are flagged as "broad readers."
Redundancy Rate¶
At the task level, redundancy measures how many file reads were duplicated across agents:
redundancy_rate = redundant_reads / total_files_read
A redundancy rate of 0.4 means 40% of file reads were reading files already read by another agent in the same task.
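The sketch below works both metrics out on illustrative read/write data of the kind pulled from trace `file_read`/`file_write` events; the dictionary shape is an assumption, not the profiler's internal representation.

```python
from collections import Counter

# Illustrative per-agent reads and writes derived from trace events.
agent_files = {
    "backend-engineer": {"read": ["api.py", "models.py", "db.py"], "wrote": ["api.py"]},
    "test-engineer":    {"read": ["api.py", "tests/test_api.py"],  "wrote": ["tests/test_api.py"]},
}

# Per-agent efficiency: files written divided by files read (floor of 1).
for name, files in agent_files.items():
    efficiency = len(files["wrote"]) / max(len(files["read"]), 1)
    print(f"{name}: efficiency={efficiency:.2f}")

# Task-level redundancy: reads of files already read by another agent.
all_reads = [path for files in agent_files.values() for path in files["read"]]
read_counts = Counter(all_reads)
redundant = sum(count - 1 for count in read_counts.values())
redundancy_rate = redundant / len(all_reads)
print(f"redundancy_rate={redundancy_rate:.2f}")   # 1 duplicated read of api.py out of 5 -> 0.20
```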
Models¶
@dataclass
class AgentContextProfile:
agent_name: str
files_read: list[str]
files_written: list[str]
files_referenced: list[str] # files both read AND written
context_tokens_estimate: int # approx input tokens (reads * 500)
output_tokens_estimate: int # approx output tokens (writes * 200)
efficiency_score: float # writes / max(reads, 1)
@dataclass
class TaskContextProfile:
task_id: str
agent_profiles: list[AgentContextProfile]
total_files_read: int
unique_files_read: int
redundant_reads: int
redundancy_rate: float
created_at: str
Storage¶
Context profiles are persisted as JSON files at:
.claude/team-context/context-profiles/<task_id>.json
Key Methods¶
| Method | Purpose |
|---|---|
| `profile_task(task_id)` | Build a profile from a task's trace data |
| `save_profile(profile)` | Persist a profile to disk |
| `load_profile(task_id)` | Load a profile from disk |
| `list_profiles(count)` | List the N most recent profile files |
| `agent_summary(agent_name)` | Aggregate stats for an agent across all profiles |
| `generate_report()` | Produce a markdown context efficiency report |
1.6 Telemetry¶
Module: agent_baton/core/observe/telemetry.py
Class: AgentTelemetry
Data class: TelemetryEvent (defined in same module)
Telemetry captures real-time agent tool-call events at a finer granularity than traces. Where traces capture high-level orchestration events, telemetry captures individual tool invocations.
TelemetryEvent¶
@dataclass
class TelemetryEvent:
timestamp: str
agent_name: str
event_type: str # "tool_call", "file_read", "file_write", "bash_exec", "error"
tool_name: str # which tool was called
file_path: str # file involved, if applicable
duration_ms: int # wall-clock milliseconds
details: str # additional context
Storage¶
Telemetry events are appended as JSONL to:
.claude/team-context/telemetry.jsonl
Key Methods¶
| Method | Purpose |
|---|---|
| `log_event(event)` | Append a telemetry event to the log |
| `read_events(agent_name)` | Read all events, optionally filtered by agent |
| `read_recent(count)` | Read the N most recent events |
| `summary()` | Aggregate: total events, by agent, by type, files read/written |
| `clear()` | Clear the telemetry log (between tasks) |
1.7 Data Archiver¶
Module: agent_baton/core/observe/archiver.py
Class: DataArchiver
The data archiver provides retention-based cleanup for all execution
artifacts. It keeps the .claude/team-context/ directory manageable as
task volume grows.
What Gets Cleaned¶
| Category | Path | File Pattern |
|---|---|---|
| Executions | `executions/` | Subdirectories |
| Traces | `traces/` | `*.json` |
| Events | `events/` | `*.jsonl` |
| Retrospectives | `retrospectives/` | All files (`.md` + `.json`) |
| Context Profiles | `context-profiles/` | `*.json` |
Additionally, JSONL logs (telemetry.jsonl) are rotated by keeping only
the last 10,000 lines.
Key Methods¶
| Method | Purpose |
|---|---|
| `scan(retention_days)` | Read-only scan for files older than the retention period |
| `cleanup(retention_days, dry_run)` | Remove old files (with dry-run option) |
| `summary(retention_days)` | Human-readable report of what would be cleaned |
Default retention is 90 days.
1.8 Observe CLI Commands¶
| Command | Description |
|---|---|
| `baton trace` | List recent traces |
| `baton trace TASK_ID` | Show timeline for a specific task |
| `baton trace --last` | Show timeline for the most recent task |
| `baton trace --summary TASK_ID` | Show compact summary |
| `baton usage` | Show aggregate usage summary |
| `baton usage --recent N` | Show the N most recent usage records |
| `baton usage --agent NAME` | Show stats for a specific agent |
| `baton dashboard` | Print the usage dashboard |
| `baton dashboard --write` | Write the dashboard to disk |
| `baton telemetry` | Show telemetry summary |
| `baton telemetry --agent NAME` | Show events for a specific agent |
| `baton telemetry --recent N` | Show the N most recent events |
| `baton telemetry --clear` | Clear the telemetry log |
| `baton context-profile` | List recent context profiles |
| `baton context-profile TASK_ID` | Show profile for a specific task |
| `baton context-profile --generate TASK_ID` | Generate a profile from trace data |
| `baton context-profile --agent NAME` | Show aggregate stats for an agent |
| `baton context-profile --report` | Print a full context efficiency report |
| `baton retro` | List recent retrospectives |
| `baton retro --task-id ID` | Show a specific retrospective |
| `baton retro --search KEYWORD` | Search retrospectives by keyword |
| `baton retro --recommendations` | Extract roster recommendations |
| `baton cleanup` | Remove old execution artifacts |
| `baton cleanup --retention-days N` | Set retention period (default 90 days) |
| `baton cleanup --dry-run` | Preview what would be removed |
1.9 Observe API Endpoints¶
Module: agent_baton/api/routes/observe.py
| Endpoint | Method | Description |
|---|---|---|
| `/dashboard` | GET | Pre-rendered usage dashboard as markdown |
| `/traces/{task_id}` | GET | Structured trace JSON for a completed task |
| `/usage` | GET | Usage records with optional `since` and `agent` filters |
2. Learn Subsystem¶
Package: agent_baton.core.learn
The learn layer analyses the data collected by the observe layer and discovers actionable patterns. It turns raw execution history into knowledge the system can use for better planning.
2.1 Pattern Learner¶
Module: agent_baton/core/learn/pattern_learner.py
Class: PatternLearner
Model: agent_baton/models/pattern.py -- LearnedPattern
Status: Experimental -- built and tested but not yet validated with real usage data.
The pattern learner analyses completed task records from the usage log,
groups them by sequencing_mode (used as a proxy for task type), and
computes per-group statistics. Groups that meet minimum thresholds are
emitted as LearnedPattern objects.
How Patterns Are Discovered¶
1. Read all `TaskUsageRecord` entries from the usage log.
2. Group records by `sequencing_mode`.
3. For each group with >= `min_sample_size` (default 5) records:
    - Compute `success_rate` = count(outcome == "SHIP") / total.
    - Find the most common agent combination (sorted tuple of agent names).
    - Compute `avg_token_cost` from successful tasks (falling back to all tasks if none succeeded).
    - Compute `avg_retries` and `avg_gate_rate`.
    - Apply the confidence formula.
4. Filter by `min_confidence` (default 0.7).
5. Return patterns sorted by confidence, descending.
Confidence Formula¶
confidence = success_rate * min(sample_size / 15, 1.0)
This ramps linearly from 0 at 0 samples to 1.0 at 15+ samples with perfect success. The divisor 15 is a calibration constant.
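A small worked sketch of that formula, reconstructed from the description above; the exact expression in `pattern_learner.py` may differ.

```python
def pattern_confidence(success_rate: float, sample_size: int, calibration: int = 15) -> float:
    """Heuristic confidence: success rate scaled by a linear sample-size ramp.

    Reconstructed from the prose above; the exact expression in
    pattern_learner.py may differ.
    """
    return success_rate * min(sample_size / calibration, 1.0)

print(pattern_confidence(1.0, 15))   # 1.0  -- perfect success at the calibration point
print(pattern_confidence(0.8, 5))    # ~0.27 -- small samples are heavily discounted
```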
LearnedPattern Model¶
@dataclass
class LearnedPattern:
pattern_id: str # e.g. "phased_delivery-001"
task_type: str # derived from sequencing_mode
stack: str | None # tech stack qualifier, None = any
recommended_template: str # human-readable workflow description
recommended_agents: list[str] # best agent combination observed
confidence: float # [0.0, 1.0]
sample_size: int # records that contributed
success_rate: float # fraction with outcome=="SHIP"
avg_token_cost: int # mean tokens across successful tasks
evidence: list[str] # contributing task_ids
created_at: str # ISO 8601
updated_at: str # ISO 8601
Storage¶
Patterns are persisted to:
.claude/team-context/learned-patterns.json
Knowledge Gap Querying¶
The pattern learner also provides knowledge_gaps_for(agent_name,
task_type) which reads retrospective JSON sidecars and returns prior
knowledge gap records matching a given agent. This enables the planner
to attach relevant knowledge packs to agents that previously lacked
context.
Key Methods¶
| Method | Purpose |
|---|---|
| `analyze(min_sample_size, min_confidence)` | Read the usage log and return qualifying patterns |
| `refresh(min_sample_size, min_confidence)` | Re-analyse and write to learned-patterns.json |
| `load_patterns()` | Read patterns from disk |
| `get_patterns_for_task(task_type, stack)` | Filter stored patterns by task type and stack |
| `recommend_sequencing(task_type)` | Return the optimal agent sequence for a task type |
| `knowledge_gaps_for(agent_name, task_type)` | Query prior knowledge gaps for an agent |
| `generate_report()` | Generate a markdown report of all patterns |
2.2 Budget Tuner¶
Module: agent_baton/core/learn/budget_tuner.py
Class: BudgetTuner
Model: agent_baton/models/budget.py -- BudgetRecommendation
Status: Experimental -- built and tested but not yet validated with real usage data.
The budget tuner analyses historical token usage and recommends budget tier changes. It ensures tasks are not over-provisioned (wasting tokens) or under-provisioned (running out of budget).
Budget Tiers¶
| Tier | Token Range | Midpoint |
|---|---|---|
| Lean | 0 -- 50,000 | 25,000 |
| Standard | 50,001 -- 500,000 | 275,000 |
| Full | 500,001+ | 750,000 |
Recommendation Rules¶
For each group of tasks (grouped by sequencing_mode) with >= 3 records:
- Upgrade: median usage > 80% of current tier's upper bound.
- Downgrade: 95th-percentile usage < current tier's lower bound.
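The sketch below applies those two rules to a group of token samples, using the tier bounds from the table above; the function name, tier inference, and percentile method are illustrative assumptions rather than BudgetTuner's actual code.

```python
import statistics

# Tier bounds from the table above: (lower, upper); Full has no upper bound.
TIERS = {"lean": (0, 50_000), "standard": (50_001, 500_000), "full": (500_001, None)}

def recommend_tier(current_tier: str, token_samples: list[int]) -> str:
    """Apply the upgrade/downgrade rules sketched above (illustrative only)."""
    if len(token_samples) < 3:                            # minimum sample size
        return current_tier
    lower, upper = TIERS[current_tier]
    median = statistics.median(token_samples)
    p95 = statistics.quantiles(token_samples, n=20)[-1]   # rough 95th percentile
    order = list(TIERS)
    idx = order.index(current_tier)
    if upper is not None and median > 0.8 * upper and idx < len(order) - 1:
        return order[idx + 1]                 # upgrade: median above 80% of the tier ceiling
    if p95 < lower and idx > 0:
        return order[idx - 1]                 # downgrade: even the p95 fits the tier below
    return current_tier

print(recommend_tier("standard", [30_000, 35_000, 40_000, 32_000]))   # -> "lean"
```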
Confidence Formula¶
Auto-Apply Guardrail¶
Only downgrade recommendations are eligible for auto-application. Budget upgrades (to more expensive tiers) always require human approval.
BudgetRecommendation Model¶
@dataclass
class BudgetRecommendation:
task_type: str # grouping key
current_tier: str # inferred from median usage
recommended_tier: str # after applying rules
reason: str # human-readable explanation
avg_tokens_used: int
median_tokens_used: int
p95_tokens_used: int
sample_size: int
confidence: float
potential_savings: int # estimated tokens saved per task (downgrades only)
Storage¶
Recommendations are persisted to:
.claude/team-context/budget-recommendations.json
Key Methods¶
| Method | Purpose |
|---|---|
| `analyze()` | Read the usage log and return budget recommendations |
| `recommend()` | Generate a human-readable markdown report |
| `save_recommendations()` | Analyse and write to disk |
| `auto_apply_recommendations(threshold)` | Return only downgrade recommendations above the confidence threshold |
| `load_recommendations()` | Read previously saved recommendations |
2.3 Unified Recommender¶
Module: agent_baton/core/learn/recommender.py
Class: Recommender
Model: agent_baton/models/improvement.py -- Recommendation
The recommender aggregates analysis from all engines (scorer, pattern learner, budget tuner, evolution engine) into a unified, deduplicated, ranked list of recommendations.
Recommendation Categories¶
| Category | Description |
|---|---|
| `agent_prompt` | Changes to agent prompt definitions |
| `budget_tier` | Budget tier adjustments |
| `routing` | Agent routing weight changes |
| `sequencing` | Task sequencing pattern changes |
| `gate_config` | Gate configuration changes |
| `roster` | Agent roster changes |
Guardrail Enforcement¶
The recommender enforces strict safety rules on what can be auto-applied:
| Category | Auto-Apply Rule |
|---|---|
| `agent_prompt` | NEVER auto-apply. Always escalate to human. |
| `budget_tier` | Only downgrades (cheaper tier). Never auto-upgrade. |
| `routing` | Only if confidence >= 0.9 and the change is additive. |
| `sequencing` | Only if confidence >= 0.8 and success rate >= 0.9. |
| Other | Only LOW-risk recommendations auto-apply. |
Recommendation Model¶
@dataclass
class Recommendation:
rec_id: str # unique identifier
category: str # see categories above
target: str # agent name, task type, etc.
action: str # "downgrade budget", "evolve prompt", etc.
description: str # human-readable explanation
evidence: list[str] # supporting data references
confidence: float # [0.0, 1.0]
risk: str # "low", "medium", "high"
auto_applicable: bool # whether auto-apply is permitted
proposed_change: dict # machine-readable change specification
rollback_spec: dict # how to undo the change
created_at: str # ISO 8601
status: str = "proposed" # proposed -> applied -> rolled_back | rejected
Key Methods¶
| Method | Purpose |
|---|---|
| `analyze()` | Run all engines and return deduplicated, ranked recommendations |
2.4 How Learn Feeds Back Into Planning¶
The learn subsystem's outputs are consumed by the engine's planner:
- Learned patterns inform agent selection and sequencing. When the planner creates a new plan, it queries `PatternLearner.recommend_sequencing()` to find the best agent sequence for the task type.
- Budget recommendations inform token allocation. The planner can query `BudgetTuner.load_recommendations()` to right-size budget tiers.
- Knowledge gaps (from `PatternLearner.knowledge_gaps_for()`) tell the planner which knowledge packs to attach to agents that previously lacked context.
- Retrospective feedback (via `RetrospectiveFeedback`) provides `agents_to_drop()` and `agents_to_prefer()` sets that the planner uses to adjust agent routing.
2.5 Honest Current State and Learning Cycle Pipeline¶
Last updated: 2026-04-17
This section documents what each learn-subsystem component actually does today, which parts are deprecated, and how the replacement pipeline works.
Component Status¶
PerformanceScorer — Working. Computes per-agent first_pass_rate,
retry_rate, gate_pass_rate, negative_mentions, and knowledge_gaps_cited
by reading usage-log.jsonl or the SQLite backend. These are rate calculations
across all historical records with no statistical significance test: an agent
with 3 uses and 2 failures shows a 67% failure rate with the same weight as
an agent with 50 uses and 33 failures.
PatternLearner — Working. Groups completed tasks by sequencing_mode and
identifies the most common agent combination per group. The evidence_strength
score (formerly called "confidence") is a heuristic ramp, not a statistically
validated confidence interval. Minimum sample requirement: min_sample_size
records (default 3) before a group is considered.
BudgetTuner — Working. Reads usage records and compares actual token
consumption to the nominal midpoint of the current budget tier. Suggestions
are output to budget-recommendations.json — they are proposals, not
decisions.
ImprovementLoop — Working. Calls TriggerEvaluator, then Recommender
(which aggregates PerformanceScorer, PatternLearner, BudgetTuner outputs),
classifies each recommendation as auto-applicable or escalated, and persists
an ImprovementReport. Auto-apply guardrails: prompt changes never auto-apply;
budget upgrades never auto-apply; routing reductions never auto-apply; all
others only if risk == "low" and evidence_strength >= threshold.
LearnedOverrides — Working. Persists operational corrections (agent flavor
mismatches, gate command adjustments) to learned-overrides.json. Applied
automatically on the next execution without going through the full pipeline.
ExperimentManager (agent_baton/core/improve/experiments.py) —
DEPRECATED. The module recorded before/after metrics but never ran
concurrent A/B groups. Do not add new callers. The replacement is
before/after scorecard comparison across learning cycles (see Pipeline
below).
PromptEvolutionEngine (agent_baton/core/improve/evolution.py) —
DEPRECATED. Was a template-based rule engine that matched scorecard
thresholds to canned suggestion strings. Suggestions were generic, not
derived from actual retrospective content. Do not add new callers. The
replacement is the learning-analyst agent (D2).
Evidence Strength vs. Confidence¶
Across the codebase, fields previously named confidence have been
relabeled to evidence_strength (or equivalent) to avoid implying
statistical validation. The formula is a heuristic ramp over sample size and success rate, not a validated confidence interval.
The default threshold for acting on a pattern is 0.35 (reduced from 0.7). In the CLI, patterns are displayed as "Seen N times, X% success rate" rather than as a numeric confidence score.
The Learning Cycle Pipeline¶
The learning system is being refactored from inline analysis code into a repeatable baton execution plan template at:
templates/learning-cycle-plan.json
| Phase | Agent | What It Does |
|---|---|---|
| COLLECT | test-engineer | SQL queries gather scorecards, pattern data, retrospective summaries, and knowledge gaps from the last N executions. Writes a data bundle to the team-context directory. |
| ANALYZE | learning-analyst | Reads the collected data bundle. Identifies patterns and failures with specific evidence citations. |
| PROPOSE | learning-analyst | Outputs actionable recommendations. Each proposal names the specific agent, the specific failure mode, and the specific change. |
| REVIEW | (APPROVAL gate) | Human reviews proposals before anything is applied. |
| APPLY | backend-engineer | Writes approved changes to learned-overrides.json, agent definition files, or knowledge packs. |
| DOCUMENT | documentation-architect | Records what changed, why, and what outcome is expected as a retrospective entry. |
Operational corrections (routing mismatches, gate command fixes) bypass
the pipeline and go directly to the LearnedOverrides auto-apply path.
Triggering a Learning Cycle¶
Manual trigger:
baton learn run-cycle # create the plan and print it
baton learn run-cycle --run # create and execute immediately
baton learn run-cycle --dry-run # print the baton execute run command without executing
Counter-based trigger:
TriggerEvaluator tracks completed executions since the last learning
cycle. When the count reaches the threshold (configurable via
LEARNING_TRIGGER_COUNT env var, default 10), baton execute status
reports that a learning cycle is due.
The cycle does NOT run automatically — this is a flag, not an auto-trigger. The operator or daemon runs it explicitly.
Capabilities Summary¶
| Capability | Status | Notes |
|---|---|---|
| Track per-agent success/failure rates | Working | PerformanceScorer |
| Identify common agent sequences | Working | PatternLearner |
| Suggest budget tier adjustments | Working | BudgetTuner |
| Auto-apply safe operational corrections | Working | LearnedOverrides |
| Human-reviewed proposal pipeline | Working | ProposalManager |
| Statistical significance tests | Not present | Not planned |
| Concurrent A/B experiments | Not present | ExperimentManager deprecated |
| Template-based prompt suggestions | Deprecated | PromptEvolutionEngine deprecated |
| Agent-driven retrospective analysis | In progress | learning-analyst agent (D2) |
| Repeatable learning cycle template | In progress | templates/learning-cycle-plan.json (D2) |
3. Improve Subsystem¶
Package: agent_baton.core.improve
The improve layer acts on the patterns and recommendations produced by the learn layer. It scores agents, proposes prompt changes, manages agent version control, runs experiments, and handles rollbacks.
3.1 Performance Scorer¶
Module: agent_baton/core/improve/scoring.py
Classes: PerformanceScorer, AgentScorecard
The performance scorer computes per-agent health assessments by combining quantitative metrics from usage logs with qualitative signals from retrospectives.
AgentScorecard¶
@dataclass
class AgentScorecard:
agent_name: str
times_used: int
first_pass_rate: float # % of uses with 0 retries
retry_rate: float # average retries per use
gate_pass_rate: float | None # % of gates passed
total_estimated_tokens: int
avg_tokens_per_use: int
models_used: dict[str, int]
# Qualitative signals from retrospectives
positive_mentions: int
negative_mentions: int
knowledge_gaps_cited: int
Health Rating¶
The health property applies a simple decision tree:
| Condition | Health |
|---|---|
| `times_used == 0` | "unused" |
| `first_pass_rate >= 0.8` and `negative_mentions == 0` | "strong" |
| `first_pass_rate >= 0.5` | "adequate" |
| Otherwise | "needs-improvement" |
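A sketch of that decision tree as a plain function; in the codebase it is the `health` property on `AgentScorecard`.

```python
def health(times_used: int, first_pass_rate: float, negative_mentions: int) -> str:
    """Decision tree from the table above (illustrative standalone form)."""
    if times_used == 0:
        return "unused"
    if first_pass_rate >= 0.8 and negative_mentions == 0:
        return "strong"
    if first_pass_rate >= 0.5:
        return "adequate"
    return "needs-improvement"

print(health(times_used=12, first_pass_rate=0.75, negative_mentions=0))  # "adequate"
```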
Qualitative Signal Extraction¶
The scorer scans retrospective markdown for agent name mentions within specific sections:
- Mentions in "What Worked" sections count as positive.
- Mentions in "What Didn't" sections count as negative.
- Mentions in "Knowledge Gaps" sections count as knowledge gaps cited.
When a StorageBackend is provided, retrospectives are read from the
database rather than the filesystem, ensuring SQLite-mode projects return
current data.
Trend Detection¶
detect_trends(agent_name, window) uses linear regression on the
agent's binary first-pass success (1 if zero retries, 0 otherwise) over
the last N tasks:
| Slope | Trend |
|---|---|
| > 0.02 | "improving" |
| < -0.02 | "degrading" |
| Otherwise | "stable" |
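A sketch of the slope classification over a binary first-pass history, using an ordinary least-squares slope; `detect_trends` may compute the regression differently.

```python
def trend(first_pass_history: list[int]) -> str:
    """Classify a trend from binary first-pass outcomes (1 = no retries, 0 = retried).

    Ordinary least-squares slope over the task index; thresholds from the table above.
    """
    n = len(first_pass_history)
    if n < 2:
        return "stable"
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(first_pass_history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, first_pass_history))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    if slope > 0.02:
        return "improving"
    if slope < -0.02:
        return "degrading"
    return "stable"

print(trend([0, 0, 1, 0, 1, 1, 1]))   # "improving"
```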
Key Methods¶
| Method | Purpose |
|---|---|
score_agent(agent_name) |
Compute a scorecard for a single agent |
score_all() |
Compute scorecards for all agents in usage logs |
generate_report() |
Generate a full markdown scorecard report |
detect_trends(agent_name, window) |
Detect performance trend over recent tasks |
write_report(path) |
Write the scorecard report to disk |
Storage¶
Reports are written to:
.claude/team-context/agent-scorecards.md
3.2 Prompt Evolution Engine¶
Module: agent_baton/core/improve/evolution.py
Classes: PromptEvolutionEngine, EvolutionProposal
Status: Deprecated (see the component status notes in Section 2.5) -- retained here for reference; do not add new callers.
The prompt evolution engine analyses agent performance and generates structured proposals for prompt improvements. It identifies underperforming agents and suggests specific changes based on their failure patterns.
EvolutionProposal¶
@dataclass
class EvolutionProposal:
agent_name: str
scorecard: AgentScorecard
issues: list[str] # problems identified
suggestions: list[str] # proposed changes
priority: str # "high" (needs-improvement) or "normal" (adequate)
timestamp: str
Issue Detection Rules¶
| Condition | Issue | Suggestion |
|---|---|---|
first_pass_rate < 0.5 |
Low first-pass rate | Add specific failure mode instructions; include negative examples |
first_pass_rate < 0.8 |
Moderate first-pass rate | Review retry patterns in retrospectives |
retry_rate > 1.0 |
High retry rate | Tighten acceptance criteria |
gate_pass_rate < 0.7 |
Low gate pass rate | Add quality checklist to prompt |
negative_mentions > 0 |
Negative retrospective mentions | Address specific failures from "What Didn't Work" |
knowledge_gaps_cited > 0 |
Knowledge gaps | Create/update knowledge packs; add "Before Starting" section |
Workflow¶
- `analyze()` -- Score all agents, identify underperformers, and generate proposals sorted by priority (high first, then by first-pass rate ascending).
- `propose_for_agent(name)` -- Generate a proposal for a specific agent.
- `save_proposals(proposals)` -- Write proposals as markdown files.
- `generate_report()` / `write_report()` -- Produce and write the summary report.
Storage¶
Proposals are written to:
.claude/team-context/evolution-proposals/<agent_name>.md
Reports are written to:
.claude/team-context/evolution-report.md
3.3 Agent Version Control¶
Module: agent_baton/core/improve/vcs.py
Classes: AgentVersionControl, ChangelogEntry
The VCS system tracks changes to agent definition files with timestamped backups and a changelog. It enables safe prompt evolution by ensuring every change is reversible.
ChangelogEntry¶
@dataclass
class ChangelogEntry:
timestamp: str # ISO 8601
agent_name: str
action: str # "created", "modified", "archived"
summary: str # what changed and why
backup_path: str # path to .bak file if modified
Backup Strategy¶
Backups are stored at:
The changelog is stored at:
New entries are prepended after the header so the most recent change appears first.
Key Methods¶
| Method | Purpose |
|---|---|
| `backup_agent(agent_path)` | Create a timestamped backup before modification |
| `log_change(entry)` | Append a changelog entry |
| `read_changelog()` | Parse and return all changelog entries |
| `get_agent_history(agent_name)` | Return changelog entries for a specific agent |
| `list_backups(agent_name)` | List backup files, optionally filtered by agent |
| `restore_backup(backup_path, target_path)` | Restore a backup (creates a safety backup first) |
| `track_modification(agent_path, summary)` | Back up + log a change (the primary API) |
| `track_creation(agent_path, summary)` | Log creation of a new agent |
3.4 Improvement Loop¶
Module: agent_baton/core/improve/loop.py
Class: ImprovementLoop
Model: agent_baton/models/improvement.py -- ImprovementReport,
ImprovementConfig
The improvement loop is the closed-loop orchestrator that wires together all the subsystems. It checks triggers, generates recommendations, classifies them, auto-applies safe ones, creates experiments to track impact, and escalates risky ones to humans.
ImprovementLoop Cycle¶
run_cycle(force=False)
|
+-- Check circuit breaker (3+ rollbacks in 7 days -> skip)
|
+-- Check triggers (enough new data since last analysis?)
|
+-- Detect anomalies (failure rates, gate failures, budget overruns)
|
+-- Generate recommendations (via Recommender.analyze())
|
+-- Persist all recommendations (via ProposalManager)
|
+-- Classify each recommendation:
| |
| +-- Auto-apply? (low risk + high confidence + permitted category)
| | -> Apply and create an experiment to track impact
| |
| +-- Escalate? (high risk, or prompt change, or low confidence)
| -> Record for human review
|
+-- Evaluate running experiments (auto-rollback degraded ones)
|
+-- Mark analysis as done (update trigger watermark)
|
+-- Save and return ImprovementReport
Auto-Apply Classification¶
A recommendation is auto-applied only when ALL of these conditions are met:
- `rec.category != "agent_prompt"` (prompt changes NEVER auto-apply)
- `rec.auto_applicable == True` (set by the Recommender's guardrails)
- `rec.risk == "low"`
- `rec.confidence >= config.auto_apply_threshold` (default 0.8)
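A sketch of that predicate over a minimal stand-in for the `Recommendation` fields it reads; the helper name is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Rec:
    # Minimal stand-in for the Recommendation fields used by the check.
    category: str
    auto_applicable: bool
    risk: str
    confidence: float

def can_auto_apply(rec: Rec, auto_apply_threshold: float = 0.8) -> bool:
    """All four conditions from the list above must hold."""
    return (rec.category != "agent_prompt"
            and rec.auto_applicable
            and rec.risk == "low"
            and rec.confidence >= auto_apply_threshold)

print(can_auto_apply(Rec("budget_tier", True, "low", 0.91)))    # True
print(can_auto_apply(Rec("agent_prompt", True, "low", 0.99)))   # False -- prompts always escalate
```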
ImprovementReport Model¶
@dataclass
class ImprovementReport:
report_id: str
timestamp: str
skipped: bool # True if cycle was skipped
reason: str # why skipped, if applicable
anomalies: list[dict] # detected anomalies
recommendations: list[dict] # all recommendations generated
auto_applied: list[str] # rec_ids that were auto-applied
escalated: list[str] # rec_ids escalated for human review
active_experiments: list[str] # experiment_ids created this cycle
ImprovementConfig Model¶
@dataclass
class ImprovementConfig:
auto_apply_threshold: float = 0.8 # minimum confidence for auto-apply
paused: bool = False # manual pause switch
Storage¶
Reports are stored at:
.claude/team-context/improvements/reports/<report_id>.json
Key Methods¶
| Method | Purpose |
|---|---|
| `run_cycle(force)` | Run a complete improvement cycle |
| `evaluate_experiments()` | Evaluate all running experiments |
| `load_reports()` | Load all improvement reports |
3.5 Trigger Evaluator¶
Module: agent_baton/core/improve/triggers.py
Class: TriggerEvaluator
Model: agent_baton/models/improvement.py -- TriggerConfig, Anomaly
The trigger evaluator decides when to run the improvement pipeline and detects system anomalies.
Analysis Triggers¶
Analysis runs when both conditions are met:
- At least `min_tasks_before_analysis` (default 10) total tasks exist.
- At least `analysis_interval_tasks` (default 5) new tasks have been recorded since the last analysis.
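A sketch of the two-condition check, assuming the watermark (task count at the last analysis) is passed in directly; the real TriggerEvaluator reads it from its persisted state file.

```python
def should_run_analysis(total_tasks: int, tasks_at_last_analysis: int,
                        min_tasks_before_analysis: int = 10,
                        analysis_interval_tasks: int = 5) -> bool:
    """Both trigger conditions from the list above must hold."""
    enough_history = total_tasks >= min_tasks_before_analysis
    enough_new = (total_tasks - tasks_at_last_analysis) >= analysis_interval_tasks
    return enough_history and enough_new

print(should_run_analysis(total_tasks=12, tasks_at_last_analysis=8))   # False -- only 4 new tasks
print(should_run_analysis(total_tasks=15, tasks_at_last_analysis=8))   # True
```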
TriggerConfig¶
@dataclass
class TriggerConfig:
min_tasks_before_analysis: int = 10
analysis_interval_tasks: int = 5
agent_failure_threshold: float = 0.3
gate_failure_threshold: float = 0.2
budget_deviation_threshold: float = 0.5
confidence_threshold: float = 0.7
Anomaly Detection¶
detect_anomalies() scans usage data for four types of anomalies:
| Anomaly Type | Condition | Severity |
|---|---|---|
| `high_failure_rate` | Agent failure rate > 30% | high if > 50%, else medium |
| `retry_spike` | Average retries > 2.0 per agent | medium |
| `high_gate_failure_rate` | Gate failure rate > 20% | high if > 40%, else medium |
| `budget_overrun` | Token usage > 50% above tier midpoint | medium |
Anomaly Model¶
@dataclass
class Anomaly:
anomaly_type: str # see table above
severity: str # "low", "medium", "high"
agent_name: str # which agent, if applicable
metric: str # metric that triggered the anomaly
current_value: float # actual value observed
threshold: float # threshold that was exceeded
sample_size: int # data points that contributed
evidence: list[str] # human-readable evidence strings
Trigger State¶
The trigger evaluator persists its watermark (the task count at the time of the last analysis) to:
.claude/team-context/improvement-trigger-state.json
3.6 Experiment Manager¶
Module: agent_baton/core/improve/experiments.py
Class: ExperimentManager
Model: agent_baton/models/improvement.py -- Experiment
The experiment manager tracks the impact of applied recommendations by creating experiments that compare post-change performance against a baseline.
Experiment Constraints¶
- Maximum 2 active experiments per agent.
- Minimum 5 samples before evaluation.
- Default maximum duration: 14 days.
Experiment Model¶
@dataclass
class Experiment:
experiment_id: str
recommendation_id: str
hypothesis: str # what we expect to happen
metric: str # "first_pass_rate", "gate_pass_rate", etc.
baseline_value: float # pre-change metric value
target_value: float # expected improvement (baseline * 1.05)
agent_name: str
started_at: str
min_samples: int = 5
max_duration_days: int = 14
status: str = "running" # running -> concluded | rolled_back
samples: list[float] = field(default_factory=list) # observed metric values
result: str = "" # "improved", "degraded", "inconclusive"
Evaluation Thresholds¶
| Condition | Result |
|---|---|
| Average sample > baseline + 5% | "improved" |
| Average sample < baseline - 5% | "degraded" |
| Otherwise | "inconclusive" |
Experiments that degrade are automatically rolled back by the improvement loop without human approval.
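A sketch of the evaluation rule, treating the 5% thresholds as multiplicative (consistent with `target_value = baseline * 1.05`); the real `evaluate()` also enforces `max_duration_days`, which is omitted here.

```python
def evaluate(baseline_value: float, samples: list[float], min_samples: int = 5) -> str:
    """Compare the running average of observed samples to the baseline."""
    if len(samples) < min_samples:
        return "inconclusive"                 # not enough data yet
    average = sum(samples) / len(samples)
    if average > baseline_value * 1.05:
        return "improved"
    if average < baseline_value * 0.95:
        return "degraded"
    return "inconclusive"

# Baseline first_pass_rate of 0.60; post-change observations trend upward.
print(evaluate(0.60, [0.70, 0.65, 0.72, 0.68, 0.66]))   # "improved"
```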
Storage¶
Experiments are stored as individual JSON files at:
.claude/team-context/improvements/experiments/<experiment_id>.json
Key Methods¶
| Method | Purpose |
|---|---|
| `create_experiment(recommendation, metric, ...)` | Create a new experiment |
| `record_sample(experiment_id, value)` | Add a new observation |
| `evaluate(experiment_id)` | Evaluate against the baseline |
| `conclude(experiment_id, result)` | Manually conclude an experiment |
| `mark_rolled_back(experiment_id)` | Mark an experiment as rolled back |
| `active()` | List all running experiments |
| `active_for_agent(agent_name)` | Running experiments for a specific agent |
3.7 Proposal Manager¶
Module: agent_baton/core/improve/proposals.py
Class: ProposalManager
The proposal manager persists the lifecycle of improvement recommendations as an append-only JSONL log.
Status Transitions¶
Recommendations follow the lifecycle recorded on the Recommendation model: proposed -> applied -> rolled_back, or proposed -> rejected.
Storage¶
Recommendations are stored at:
.claude/team-context/improvements/recommendations.jsonl
Key Methods¶
| Method | Purpose |
|---|---|
| `record(rec)` | Append a recommendation to the log |
| `record_many(recs)` | Append multiple recommendations |
| `update_status(rec_id, new_status)` | Transition a recommendation's status |
| `load_all()` | Load all recommendations from disk |
| `get(rec_id)` | Get a specific recommendation |
| `get_by_status(status)` | Filter by status |
| `get_applied()` | Get all applied recommendations |
| `get_proposed()` | Get all proposed (pending) recommendations |
3.8 Rollback Manager¶
Module: agent_baton/core/improve/rollback.py
Class: RollbackManager
The rollback manager restores agents to their pre-experiment state and implements a circuit breaker to prevent cascading auto-apply failures.
Rollback Behavior¶
- For `agent_prompt` changes: restores the latest backup via `AgentVersionControl`.
- For other categories (`budget_tier`, `routing`, `sequencing`): logs the `rollback_spec` for the caller to act on.
- Rollback is always automatic on experiment degradation -- no human approval needed.
Circuit Breaker¶
If 3 or more rollbacks occur within a 7-day window, the circuit breaker trips and all auto-apply is paused. This prevents the system from repeatedly applying and rolling back changes when there are systemic issues.
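A sketch of the breaker check, assuming the rollback timestamps are already parsed into datetimes; the real RollbackManager reads them from rollbacks.jsonl.

```python
from datetime import datetime, timedelta, timezone

def circuit_breaker_tripped(rollback_timestamps: list[datetime],
                            max_rollbacks: int = 3,
                            window_days: int = 7) -> bool:
    """True when 3+ rollbacks fall inside the trailing 7-day window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    recent = [ts for ts in rollback_timestamps if ts >= cutoff]
    return len(recent) >= max_rollbacks

now = datetime.now(timezone.utc)
history = [now - timedelta(days=1), now - timedelta(days=3), now - timedelta(days=5)]
print(circuit_breaker_tripped(history))   # True -- auto-apply should pause
```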
Storage¶
Rollback audit entries are stored at:
.claude/team-context/improvements/rollbacks.jsonl
Key Methods¶
| Method | Purpose |
|---|---|
| `rollback(recommendation, reason)` | Execute a rollback and log it |
| `circuit_breaker_tripped()` | Check whether 3+ rollbacks occurred in 7 days |
| `recent_rollbacks(days)` | Return rollbacks from the last N days |
| `load_all()` | Load all rollback entries |
3.9 Improve CLI Commands¶
| Command | Description |
|---|---|
| `baton scores` | Show all agent performance scorecards |
| `baton scores --agent NAME` | Show scorecard for a specific agent |
| `baton scores --trends` | Show performance trends for all agents |
| `baton scores --write` | Write the scorecard report to disk |
| `baton evolve` | Show the prompt evolution report |
| `baton evolve --agent NAME` | Show the proposal for a specific agent |
| `baton evolve --save` | Write proposals to disk |
| `baton evolve --write` | Write the summary report to disk |
| `baton patterns` | Show all learned patterns |
| `baton patterns --refresh` | Re-analyse the usage log and update patterns |
| `baton patterns --task-type TYPE` | Show patterns for a specific task type |
| `baton patterns --recommendations` | Show sequencing recommendations |
| `baton patterns --min-confidence N` | Filter by minimum confidence |
| `baton budget` | Show saved budget recommendations |
| `baton budget --recommend` | Analyse the usage log and display recommendations |
| `baton budget --save` | Save recommendations to disk |
| `baton budget --auto-apply` | Show only auto-applicable downgrades |
| `baton changelog` | Show agent changelog entries |
| `baton changelog --agent NAME` | Show history for a specific agent |
| `baton changelog --backups [NAME]` | List backup files |
| `baton improve` | Show the latest improvement report |
| `baton improve --run` | Run a full improvement cycle |
| `baton improve --force` | Force-run a cycle (skip the trigger check) |
| `baton improve --experiments` | Show active experiments |
| `baton improve --history` | Show all improvement reports |
| `baton anomalies` | Detect and display system anomalies |
| `baton anomalies --watch` | Show trigger readiness and anomaly status |
| `baton experiment list` | List all experiments |
| `baton experiment show ID` | Show experiment details |
| `baton experiment conclude ID --result RESULT` | Manually conclude an experiment |
| `baton experiment rollback ID` | Roll back an experiment |
4. The Learning Loop (End-to-End)¶
The complete learning loop flows through all three subsystems:
EXECUTION OBSERVE LEARN IMPROVE
--------- ------- ----- -------
Orchestrator TraceRecorder PatternLearner PerformanceScorer
dispatches agents ------> records events groups by task type scores each agent
| | |
v v v
UsageLogger computes confidence PromptEvolutionEngine
logs tokens, and success rate proposes changes
retries, gates | |
| v v
v BudgetTuner AgentVersionControl
Retrospective recommends tier backs up + tracks
Engine writes adjustments changes
what worked, | |
what didn't, v v
knowledge gaps Recommender ImprovementLoop
| unifies all classifies:
v recommendations auto-apply or escalate
ContextProfiler | |
measures read/ | v
write efficiency | ExperimentManager
| | tracks impact
v | |
AgentTelemetry | v
logs tool calls +-------------------> RollbackManager
auto-rollback if
degraded; circuit
breaker if 3+ in 7d
Step-by-Step Flow¶
1. Execution: The orchestrator dispatches agents to complete a task. Each dispatch, gate check, and completion is recorded.
2. Trace: `TraceRecorder` captures a timestamped event DAG for the entire task lifecycle.
3. Usage: `UsageLogger` appends a `TaskUsageRecord` with per-agent token counts, retry counts, gate results, and the final outcome.
4. Retrospective: `RetrospectiveEngine` generates a structured retrospective with what worked, what didn't, knowledge gaps, roster recommendations, and sequencing notes.
5. Context Profile: `ContextProfiler` analyses the trace to compute per-agent context efficiency (read/write ratio, redundancy).
6. Telemetry: `AgentTelemetry` captures fine-grained tool-call events.
7. Pattern Learning: `PatternLearner` reads the usage log, discovers recurring success patterns, and persists them. The planner uses these patterns when creating future plans.
8. Budget Tuning: `BudgetTuner` analyses token usage distributions and recommends tier adjustments.
9. Scoring: `PerformanceScorer` computes per-agent scorecards combining quantitative metrics with qualitative retrospective signals.
10. Evolution: `PromptEvolutionEngine` identifies underperforming agents and proposes specific prompt improvements.
11. Recommendation: `Recommender` unifies all recommendations, deduplicates them, and ranks by confidence and risk.
12. Improvement Loop: `ImprovementLoop.run_cycle()` classifies each recommendation as auto-apply or escalate, creates experiments for auto-applied changes, and evaluates running experiments.
13. Rollback: If an experiment shows degradation (>5% loss from baseline), `RollbackManager` automatically restores the previous state. If 3+ rollbacks occur in 7 days, the circuit breaker trips and all auto-apply is paused.
14. Better Next Execution: The planner queries learned patterns, budget recommendations, knowledge gaps, and retrospective feedback when creating the next plan, completing the loop.
5. Data Models Summary¶
Observe Models¶
| Model | Module | Purpose |
|---|---|---|
| `TraceEvent` | `models/trace.py` | Single timestamped execution event |
| `TaskTrace` | `models/trace.py` | Complete trace for an orchestrated task |
| `AgentUsageRecord` | `models/usage.py` | Single agent's usage within a task |
| `TaskUsageRecord` | `models/usage.py` | Full task usage record |
| `TelemetryEvent` | `core/observe/telemetry.py` | Single agent tool-call event |
| `AgentContextProfile` | `models/context_profile.py` | Per-agent context efficiency |
| `TaskContextProfile` | `models/context_profile.py` | Task-level context efficiency |
| `Retrospective` | `models/retrospective.py` | Structured post-task analysis |
| `AgentOutcome` | `models/retrospective.py` | Per-agent result in a retrospective |
| `KnowledgeGap` | `models/retrospective.py` | Knowledge gap (legacy schema) |
| `RosterRecommendation` | `models/retrospective.py` | Agent roster change suggestion |
| `SequencingNote` | `models/retrospective.py` | Phase-level sequencing observation |
| `RetrospectiveFeedback` | `models/feedback.py` | Aggregated actionable feedback for the planner |
Learn Models¶
| Model | Module | Purpose |
|---|---|---|
| `LearnedPattern` | `models/pattern.py` | Recurring orchestration pattern |
| `BudgetRecommendation` | `models/budget.py` | Budget tier adjustment recommendation |
Improve Models¶
| Model | Module | Purpose |
|---|---|---|
| `AgentScorecard` | `core/improve/scoring.py` | Per-agent performance scorecard |
| `EvolutionProposal` | `core/improve/evolution.py` | Proposed prompt change |
| `ChangelogEntry` | `core/improve/vcs.py` | Agent change audit record |
| `Recommendation` | `models/improvement.py` | Unified improvement recommendation |
| `Experiment` | `models/improvement.py` | Tracks impact of an applied recommendation |
| `Anomaly` | `models/improvement.py` | Detected system anomaly |
| `TriggerConfig` | `models/improvement.py` | When to run improvement analysis |
| `ImprovementReport` | `models/improvement.py` | Summary of an improvement cycle |
| `ImprovementConfig` | `models/improvement.py` | Top-level improvement loop config |
| `RollbackEntry` | `core/improve/rollback.py` | Rollback audit record |
6. Configuration¶
TriggerConfig (when to analyze)¶
| Parameter | Default | Description |
|---|---|---|
| `min_tasks_before_analysis` | 10 | Minimum total tasks before the first analysis |
| `analysis_interval_tasks` | 5 | New tasks needed to trigger re-analysis |
| `agent_failure_threshold` | 0.3 | Failure rate above which an anomaly is flagged |
| `gate_failure_threshold` | 0.2 | Gate failure rate above which an anomaly is flagged |
| `budget_deviation_threshold` | 0.5 | Token deviation above which a budget overrun is flagged |
| `confidence_threshold` | 0.7 | Minimum confidence for recommendations |
ImprovementConfig (how to apply)¶
| Parameter | Default | Description |
|---|---|---|
| `auto_apply_threshold` | 0.8 | Minimum confidence for auto-application |
| `paused` | `False` | Manual pause switch for auto-apply |
PatternLearner Thresholds¶
| Parameter | Default | Description |
|---|---|---|
| `min_sample_size` | 5 | Minimum records per group to generate a pattern |
| `min_confidence` | 0.7 | Minimum confidence to include a pattern |
| Confidence calibration | 15 | Sample-size divisor in the confidence formula |
BudgetTuner Thresholds¶
| Parameter | Default | Description |
|---|---|---|
| Minimum sample size | 3 | Records per group before generating recommendations |
| Upgrade threshold | 80% | Median > 80% of tier ceiling triggers upgrade |
| Auto-apply confidence | 0.8 | Minimum confidence for auto-apply (downgrades only) |
ExperimentManager Constraints¶
| Parameter | Default | Description |
|---|---|---|
| Max active per agent | 2 | Maximum concurrent experiments per agent |
| Min samples | 5 | Minimum observations before evaluation |
| Max duration | 14 days | Maximum experiment runtime |
| Improvement threshold | +5% | Gain needed to declare "improved" |
| Degradation threshold | -5% | Loss needed to declare "degraded" |
Circuit Breaker (RollbackManager)¶
| Parameter | Default | Description |
|---|---|---|
| Rollback count | 3 | Rollbacks within the window that trip the breaker |
| Window | 7 days | Time window for counting rollbacks |
Data Archiver Retention¶
| Parameter | Default | Description |
|---|---|---|
| Retention period | 90 days | Files older than this are eligible for cleanup |
| JSONL rotation | 10,000 lines | Maximum lines kept in rotated JSONL files |
7. File Layout Summary¶
.claude/team-context/
traces/ # TraceRecorder output
<task_id>.json
usage-log.jsonl # UsageLogger output
usage-dashboard.md # DashboardGenerator output
telemetry.jsonl # AgentTelemetry output
retrospectives/ # RetrospectiveEngine output
<task_id>.md # human-readable
<task_id>.json # machine-readable sidecar
context-profiles/ # ContextProfiler output
<task_id>.json
learned-patterns.json # PatternLearner output
budget-recommendations.json # BudgetTuner output
agent-scorecards.md # PerformanceScorer output
evolution-report.md # PromptEvolutionEngine output
evolution-proposals/ # PromptEvolutionEngine proposals
<agent_name>.md
improvement-trigger-state.json # TriggerEvaluator watermark
improvements/ # ImprovementLoop output
recommendations.jsonl # ProposalManager log
rollbacks.jsonl # RollbackManager log
reports/ # ImprovementLoop reports
<report_id>.json
experiments/ # ExperimentManager data
<experiment_id>.json
agents/
.backups/ # AgentVersionControl backups
<agent>.<timestamp>.md
changelog.md # AgentVersionControl changelog
8. Practical Examples¶
View the current state of your system¶
# How are my agents performing?
baton scores
# Any anomalies?
baton anomalies
# What happened in the last execution?
baton trace --last
# How efficient was the context usage?
baton context-profile --generate my-task-id
baton context-profile my-task-id
Run the learning pipeline¶
# Discover patterns from execution history
baton patterns --refresh
# Check if budget tiers need adjustment
baton budget --recommend
# See what the system would auto-apply vs. escalate
baton improve --force
Investigate a specific agent¶
# Full scorecard
baton scores --agent architect
# Usage stats
baton usage --agent architect
# Performance trend
baton scores --trends
# Context efficiency
baton context-profile --agent architect
# Any evolution proposals?
baton evolve --agent architect
# Changelog history
baton changelog --agent architect
Review and manage experiments¶
# What experiments are running?
baton experiment list
# Check details
baton experiment show exp-abc12345
# Manually conclude if you have enough signal
baton experiment conclude exp-abc12345 --result improved
# Roll back if something went wrong
baton experiment rollback exp-abc12345
Clean up old data¶
# Preview what would be removed without deleting anything
baton cleanup --retention-days 60 --dry-run
# Actually clean up files older than 60 days
baton cleanup --retention-days 60