Architecture Decision Records¶
These records document the rationale behind significant structural and
architectural decisions in the agent_baton package. ADRs 01-10 originate
from the 2026-03-23 re-architecture. Later ADRs document decisions made as
the system matured. Future contributors should consult this document before
proposing changes that would reverse these decisions.
ADR-01: Remove the Dual Plan Model Hierarchy¶
Decision: Delete ExecutionPlan, Phase, AgentAssignment, QAGate,
and PlanBuilder; make MachinePlan the sole plan type.
Context: Two parallel plan hierarchies existed -- models/plan.py
(ExecutionPlan) from Epic 1 and models/execution.py (MachinePlan) from
Epic 2. The engine, runtime, and CLI all used MachinePlan exclusively.
ExecutionPlan was constructed only by PlanBuilder and consumed only by
ContextManager.write_plan(), which called plan.to_markdown() -- a method
MachinePlan already implements. The dual hierarchy caused contributor
confusion because the structurally similar names (Phase vs PlanPhase,
ExecutionPlan vs MachinePlan) gave no signal about which was active.
MissionLogEntry from models/plan.py was retained and moved to a dedicated
models/mission_log.py.
Status: Implemented (2026-03-23)
ADR-02: Remove the Backward-Compatibility Shim Layer¶
Decision: Delete all 21 core/*.py shim files and migrate every consumer
to canonical sub-package import paths.
Context: When Epic 2 reorganized core/ from a flat layout into
sub-packages, 21 single-line re-export shims were left at core/*.py so
existing imports would not break. All CLI commands and pre-reorganization tests
imported from these paths. Internal core code migrated to canonical paths
immediately. The shims imposed a cognitive cost: browsing agent_baton/core/
showed 22 flat files before the actual sub-packages, every class had two valid
import paths, and core/__init__.py re-exported 48 symbols with no documented
rationale. The migration was entirely mechanical -- find-and-replace of import
strings -- with Python's import errors providing an immediate safety net.
Status: Implemented (2026-03-23)
ADR-03: Formalize the Worker-Engine Contract as a Protocol¶
Decision: Introduce ExecutionDriver in core/engine/protocols.py as a
typing.Protocol (runtime-checkable) that specifies the interface TaskWorker
and WorkerSupervisor use when calling the engine.
Context: TaskWorker called ExecutionEngine through eight distinct
methods (start, next_action, next_actions, mark_dispatched,
record_step_result, record_gate_result, complete, status). This was the
most critical runtime contract in the system, yet it was not declared anywhere.
There was no way to inject a mock engine into tests without subclassing
ExecutionEngine or monkey-patching. The Protocol approach requires no changes
to ExecutionEngine (structural typing satisfies it automatically) and allows
alternative engine implementations in tests and future integrations.
Status: Implemented (2026-03-23)
ADR-04: Assign Exclusive Event Ownership to Prevent Duplicate Publishing¶
Decision: ExecutionEngine owns task.*, phase.*, and gate.* events.
TaskWorker owns step.* events. Neither publishes the other's topics.
Context: Both ExecutionEngine.record_step_result() and
TaskWorker._execution_loop() published step.dispatched, step.completed,
and step.failed to the same EventBus. EventPersistence, auto-wired as a
subscriber in the engine, wrote every event to a JSONL file. Each step
completion produced two identical JSONL records, causing projections (e.g.,
project_task_view) to silently double-count step activity. The fix assigns
authoritative ownership: the engine publishes lifecycle transitions it knows
about (task boundaries, phase transitions, gate results); the worker publishes
the step-level events it witnesses directly (dispatch, completion, failure).
Status: Implemented (2026-03-23)
ADR-05: Extract StatePersistence from ExecutionEngine¶
Decision: Extract state I/O from executor.py into a StatePersistence
class in core/engine/persistence.py. ExecutionEngine accepts an optional
StatePersistence instance.
Context: At 878 lines, executor.py mixed four concerns: state machine
logic (~400 LOC), state persistence (~150 LOC), observability wiring (~150
LOC), and utilities (~180 LOC). The state machine could not be unit-tested
without real filesystem paths, because _save_state and _load_state were
private methods on the same class. Extracting StatePersistence as an
injectable dependency lets tests construct an engine with an in-memory state
provider, testing _determine_action without disk I/O.
Status: Implemented (2026-03-23)
ADR-06: Document Core vs Peripheral Layering Explicitly¶
Decision: Reduce core/__init__.py to 3 canonical re-exports
(AgentRegistry, AgentRouter, ContextManager) and update
agent_baton/__init__.py to expose the primary execution surface directly from
canonical paths.
Context: All 9 sub-packages sat at the same directory level with no
indication of which formed the primary execution path. agent_baton/__init__.py
exposed only PlanBuilder and ContextManager -- not ExecutionEngine or
TaskWorker -- inverting the package's actual usage pattern. The package-level
exports now reflect what users actually need: ExecutionEngine, TaskWorker,
MachinePlan, AgentLauncher, and the orchestration types. The dependency
hierarchy (models -> events/observe/govern -> engine -> runtime -> CLI) is
documented in core/__init__.py.
Status: Implemented (2026-03-23)
ADR-07: Gate Experimental distribute Modules Behind a Subdirectory¶
Decision: Move core/distribute/async_dispatch.py, incident.py, and
transfer.py into core/distribute/experimental/. Retain packager.py and
registry_client.py at the top level as production modules.
Context: Three modules in core/distribute/ were scaffolding-level
implementations not exercised in production execution paths. They coexisted
with production-ready packager.py and registry_client.py at the same level.
Contributors could not distinguish production from experimental modules without
reading each file. The experimental/ subdirectory is a clear, searchable
signal. Any future caller that imports from core.distribute.experimental
knows explicitly what it is getting.
Status: Implemented (2026-03-23)
ADR-08: Group CLI Commands into Domain Subdirectories¶
Decision: Reorganize the 35 flat cli/commands/*.py files into 6
subdirectory groups: execution/, observe/, govern/, improve/,
distribute/, agents/. Update cli/main.py auto-discovery to scan one level
of subdirectories.
Context: A flat directory of 35 files with no organizational structure made
it difficult to locate related commands or understand which domain a new command
should live in. The cli/main.py auto-discovery pattern
(pkgutil.iter_modules) already registered commands by their internal
register() call, not by filename -- so moving files into subdirectories
preserves all registered subcommand strings. The extended discovery scans both
the flat commands/ directory and any non-underscore subdirectory one level
deep.
Status: Implemented (2026-03-23)
ADR-09: Normalize Enum Usage to Typed Instances¶
Decision: Store ActionType, StepStatus, and PhaseStatus as typed
enum instances in dataclass fields. Serialize to .value strings only in
to_dict() methods.
Context: ActionType, StepStatus, and PhaseStatus were declared as
Enum classes in models/execution.py but ExecutionAction.action_type was
typed as str and populated with .value strings throughout. Comparisons used
the .value form (action.action_type == ActionType.DISPATCH.value), which
IDE type-checkers could not validate. The inconsistency with models/enums.py
(which used typed enum instances internally) made the codebase harder to
navigate. The normalized pattern stores enum instances in fields and reserves
string conversion for the serialization boundary, matching the pattern
established in models/enums.py.
Status: Implemented (2026-03-23)
ADR-10: Replace Implicit EventBus Auto-Wiring with an Explicit Factory¶
Decision: Introduce ExecutionContext in core/runtime/context.py as a
factory that explicitly wires EventBus and EventPersistence together.
ExecutionEngine accepts the context object as an alternative to a bare bus
parameter.
Context: When an EventBus was passed to ExecutionEngine.__init__, the
constructor silently subscribed an EventPersistence instance to it. There was
no way to pass a bus without also triggering persistence, and constructing
engine + bus + persistence separately risked event routing silently missing the
persistence subscriber. The ExecutionContext.build(events_dir, persist=True)
classmethod makes the wiring explicit and testable: pass persist=False for
tests, persist=True for production. Callers using the old bus= parameter
continue to work unchanged.
Status: Implemented (2026-03-23)
ADR-11: Knowledge Delivery via Layered Pipeline (Registry -> Resolver -> Dispatcher)¶
Decision: Implement knowledge delivery as a three-component pipeline:
KnowledgeRegistry (in core/orchestration/) for loading and indexing
packs, KnowledgeResolver (in core/engine/) for matching and budgeting,
and PromptDispatcher injection for prompt assembly. Runtime gaps are
handled by core/engine/knowledge_gap.py with a self-interrupt/re-dispatch
protocol. Feedback flows through the existing core/observe/retrospective.py
and core/learn/pattern_learner.py subsystems.
Context: Knowledge packs existed on disk but were never consumed by the execution engine. Agents received generic shared context with no targeted domain knowledge, wasting the value of specialist agents. Additionally, agents had no mechanism to recognize or signal knowledge gaps -- they guessed rather than requesting help.
Alternatives considered:
-
Unified knowledge service: A single class handling loading, matching, and delivery. Rejected because it conflates three different concerns (disk I/O and indexing, planning-time resolution, and prompt assembly) that operate at different lifecycle stages and have different testing requirements.
-
Event-driven delivery: Agents request knowledge via events during execution rather than receiving it at dispatch time. Rejected because it adds asynchronous complexity without benefit -- the agent's knowledge needs are largely predictable from the task description, and the plan review gate lets users correct mistakes before execution starts.
-
Global knowledge injection: Attach all relevant packs to every step. Rejected because it causes context rot -- agents receive irrelevant information that dilutes their focus and wastes context window budget.
Key trade-offs:
-
Hybrid matching (tags + TF-IDF): Strict tag/keyword matching is used first because it is deterministic and auditable. TF-IDF relevance fallback activates only when strict matching returns nothing. This preserves predictability for well-tagged packs while remaining useful for projects without curated metadata. An MCP RAG server replaces TF-IDF when available.
-
Runtime self-interruption: Agents self-interrupt via a
KNOWLEDGE_GAPsignal rather than the executor polling for gaps between steps. This fits the stateless agent model -- agents terminate cleanly, the executor handles escalation, and re-dispatch is a standard plan amendment. The alternative (streaming gap detection mid-execution) would require stateful agent connections that the current architecture does not support. -
Intervention levels: A plan-level
--intervention low|medium|highflag shifts the escalation matrix thresholds.low(default) maximizes agent autonomy;highescalates on any unresolved gap. This lets users tune the human-in-the-loop ratio per task without changing agent definitions or system configuration. -
Feedback via existing subsystems:
KnowledgeGapRecordentries are written into retrospective JSON files thatPatternLearneralready reads. No new storage mechanism. This means the feedback loop activates automatically once the retrospective and learn subsystems are in use. -
KnowledgeGapmodel replacement:models/retrospective.pyhad aKnowledgeGapdataclass. The newKnowledgeGapRecordis a strict superset (adds resolution tracking, gap typing, and task context). The old model is replaced rather than extended to avoid a permanent dual-model situation analogous to ADR-01. Old retrospective JSON files are handled byfrom_dict()defaulting the new fields.
Status: Implemented (2026-03-24)
ADR-12: Federated Sync as a Central Read Replica (Not Central-Write-Through)¶
Decision: Per-project baton.db files remain the sole write target for
execution. A central database at ~/.baton/central.db is a read replica
populated exclusively by a one-way incremental sync mechanism (SyncEngine).
The existing pmo.db is merged into central.db rather than kept separate.
Sync uses row-level watermarks rather than file-level copy. External source
data (ADO, Jira, GitHub) is ingested directly into central.db via the
ExternalSourceAdapter protocol.
Context: Agent Baton can be used across multiple projects simultaneously.
The PMO scanner, performance scoring, and knowledge gap feedback loops all
benefit from cross-project data, but there was no aggregation layer. Each
project wrote to its own baton.db in .claude/team-context/, and the PMO
scanned individual project directories at query time. With N projects this
became O(N) filesystem reads per PMO status request.
Alternatives considered:
-
Central-write-through (all projects write directly to
central.db): Rejected because it creates a single point of failure -- ifcentral.dbis corrupted or on a slow filesystem, every execution in every project blocks. It also causes SQLite write-lock contention when multiple Claude sessions run simultaneously across projects. The replica approach means projects always work offline;central.dbis rebuildable by re-runningbaton sync --all. -
SQLite ATTACH for cross-project queries (attach all project
baton.dbfiles at query time): Rejected because SQLite limits ATTACH to 10 databases by default, requires all project files to be accessible at the same time (fails for remote/archived projects), and provides no place for external source mappings or PMO tables that span projects. -
Separate central database per concern (keep
pmo.dbseparate, add a third file for external sources): Rejected because it multiplies the number of files to manage, introduces cross-file joins that SQLite handles poorly, and gives the PMO scanner N+2 databases to query instead of one. The merged schema has a slightly larger DDL, which is an acceptable cost. -
File-level copy (copy the entire
baton.dbtocentral.dbon sync): Rejected because it is O(total rows), not incremental. With large project histories, every sync re-copies the entire database. Row-level sync with watermarks is O(delta rows since last sync), idempotent on retry, and naturally deduplicates on concurrent syncs. -
Event-sourcing sync (replay the
eventstable to build central projections): Rejected because not all data is event-sourced -- telemetry, retrospectives, learned patterns, and budget recommendations are direct writes that have no corresponding events. Row-level sync covers all tables uniformly with the same algorithm.
Key trade-offs:
-
Row-level vs. file-level sync: Row-level sync with watermarks is O(delta), idempotent, and handles concurrent syncs gracefully via SQLite's WAL mode + busy_timeout. File-level copy is simpler to implement but O(n) on total data and requires merge logic when two sessions race.
-
Merging pmo.db into central.db: The PMO tables (
projects,programs,signals,archived_cards,forge_sessions,pmo_metrics) are already global -- they describe all projects, not any one. Keeping them in a separate file requires cross-file joins for every PMO query. Merging them eliminates the join and reduces the filesystem footprint to one central file. The cost is a one-time migration and a slightly richer schema. -
Auto-sync on
baton execute complete: Sync fires automatically after every execution completes, keepingcentral.dbfresh without manual intervention. The hook is wrapped in a best-efforttry/exceptso sync failure never blocks execution completion. If auto-sync is too slow (threshold: 2s), it logs a warning and returns; the user can runbaton syncmanually.
ADR-13: PAIR_DISPATCH — Sequential Two-Agent Adversarial Assessment¶
Decision: Add a PAIR_DISPATCH action type that runs two agents
sequentially on the same step, where the second agent receives the first
agent's output as input context for adversarial review or validation.
Status: Proposed (2026-04-03) — not yet implemented.
Context: During a multi-domain audit, an agent needed to pair a visualization-expert with a subject-matter-expert per domain — one to assess dashboard quality, the other to validate domain accuracy. Claude Code dispatches single agents, and TEAM_DISPATCH (which runs members in parallel) does not support one agent reviewing another's output. The workaround was a single agent wearing both hats, which produced consensus rather than the desired creative tension.
Proposed design:
- Action type:
PAIR_DISPATCHemitted bybaton execute nextwhen a step has exactly two team members and the step'smodefield is"pair". - Wire format (extends
_print_actionprotocol):ACTION: PAIR_DISPATCH Primary Agent: <agent_a> Review Agent: <agent_b> Step: <step_id> Mode: adversarial | validating | complementary Message: <description> --- Primary Prompt --- <prompt for agent_a> --- End Primary Prompt --- --- Review Prompt --- <prompt for agent_b, includes placeholder {primary_output}> --- End Review Prompt --- - Orchestrator behavior: The orchestrator spawns agent A, captures its
full output, then spawns agent B with the review prompt where
{primary_output}is replaced by agent A's actual output. Agent B's prompt includes the instruction: "Review and challenge the following assessment from {agent_a}. Identify disagreements, gaps, and risks." - Recording: The orchestrator calls
baton execute team-recordtwice (once per member). The step is complete when both members finish. - Planner integration:
_consolidate_team_stepgains amodeparameter. When the task description contains adversarial/validation signals ("review", "validate", "challenge", "audit", "cross-check"), the mode is set to"pair"and the executor emitsPAIR_DISPATCHinstead of parallelDISPATCHactions.
Alternatives considered:
-
Sequential DISPATCH with manual context threading: The orchestrator already captures agent output for
--outcome. The second agent could be dispatched with the first agent's outcome pasted into its prompt. This works today with no engine changes but requires the orchestrator agent to implement the threading logic, which is fragile and not discoverable. PAIR_DISPATCH makes the pattern explicit and engine-driven. -
Parallel DISPATCH with shared artifact: Both agents run simultaneously and write to a shared file. Rejected because the review agent needs the primary agent's complete output before it can assess — parallel execution defeats the purpose of adversarial review.
-
Three-agent pattern (primary + reviewer + synthesizer): Adds a third agent to reconcile disagreements. Deferred as over-engineering for the initial implementation; can be added later as
PANEL_DISPATCH.
Key trade-offs:
-
New action type vs. reusing DISPATCH: Adding a new action type is a breaking change to the
_print_actionprotocol (seedocs/invariants.md). All orchestrator agent definitions must be updated to handle it. The benefit is that the pairing semantics are explicit in the wire format rather than buried in prompt engineering. -
Engine-driven vs. orchestrator-driven: Making the engine emit PAIR_DISPATCH centralizes the pattern and makes it available to headless execution (
baton execute run). The alternative — teaching the orchestrator agent to manually chain two dispatches — only works in interactive Claude Code sessions.
Implementation order:
- Add
modefield toTeamMemberandPlanStepmodels. - Add
PAIR_DISPATCHtoActionTypeenum inmodels/execution.py. - Implement
_pair_dispatch_actioninexecutor.py. - Update
_print_actioninexecute.pywith the new wire format. - Update
_consolidate_team_stepinplanner.pyto set mode. - Update orchestrator agent definition to handle
PAIR_DISPATCH. - Update
docs/invariants.mdwith the new protocol contract. -
Add integration tests covering the full pair-dispatch loop.
-
ExternalSourceAdapter as a Protocol: External integrations (ADO, Jira, GitHub, Linear) have heterogeneous APIs but a uniform normalized output (
ExternalItem). Thetyping.Protocolapproach means new adapters can be added without modifying any central code -- they self-register viaAdapterRegistry.register()on import. This also allows third-party adapters without subclassing. -
PAT not stored in DB: The ADO adapter reads the Personal Access Token from an environment variable whose name is stored in the
configJSON column ofexternal_sources. This means PAT rotation only requires updating the environment variable -- no database writes, no migration.
Status: Implemented (2026-03-24)
ADR-13: Native SQLite Bead Memory (Not Beads Go CLI Backend)¶
Decision: Implement structured agent memory as native SQLite tables in
baton.db (schema v4), adopting the concepts from Steve Yegge's Beads agent
memory system (beads-ai/beads-cli) but not the Go binary, Dolt backend, or
.beads/ file format. Agents emit inline signals (BEAD_DISCOVERY,
BEAD_DECISION, BEAD_WARNING) that are parsed and persisted by the
executor, following the same pattern as the existing knowledge gap protocol.
Context: Beads (18.7k GitHub stars) introduced a compelling model for
agent memory -- hash-based IDs, typed dependency graphs, memory decay, and
a bd ready command for surfacing unblocked work items. Gastown (built on
Beads for 20-160+ concurrent agents) demonstrated that shared context reduces
token burn 60-80% and that design decisions become the execution bottleneck.
However, Beads requires a Go binary (breaks pip install), uses Dolt for
storage (heavy dependency, 3 storage migrations in 4 months), and solves
concurrent-write contention that Agent Baton doesn't have (serialized executor).
Alternatives considered:
-
Adopt Beads Go CLI as runtime dependency: Rejected. Requires Go toolchain or prebuilt binaries, breaks the "pip install and go" developer experience. Beads had 3 storage backend migrations in 4 months (flatfile -> sqlite -> Dolt), indicating an unstable storage layer.
-
Adopt Dolt as storage backend: Rejected. Dolt adds a 200MB+ binary dependency, requires a running server process, and is designed for concurrent multi-writer scenarios. Agent Baton's serialized executor has no concurrent-write problem -- SQLite WAL mode is sufficient.
-
Import Beads as a Python library (wrap the Go CLI): Rejected. FFI wrappers add brittleness, version coupling, and complicate debugging. The signal parsing and storage patterns are simple enough to implement natively in ~1000 lines of Python.
-
Use
.beads/file format for interoperability: Deferred as a pull-forward feature. TheExternalSourceAdapterprotocol can bridge to.beads/directories when/if interop is needed, without changing the core engine.
Key trade-offs:
-
Native SQLite vs. Beads backend: SQLite is already the storage layer for all other engine data. Using it for beads means one database, one sync pipeline, one backup story. The cost is implementing bead CRUD from scratch (~435 LOC for
BeadStore), which is modest. -
Signal protocol vs. structured API: Agents emit signals as free-text markers (like
KNOWLEDGE_GAP), not via a structured API. This works because agents write to stdout -- there is no function-call interface. Regex parsing is fragile in theory but robust in practice (the knowledge gap protocol has been running reliably since v2). -
Schema v4 migration shared across DB types: The
MIGRATIONSdict is applied to both project and central databases. FK constraints that reference single-column PKs must be omitted from migrations because central tables use composite PKs. Fresh project DBs get FKs fromPROJECT_SCHEMA_DDLdirectly. -
Pull-forward hedge: The
ExternalSourceAdapterprotocol andStorageBackendprotocol preserve the option to integrate with the Beads ecosystem later without changing the core engine. If Beads stabilizes its storage layer, aBeadsAdaptercould bridge the gap. -
Tiers 2-4 extensions: Tiers 2-4 added
BeadSelector(forward relay into delegation prompts, ranked by dependency chain > same-phase > cross-phase),BeadAnalyzer(warning frequency, file clustering, decision reversal detection for plan enrichment), memory decay (auto-archive old closed beads), knowledge gap auto-resolution from high-confidence beads, bead-to-knowledge promotion (baton beads promote), cross-project analytics views incentral.db, conflict detection oncontradicts/supersedeslinks, and quality scoring withBEAD_FEEDBACKsignals (schema v6 addsquality_scoreandretrieval_countcolumns).
Status: Implemented -- all four tiers (2026-04-13)
ADR-14: Learning Automation System (Hybrid Ledger + Improvement Pipeline)¶
Decision: Build a closed-loop learning system using a new
LearningLedger (SQLite-backed issue tracker) that feeds into the
existing improvement pipeline (proposals, experiments, rollback).
Context: The system collected rich execution data (retrospectives, scores, patterns, recommendations) but had three gaps: no structured issue tracking, no systematized workflow, and no auto-application path. Routing mismatches, agent degradations, and knowledge gaps required manual discovery across scattered files.
Alternatives considered:
- (A) Ledger-only: New
LearningLedgeras single source of truth, replacing the improvement pipeline. Rejected -- duplicates existing safety mechanisms (experiments, rollback, circuit breaker). - (B) Pipeline extension: Extend
ImprovementLoop->Recommender->ProposalManagerwithout a new storage layer. Rejected -- proposals lack lifecycle tracking, evidence accumulation, and cross-issue correlation. - (C) Hybrid (chosen): New
LearningLedgerfor structured issue tracking and evidence accumulation, feedingRecommendationobjects into the existing proposal/experiment/rollback machinery.
Key trade-offs:
- Auto-application via overrides file: Corrections are persisted to
learned-overrides.jsonrather than modifying Python source. This is reversible (delete the file), portable (per-project), and doesn't require code changes. Consumed byAgentRouter.route()andIntelligentPlannerat call time. - Graduated auto-apply thresholds: Each issue type has its own
occurrence threshold (routing=3, degradation=5, gate=2). Interview-
only types (
pattern_drift,prompt_evolution) never auto-apply. - Federation:
learning_issuestable mirrors tocentral.dbviaSyncEngine, enabling cross-project pattern detection. - Structured interviews:
LearningInterviewerpresents issues as multiple-choice dialogues, recording decisions back to the ledger. This replaces ad-hoc retrospective reading with a directed workflow.
Status: Implemented (2026-04-13)
ADR-15: SQLite as Primary Store (Replacing JSON Flat Files)¶
Decision: Make SQLite (baton.db) the primary persistence layer for all
execution data, replacing the original JSON/JSONL flat-file storage. Implement
a StorageBackend protocol with two implementations: SqliteStorage (default
for all new projects) and FileStorage (retained for backward compatibility).
Auto-detection at startup selects the backend based on filesystem state.
Context: The original storage was a collection of JSON and JSONL files --
execution-state.json, usage.jsonl, retrospectives.jsonl,
learned-patterns.json, and others -- each managed by the component that owned
it. This approach worked for single-session development but became problematic
at scale: (1) concurrent reads during CLI queries and daemon execution caused
partial-read corruption, (2) JSONL append-only files grew without bound and
required full sequential scans for any query, (3) cross-entity queries (e.g.,
"show all step results for task X with their gate outcomes") required loading
and joining multiple files in Python, and (4) the flat-file approach consumed
context window budget when Claude sessions tried to read execution state.
Alternatives considered:
-
Keep JSON files with file-level locking: Rejected. File locking in Python (
fcntl.flock) is not portable to Windows, does not protect against partial writes on crash, and still requires full-file reads for any query. -
Use a document database (TinyDB, MongoDB): Rejected. TinyDB uses JSON files internally (same problems) and MongoDB adds a server process dependency that contradicts the "pip install and go" philosophy.
-
PostgreSQL / MySQL: Rejected. Requires a running server, configuration, connection strings. SQLite is zero-config, file-based, and embeds in the Python process.
Key trade-offs:
-
Transactional writes:
SqliteStoragewraps every public write method in an implicit transaction (with conn:). Multi-table upserts (e.g., saving a plan with its phases, steps, and team members) are atomic -- partial writes from crashes are eliminated. -
DELETE-then-INSERT for child collections: Step results, gate results, and retrospective outcomes use DELETE-then-INSERT rather than UPDATE to avoid stale rows when list items are removed. This is a deliberate trade-off: slightly more I/O per write, but simpler code with no stale data.
-
WAL mode for concurrent access: All connections use WAL journal mode so that CLI queries (
baton query) do not block the execution engine.ConnectionManageruses one connection per thread viathreading.localstorage. -
Dual-write transition period: During migration, the engine writes to both SQLite and JSON files so that older CLI versions can still read state. The dual-write is marked
TODO:T4for removal once the flat-file code paths are fully deprecated. -
Auto-detection:
detect_backend()incore/storage/__init__.pychecks forbaton.db(SQLite), thenexecution-state.json(file), defaulting to SQLite for new projects. This means zero configuration for users -- existing projects keep working, new projects get SQLite automatically.
Status: Implemented (2026-03-26). JSON flat-file dual-writes remain as transitional. Full deprecation pending (tracked as T4).
ADR-16: Synchronous In-Process Event Bus (Not Message Queue)¶
Decision: Implement the event system as a synchronous, in-process pub/sub
bus (EventBus in core/events/bus.py) with fnmatch-style glob topic
routing. Handlers execute inline during publish(). Persistence is handled
by an EventPersistence subscriber that writes append-only JSONL. No threads,
no queues, no external message broker.
Context: The execution engine needed an event backbone for four purposes: (1) decoupling the engine from observability (traces, usage, dashboards), (2) enabling webhook delivery to external systems, (3) supporting SSE streaming to the PMO UI, and (4) providing crash-recovery replay via persisted event logs. A traditional message queue (Redis, RabbitMQ, Kafka) would satisfy all four, but Agent Baton runs in a single Python process per project with no server infrastructure requirement.
Alternatives considered:
-
Redis pub/sub or RabbitMQ: Rejected. Adds an external server dependency, configuration, and failure modes. Agent Baton's deployment model is "install a pip package and run CLI commands" -- requiring a message broker would be a non-starter for most users.
-
asyncio.Queue-based bus: Rejected for the core bus because the engine is synchronous. An async bus would require
await publish()in the engine's hot path, threading async through the entire state machine. The SSE endpoint bridges the sync bus to async via anasyncio.Queueadapter at the API boundary -- this isolates the complexity to one adapter. -
Threading + queue: Rejected. Concurrent handler execution introduces non-determinism and makes debugging harder. The execution engine is deterministic by design -- events should be too.
Key trade-offs:
-
Synchronous dispatch: Handlers run inline during
publish(). If a handler raises, the exception propagates to the publisher. This keeps execution deterministic and debuggable, but means a slow handler blocks the engine. In practice, handlers are fast (write a JSONL line, enqueue a webhook delivery, append to in-memory history). -
Monotonic sequencing: Each task gets an auto-incrementing sequence counter. Events arrive with
sequence == 0and the bus assigns the next number, providing a total order per task without external coordination. -
In-memory history + file persistence: The bus retains all published events in memory for replay queries.
EventPersistencesubscribes to*and writes each event to a JSONL file. This gives both fast in-memory access for projections and durable storage for post-hoc analysis. -
Projections as fold functions:
project_task_view()folds an event stream into materializedTaskView/PhaseView/StepViewdataclasses. This is a pure function over the event log -- no state stored in a database, no ORM, just a sequential scan of events. Simple to test, simple to debug.
Status: Implemented (2026-03-23, with ADR-04 and ADR-10 refining ownership and wiring)
ADR-17: FastAPI Service Layer as Optional API (Not Required Infrastructure)¶
Decision: Add a FastAPI HTTP API (agent_baton/api/) as an optional
[api] dependency group. The API wraps the same core classes the CLI uses
(via api/deps.py singleton injection) and adds SSE event streaming, webhook
delivery with HMAC signing, and REST endpoints for plans, executions, agents,
decisions, and the PMO board. The CLI remains the primary interface and works
without the API installed.
Context: The PMO UI needed a backend, daemon mode needed a way to expose decision requests to external operators, and external tools (CI systems, Slack bots, monitoring dashboards) needed programmatic access to execution state. Building a separate microservice would duplicate all business logic. Embedding the API in the same package ensures it always uses the same engine version and data models.
Alternatives considered:
-
MCP server only (Model Context Protocol): Rejected as the sole interface because MCP is designed for LLM tool use, not for human-facing UIs or webhook delivery. The PMO React app needs standard REST + SSE.
-
Separate FastAPI microservice: Rejected because it would duplicate models, engine logic, and storage access code. It would also require a deployment story (Docker, port configuration, process management) that contradicts the "pip install and go" philosophy.
-
Flask or Django: Rejected. FastAPI provides native async support (needed for SSE streaming), automatic OpenAPI docs, Pydantic request/ response validation, and dependency injection -- all of which reduce boilerplate compared to Flask.
Key trade-offs:
-
Optional dependency group:
pip install agent-baton[api]installs FastAPI, Uvicorn, and SSE-starlette. The base package has no web framework dependency. Route modules are imported lazily insidecreate_app()so that a missing optional dependency raisesImportErroronly if the route is actually used. -
Singleton DI via module-level variables:
api/deps.pystores singletons as_privatemodule-level variables initialized byinit_dependencies(). This avoids a DI container while keeping FastAPI'sDepends()declarations readable in route files. -
Single shared EventBus: All API components share one
EventBusinstance. SSE streaming and webhook delivery observe all engine events regardless of which component emitted them. The SSE adapter bridges the synchronous bus to async consumers via anasyncio.Queue. -
baton serve+baton daemon start --serve: The API can run standalone (baton serve) or combined with daemon execution (baton daemon start --serve). Combined mode starts both the worker loop and the API server in the same process, sharing the event bus.
Status: Implemented (2026-03-23)
ADR-18: PMO as Kanban Board with Forge Plan Generation¶
Decision: Build the Portfolio Management Office (PMO) as a three-component
subsystem: PmoScanner (reads execution state across projects and produces
Kanban cards), PmoStore/PmoSqliteStore (persists projects, programs,
signals, and archived cards), and ForgeSession (consultative plan creation
via headless Claude). The PMO UI is a React/Vite SPA served at /pmo/ by the
FastAPI server.
Context: Managing multiple concurrent agent executions across projects
required a unified view of what was queued, in progress, awaiting human input,
and completed. Without this, the operator had to run baton status in each
project directory individually. The PMO also needed to bridge the gap between
"I have a signal (bug report, feature request, incident)" and "I have an
execution plan" -- this is the Forge.
Alternatives considered:
-
CLI-only portfolio view (
baton pmo statusas a table): Implemented as a baseline, but insufficient for at-a-glance triage of 10+ concurrent plans. A Kanban board provides spatial organization (columns = workflow state) that a text table cannot. -
External tool integration only (pipe to Jira/ADO, view there): Rejected as the primary view because it adds a hard dependency on an external system and introduces sync lag. External sources are supported via the
ExternalSourceAdapterprotocol (ADR-12) as supplementary data, not as the primary management surface. -
Server-rendered HTML: Rejected in favor of a React SPA because the Forge's interactive plan editing, drag-and-drop priority management, and real-time SSE updates require rich client-side state management.
Key trade-offs:
-
PmoScanner auto-detects storage backend: For each registered project, the scanner calls
detect_backend()and reads execution state from either SQLite or legacy JSON files. This means the PMO works across projects at different migration stages without configuration. -
ForgeSession uses headless Claude: Plan generation calls
claude --printviaHeadlessClaudeto produce LLM-quality plans. When the Claude CLI is unavailable, it falls back to the rule-basedIntelligentPlanner. This gives the best available quality without requiring an API key -- it uses the same authentication as the user's Claude Code session. -
Interview-driven refinement: After generating an initial plan, the Forge produces 3-5 structured questions about ambiguities (missing tests, no gates, high risk, multi-agent coordination). The user's answers are fed back to regenerate an improved plan. This catches planning errors before execution starts rather than during.
-
Signal triage flow: PMO signals (production incidents, bug reports, feature requests) can be triaged directly into execution plans via
ForgeSession.triage_signal(). This connects the "something happened" observation to the "here is what we will do about it" plan in a single workflow.
Status: Implemented (2026-03-26, with ongoing UX improvements)
ADR-19: Headless Claude Execution via claude --print Subprocess¶
Decision: Implement autonomous execution via HeadlessClaude, a
subprocess wrapper around claude --print that enables plan generation and
agent dispatch without an active Claude Code interactive session. Used by
ForgeSession for LLM plan generation, baton execute run for autonomous
execution loops, and the PMO execute endpoint for UI-driven launches.
Context: Two capabilities required Claude interaction without a human session: (1) the Forge needed LLM-quality plan generation from the PMO UI, and (2) full autonomous execution needed to run a plan end-to-end without an orchestrator agent driving the loop interactively. Both cases required programmatic access to Claude in a non-interactive mode.
Alternatives considered:
-
Anthropic API SDK directly: Rejected because it requires an API key separate from the user's Claude Code subscription, introduces billing complexity, and loses the environment context (CLAUDE.md, project files, agent definitions) that
claude --printinherits from the project directory. -
MCP tool call: Rejected because MCP is designed for Claude calling tools, not for tools calling Claude. The control flow is inverted.
-
Persistent Claude Code session via stdin/stdout: Rejected because Claude Code's interactive mode is designed for human conversation, not programmatic use. Its output format is not stable for machine parsing.
--printmode with--output-format jsonprovides a stable JSON response contract.
Key trade-offs:
-
Subprocess isolation: Each headless invocation is a fresh subprocess with a sanitized environment (only allowlisted env vars pass through:
ANTHROPIC_API_KEY,CLAUDE_CODE_USE_BEDROCK,CLAUDE_CODE_USE_VERTEX,AWS_PROFILE,AWS_REGION). API keys are scrubbed from error logs via regex. This provides security isolation but adds per-invocation overhead (~2-5s startup). -
Retry with exponential backoff:
max_retries=2withbase_retry_delay=5.0s. Rate-limited or transient failures are handled automatically. The caller receives aHeadlessResultwithsuccess=Falseafter exhausting retries rather than an exception. -
Large prompt handling: Prompts exceeding 128KB are sent via stdin rather than the
-pflag to avoid shell argument-length limits. -
baton execute runfull loop: CombinesHeadlessClaudewith the execution engine to drive the complete start -> dispatch -> gate -> complete cycle. Each agent dispatch spawns a freshclaude --printsubprocess with the delegation prompt. No Claude Code session required.
Status: Implemented (2026-03-26)
ADR-20: Haiku Classifier for Adaptive Plan Sizing¶
Decision: Introduce a TaskClassifier protocol with two implementations:
HaikuClassifier (calls Claude Haiku via claude --print for intelligent
classification) and KeywordClassifier (deterministic fallback using keyword
heuristics and registry-aware agent scoring). The planner uses
FallbackClassifier which tries Haiku first and degrades to keywords.
Classification output (TaskClassification) determines task type, complexity
tier, agent roster, and phase sequence.
Context: The planner originally used hardcoded heuristics to determine plan structure -- every task got a similar number of phases and agents regardless of actual complexity. Simple bug fixes received the same multi-phase plan as complex cross-cutting refactors, wasting tokens and human review time. A "fix a typo" task should not generate a 5-phase plan with 4 agents.
Alternatives considered:
-
Always use LLM classification: Rejected because it adds latency and cost to every
baton planinvocation. The keyword classifier handles the common case (obvious task types like "fix bug X" or "add feature Y") without any API call. LLM classification activates only for ambiguous or complex descriptions. -
User-specified complexity: Rejected as the primary mechanism because users consistently underestimate complexity. Auto-classification with user override (via
--complexity light|medium|heavy) is the chosen pattern. -
Single classifier, no protocol: Rejected because it prevents testing the planner independently of the LLM. The protocol allows injecting a deterministic classifier in tests.
Key trade-offs:
-
Three complexity tiers:
light(1 agent, 1-2 phases),medium(up to 3 agents, 2-4 phases),heavy(up to 5 agents, 4+ phases). Agent roster caps (_MAX_AGENTS_BY_COMPLEXITY) prevent bloated plans regardless of how many agents the classifier suggests. -
Registry-aware scoring: The
KeywordClassifierscores agents from theAgentRegistryby keyword overlap with the task summary, category affinity, and preferred-primary-implementer rules. This means classification adapts to the available agents -- a project with custom agents gets appropriate routing. -
Word-boundary matching: Keyword scoring uses
\bregex boundaries rather than substring matching. This prevents false positives like "fix" matching inside "prefix" or "test" inside "latest". Multi-word keywords use substring matching since they are specific enough. -
Haiku model choice: Claude Haiku was chosen for classification because it is fast (~1-2s), cheap, and sufficient for structured classification tasks. Using Sonnet or Opus for classification would add unnecessary latency and cost. The
--modelflag onHaikuClassifierallows overriding if needed.
Status: Implemented (2026-03-26)
ADR-21: Daemon Mode with Team Collaboration and Async Decisions¶
Decision: Implement daemon-mode execution via WorkerSupervisor (PID
file management, structured logging, signal handling) wrapping TaskWorker
(async execution loop). Add team collaboration infrastructure: SynthesisSpec
on PlanStep for team output coordination, conflict detection via file-overlap
heuristics, DecisionManager for file-based async human decisions, and
ContributionRequest for multi-party input collection. MCP server pass-through
is selective per step.
Context: Interactive orchestration via the Claude Code session works well for attended development, but two scenarios required unattended execution: (1) long-running plans (30+ minutes) where the developer wants to step away, and (2) multi-developer workflows where different team members handle different steps asynchronously. The daemon needs to run the full execution loop, handle gates and approvals without a human present, and allow out-of-band interaction when human input is required.
Alternatives considered:
-
Background shell process (
nohup baton execute run &): Rejected because it provides no PID management, no structured logging, no graceful shutdown, and no status querying. The developer has no way to check progress or intervene without parsing raw stdout. -
Systemd service: Rejected because it requires root access for installation, is Linux-only, and is too heavy for per-project per-execution lifecycle management. The supervisor handles the same concerns in user-space.
-
Central daemon managing all projects: Rejected because it creates a single point of failure across projects. Per-project supervisors are independent -- one project's daemon crash does not affect others.
Key trade-offs:
-
PID file with flock:
WorkerSupervisorusesflock()on the PID file to prevent duplicate daemons. The OS releases the lock automatically when the process exits, eliminating the stale-PID-file race condition common with naive PID file approaches. On Windows,msvcrt.lockingprovides equivalent behavior. -
File-based decision protocol:
DecisionManagerwrites decision requests as JSON files + companion.mdsummaries. Operators resolve decisions from a separate CLI session (baton decide --resolve). This avoids requiring a running server or shared database for human interaction -- files are the universal interface. -
Team synthesis strategies:
PlanStep.synthesis_specsupports three strategies:concatenate(append outputs),merge_files(git-merge file changes), andagent_synthesis(dispatch a synthesis agent to combine outputs). The strategy is selected at plan time based on step type and team size. -
Conflict escalation: When team members modify overlapping files,
_detect_team_conflict()produces aConflictRecord. The default resolution is escalation to an APPROVAL gate;auto_mergeis available for low-risk conflicts. Conflicts flow through to retrospectives for learning. -
Selective MCP pass-through:
PlanStep.mcp_serversallows specific MCP servers to be forwarded to specific agents. This prevents agents from accessing tools they do not need while enabling those that do (e.g., a data-engineer step with database MCP access). -
Namespaced execution directories: When
task_idis provided, all daemon files (worker.pid,worker.log,worker-status.json) are stored underexecutions/<task_id>/, enabling concurrent daemon executions within the same project.
Status: Implemented (2026-03-29, Phases 1-5)
ADR-22: Foresight Engine — Proactive Gap Analysis During Plan Creation¶
Decision: Add a ForesightEngine that scans planned phases and steps
during create_plan() and inserts preparatory phases when it predicts
capability gaps, missing prerequisites, or edge cases.
Context: Users frequently describe tasks at the outcome level ("build a data quality pipeline") without specifying prerequisites that agents will need ("the data quality agent needs delete/drop capability for duplicates"). This gap surfaced repeatedly in retrospectives as knowledge-gap signals after execution had already started — wasting budget on interrupted steps. The foresight engine addresses this proactively at plan-time instead of reactively at execution-time, before any agents are dispatched.
Mechanism: The engine evaluates a set of ForesightRule objects against
step descriptions and agent assignments. Each rule has trigger keywords,
trigger agents, a confidence score, and a resolution template. When a rule
matches, a preparatory phase is inserted before the triggering phase. Rules
fire at most once (deduplicated by rule_id). Confidence thresholds scale
inversely with risk level — higher-risk plans lower the threshold so more
rules fire. Each insertion is recorded as a ForesightInsight on the plan.
Integration point: Step 9.7 in create_plan(), after phase enrichment
and knowledge resolution (9.5-9.6), before score checks and budget (10-11).
A second knowledge resolution pass (9.8) runs on foresight-inserted steps
so they receive the same knowledge attachments as other steps.
Key trade-offs:
-
Heuristic rules vs. LLM analysis: Heuristic rules are fast (no API call), deterministic, and testable. LLM analysis would catch more subtle gaps but adds latency and cost to every
baton planinvocation. The engine is designed to accept custom rules, so a future LLM-based rule source can be added without changing the architecture. -
Insert-before vs. amend-during: Foresight inserts phases at plan-time rather than amending during execution. This keeps plans self-contained and reviewable — users see the full plan including prerequisites before execution starts.
Status: Implemented (2026-03-29)
ADR-23: Planning Taxonomy — Universal Semantic Object Model¶
Decision: Codify a formal taxonomy (models/taxonomy.py,
references/planning-taxonomy.md) that defines the semantic vocabulary for
all planning concepts: step intents, phase archetypes, plan element kinds,
foresight insight categories, and agent-intent affinity mappings.
Context: The planning system had implicit classifications scattered
across constants (_PHASE_VERBS, _PHASE_IDEAL_ROLES, _STEP_TEMPLATES)
without a unified model that named what each element means. This made it
hard for agents to reason about their role in a plan and for users to
understand why the planner made specific choices. The taxonomy provides
machine-readable definitions (enums, dataclasses) and human-readable
documentation (reference doc, knowledge pack) from a single source.
Key concepts:
StepIntent— 13 intents grouped into 4 families (core work, support, governance, foresight) that describe why a step exists.PhaseArchetype— 7 archetypes that map free-form phase names to canonical lifecycle roles (discovery > design > preparation > implementation > verification > review > remediation).PlanElementKind— type discriminator for every structural element in the plan hierarchy.AGENT_INTENT_AFFINITY— maps each agent to the intents it naturally fulfils.
Delivery: PlanningTaxonomy class provides to_dict() for structured
data, to_markdown() for human rendering, and per-vocabulary query methods
(step_intents(), phase_archetypes(), plan_elements()). A knowledge
pack (planning-taxonomy) and reference doc (planning-taxonomy.md)
deliver the taxonomy to agents and users respectively.
Status: Implemented (2026-03-29)