# Resilience and Task Recovery System

## Overview

Marcus runs many autonomous agents in parallel, each working in its own git worktree. Agents can die in many ways: the tmux pane is killed, a network connection drops, the process crashes, or the host machine reboots. When that happens, the tasks those agents held must be detected, released, and handed off to other agents so work continues, without losing the committed progress the dead agent already made.

The **Resilience and Task Recovery System** is how Marcus does that. It uses **lease-based liveness detection** with **cadence-aware false-positive prevention**, a **worktree-aware handoff protocol**, and an **in-memory state cleanup callback** to safely return recovered tasks to the assignment pool.

This document describes the final implementation landed on the `feature/resilience-wiring-cleanup` branch.

## Design Goals

1. **Detect dead agents quickly**: seconds to minutes, not hours.
2. **Minimize false positives**: don't recover tasks from agents that are simply slow. Slow is not dead.
3. **Preserve committed work**: if the dead agent made real progress and committed it to their branch, the next agent should build on it, not restart from scratch.
4. **Stay loosely coupled**: agents don't need to know about leases or send explicit heartbeats; any MCP tool call proves they're alive.
5. **Match Marcus's board-mediated pattern**: no WebSockets, no bespoke heartbeat protocol. Polling plus a board as the source of truth.

## Architecture at a Glance

```
┌─────────────────────────────────────────────────────────────────────┐
│                            MarcusServer                             │
│                                                                     │
│  ┌─────────────────────┐      ┌──────────────────────────────┐      │
│  │ AssignmentLease-    │◄─────┤ LeaseMonitor (asyncio task)  │      │
│  │ Manager             │      │ polls every 60s              │      │
│  │                     │      └──────────────────────────────┘      │
│  │ active_leases       │                                            │
│  │ on_recovery_        │──────┐                                     │
│  │   callback          │      │ cleans agent_tasks,                 │
│  └──────────┬──────────┘      │ tasks_being_assigned                │
│             │                 ▼                                     │
│             │        ┌──────────────────┐                           │
│             │        │ state (server)   │                           │
│             │        │ agent_tasks{}    │                           │
│             │        │ tasks_being_     │                           │
│             │        │   assigned{}     │                           │
│             │        └──────────────────┘                           │
│             │                                                       │
│             │ touch_lease() on every MCP tool call                  │
│             │                                                       │
│  ┌──────────▼──────────┐                                            │
│  │ handlers.py         │                                            │
│  │ (MCP tool dispatch) │◄──── agents call tools                     │
│  └─────────────────────┘                                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                           ┌───────────────┐
                           │ Kanban Board  │
                           │ (source of    │
                           │  truth for    │
                           │  task state)  │
                           └───────────────┘
```

## Key Components

### 1. AssignmentLeaseManager

**File**: `src/core/assignment_lease.py`

The lease manager tracks a lease for every in-progress task assignment. A lease is a lightweight record with:

- `agent_id`: which agent holds the task
- `task_id`: the task being leased
- `assigned_at`: when it was first handed out
- `lease_expires`: when the lease would expire without renewal
- `renewal_count`: how many times it has been renewed
- `progress_percentage`: the last known progress
- Update history: timestamps used to compute the agent's median update interval

Three important methods drive the lease lifecycle:

- `touch_lease(agent_id)`: a cheap extension. Called on any MCP tool activity from the agent. Does not require progress data.
- `renew_lease(task_id, progress)`: a full renewal with progress data. Called when the agent explicitly reports progress.
- `recover_expired_lease(lease)`: resets the task to `TODO`, clears `assigned_to`, builds a `RecoveryInfo` object, dual-writes to the board, and invokes `on_recovery_callback`.

#### Progressive Timeout Phases

Instead of a single fixed timeout, Marcus uses progressive timeouts that match where the task is in its lifecycle. A task that has not yet produced a first progress update is treated very differently from one that is 80% complete.

| Phase | Trigger | Lease | Grace | Total | Rationale |
|-------|---------|-------|-------|-------|-----------|
| 1. Unproven | 0 updates | 60s | 20s | 80s | Detect startup failures fast. |
| 2. Working | 1 update | 90s | 30s | 120s | Agent is alive, be moderate. |
| 3. Proven | 25–75% progress | 120s | 30s | 150s | Protect in-flight work. |
| 4. Finishing | >75% progress | 60s | 15s | 75s | Detect final stalls quickly. |

### 2. LeaseMonitor

**File**: `src/core/assignment_lease.py`

A background `asyncio` task that wakes up every 60 seconds and walks `active_leases`, calling `check_expired_leases`. For each expired lease it calls `should_recover_expired_lease` (the cadence-aware check) and, if the check says recover, calls `recover_expired_lease`.

**Critical detail (event loop affinity)**: the `LeaseMonitor` must run on `uvicorn`'s event loop, not on whatever loop happened to exist at server setup time. The HTTP transport starts its own loop for each request context, and a monitor created during setup will be bound to the wrong loop and never fire. To solve this, the server exposes `ensure_lease_monitor_running()` and the first call to `request_next_task` (handled on the correct loop) lazily starts the monitor.

```python
# src/marcus_mcp/tools/task.py
if hasattr(state, "ensure_lease_monitor_running"):
    await state.ensure_lease_monitor_running()
```

### 3. Cadence-Based Recovery

**File**: `src/core/assignment_lease.py` (`should_recover_expired_lease`)

Fixed timeouts produce false positives for agents that naturally update on a slower cadence (e.g., a research-heavy task with long think time). Rather than asking "has the timeout expired?", Marcus asks "is this silence abnormal for **this specific agent**?"

The algorithm:

1. Compute the agent's median interval between progress updates.
2. Compare the current silence (time since last update) to `median_interval * silence_multiplier`.
3. If silence exceeds the threshold, the agent is probably dead, so recover.
4. Otherwise, extend grace and try again next cycle.

The default `silence_multiplier` is `1.5`. An agent whose median update interval is 60 seconds will only be recovered after more than 90 seconds of silence. An agent whose median is 180 seconds gets 270 seconds.

### 4. Recovery Callback Pattern

**File**: `src/marcus_mcp/server.py`

When `recover_expired_lease` fires, the task is reset on the board, but the server also holds **in-memory** tracking of who owns what:

- `state.agent_tasks[agent_id]`: what the agent is currently assigned
- `state.tasks_being_assigned`: tasks mid-assignment

If those aren't cleaned up, the assignment filter will keep refusing to offer the recovered task to anyone, because it still looks taken.

The lease manager solves this with a callback, set by the server:

```python
self.lease_manager.on_recovery_callback = _on_recovery
```

Inside `_on_recovery`, the server removes the entry from `agent_tasks` and `tasks_being_assigned`. This keeps `AssignmentLeaseManager` free of direct dependencies on server state while still wiring the two together.

### 5. Touch-on-Any-Tool-Call

**File**: `src/marcus_mcp/handlers.py`

Marcus never asks agents to send heartbeats. Instead, **every MCP tool call from an agent acts as a heartbeat**.
The dispatch loop inspects the tool arguments and, if an `agent_id` is present, calls:

```python
await state.lease_manager.touch_lease(agent_id)
```

This means `log_decision`, `log_artifact`, `report_blocker`, `get_task_context`, and every other tool the agent might call all keep the lease alive. Agents prove they are working by working.

### 6. Lease Recreation on Progress Report

There is one edge case the touch pattern can't cover: an agent survives a **false-positive recovery**. The cadence check misjudged their silence, the monitor recovered the task, and then the agent calls `report_task_progress`, but there is no longer a lease to renew.

The fix: when `report_task_progress` runs and the agent's lease is gone, it **recreates** the lease instead of failing. This means the agent continues their work, the monitor starts watching again, and at worst the task briefly showed as `TODO` on the board.

### 7. RecoveryInfo

**File**: `src/core/models.py`

A structured record attached to the task model when recovery happens.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RecoveryInfo:
    recovered_at: datetime
    recovered_from_agent: str
    previous_progress: int
    time_spent_minutes: float
    recovery_reason: str
    previous_agent_branch: Optional[str]
    instructions: str
    recovery_expires_at: datetime  # 24h window
```

`RecoveryInfo` is **dual-written**:

1. Set on `task.recovery_info` (in-memory, source of truth for handoff)
2. Appended as a Kanban comment (durable audit trail)

Because `recovery_info` is in-memory only, `server.refresh_project_state` explicitly **captures and re-applies** it across refreshes so that a refresh can't silently drop the handoff context.

### 8. Worktree-Aware Recovery Instructions

Every Marcus agent works on its own git branch, named `marcus/` followed by the agent ID (for example, `marcus/Agent-A`). When an agent dies, the commits it made still live on that branch.
The recovery instructions tell the **next** agent exactly how to pick them up:

```
git merge marcus/ --no-edit
git log marcus/
```

Each command takes the dead agent's full branch name, such as `marcus/Agent-A`. The next agent merges committed work, reviews what was done, then continues from where the previous agent left off. This is the difference between "recovered" and "redone."

### 9. Recovery Handoff in Task Instructions

**File**: `src/marcus_mcp/tools/task.py` (`build_tiered_instructions`)

Task instructions are built in layers. A new **Layer 1.1: Recovery Handoff** sits just above the normal task body. When `task.recovery_info` is set and not expired (24h window), the layer is populated with the full handoff message: previous agent ID, previous progress, time spent, recovery reason, and the git merge instructions.

The next agent sees the handoff as soon as they receive the task: no separate notification, no risk of missing it.

### 10. Assignment Filter Respects `assigned_to`

**File**: `src/marcus_mcp/tools/task.py` (`_find_optimal_task_original_logic`)

The assignment filter honors **both** in-memory tracking (`agent_tasks`, persistence) **and** the board-level `assigned_to` field. This has two important effects:

1. Design tasks are assigned to the literal string `"Marcus"` and are handled internally by `_run_design_phase`. The filter skips them so no agent tries to grab them.
2. Recovered tasks have `assigned_to` cleared by `recover_expired_lease`. Because the filter checks `assigned_to is None`, the task immediately re-enters the pool.

### 11. Gridlock Detector

**File**: `src/core/gridlock_detector.py`

A separate safety net. Rather than counting raw request volume (which produces false positives under Marcus's 30-second polling pattern), the detector looks at **task state**: if every `TODO` is blocked by unfinished dependencies and there are **zero** in-progress tasks, the system is gridlocked. It also tracks distinct requesting agents for metrics.

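
The task-state rule above can be sketched in a few lines. This is a minimal illustration, not the code in `src/core/gridlock_detector.py`: the `Task` and `Status` types, the `dependencies` field, and the `is_gridlocked` name are assumptions made for the example, and the per-agent request metrics are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    DONE = "done"


@dataclass
class Task:
    # Hypothetical task shape used only for this sketch.
    id: str
    status: Status
    dependencies: List[str] = field(default_factory=list)


def is_gridlocked(tasks: List[Task]) -> bool:
    """Return True when no work is running and every TODO task is blocked."""
    by_id: Dict[str, Task] = {task.id: task for task in tasks}
    todo = [task for task in tasks if task.status is Status.TODO]

    # Nothing left to do, or someone is still working: not gridlocked.
    if not todo:
        return False
    if any(task.status is Status.IN_PROGRESS for task in tasks):
        return False

    def is_blocked(task: Task) -> bool:
        # Assumption: a dependency missing from the board counts as unfinished.
        return any(
            dep not in by_id or by_id[dep].status is not Status.DONE
            for dep in task.dependencies
        )

    return all(is_blocked(task) for task in todo)
```

Because the check reads only task state, repeated polling by idle agents cannot trigger it; two `TODO` tasks that depend on each other with nothing in progress would.
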
## Configuration

All resilience tuning lives in `src/config/marcus_config.py` under `TaskLeaseSettings`. The aggressive defaults that match Marcus's real-world agent cadence are:

| Setting | Default | Meaning |
|---------|---------|---------|
| `default_hours` | `0.025` | ~90 seconds base lease. |
| `grace_period_minutes` | `0.5` | 30 seconds of grace after expiry. |
| `min_lease_hours` | `0.0167` | 60 seconds, the floor. |
| `max_lease_hours` | `0.0833` | 5 minutes, the ceiling. |
| `warning_hours` | `0.01` | ~36s before expiry, emit a warning. |
| `max_renewals` | `10` | Safety cap on renewal count. |
| `stuck_threshold_renewals` | `5` | Flag for stuck-task detection. |
| `silence_multiplier` | `1.5` | Cadence threshold multiplier. |
| `enable_adaptive` | `true` | Enable progressive phases. |
| `renewal_decay_factor` | `0.9` | Decay applied on renewal. |

`priority_multipliers` and `complexity_multipliers` scale lease duration for high-priority or complex tasks. The dict-path fallback defaults in `server.py` mirror these values so config-less startup still matches the dataclass defaults.

## Full Recovery Flow (Agent Dies)

The following trace shows everything that happens from assignment to handoff.

```
T+0s    Agent-A requests a task.
        ├─ Task assigned: status=IN_PROGRESS, assigned_to=Agent-A
        ├─ state.agent_tasks[Agent-A] = task
        └─ AssignmentLeaseManager creates lease (Phase 1: 60s + 20s grace)

T+15s   Agent-A calls log_decision(...)
        └─ handlers.py touches lease → extended

T+40s   Agent-A calls report_task_progress(progress=15)
        ├─ lease renewed with progress
        └─ Phase transitions to 2 (90s + 30s grace)

T+55s   ☠️ Agent-A's tmux pane is killed. No more tool calls.

T+175s  Lease expires past grace. LeaseMonitor wakes up (60s interval).
        ├─ should_recover_expired_lease(lease):
        │   ├─ median_update_interval(Agent-A) = 25s
        │   ├─ silence_threshold = 25s * 1.5 = 37.5s
        │   ├─ current silence = 135s (last tool call at T+40s)
        │   └─ 135s > 37.5s → RECOVER
        │
        └─ recover_expired_lease(lease):
            ├─ Build RecoveryInfo(
            │     recovered_from_agent="Agent-A",
            │     previous_progress=15,
            │     time_spent_minutes=2.0,
            │     recovery_reason="lease_expired",
            │     previous_agent_branch="marcus/Agent-A",
            │     instructions="git merge marcus/Agent-A ...",
            │     recovery_expires_at=now+24h
            │   )
            ├─ task.recovery_info = (RecoveryInfo built above)
            ├─ task.assigned_to = None
            ├─ Kanban: status=TODO, assigned_to=None
            ├─ Kanban comment with handoff text
            ├─ active_leases.pop(task_id)
            ├─ persistence.remove_assignment(Agent-A)
            └─ on_recovery_callback(Agent-A, task_id)
                └─ server cleans:
                    ├─ state.agent_tasks.pop(Agent-A)
                    └─ state.tasks_being_assigned.discard(task_id)

T+180s  Agent-B calls request_next_task.
        ├─ ensure_lease_monitor_running() (already running)
        ├─ Assignment filter walks TODO tasks:
        │   ├─ task.status == TODO ✓
        │   ├─ task.id not in all_assigned_ids ✓
        │   ├─ task.assigned_to is None ✓
        │   └─ task selected
        │
        ├─ build_tiered_instructions(task, agent=Agent-B):
        │   └─ Layer 1.1: Recovery Handoff
        │       "⚠️ RECOVERY ADDENDUM — recovered from Agent-A
        │        git merge marcus/Agent-A --no-edit
        │        git log marcus/Agent-A
        │        Previous agent reached 15% ..."
        │
        └─ Lease created for Agent-B (Phase 1 again)

T+181s  Agent-B runs git merge marcus/Agent-A, sees Agent-A's commits,
        continues the task from 15%.
```

## Design Task Handling

Design tasks are a special case. They are created with `assigned_to="Marcus"` and handled internally by `_run_design_phase` as a background task on the server. The assignment filter treats any task whose `assigned_to` is `"Marcus"` as off-limits to agents. When the design task completes, it is marked `done` on the board, which unblocks its dependents through the normal dependency system.
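
The skip rule can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not the body of `_find_optimal_task_original_logic`: the `Task` shape, the `is_offerable` name, and the choice to model `agent_tasks` as a mapping from agent ID to task ID are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Task:
    # Hypothetical, simplified task record for this sketch.
    id: str
    status: str
    assigned_to: Optional[str] = None


def is_offerable(
    task: Task,
    agent_tasks: Dict[str, str],
    tasks_being_assigned: Set[str],
) -> bool:
    """Return True if an agent may be offered this task."""
    if task.status != "TODO":
        return False
    # Board-level owner: any non-empty value blocks the task, which covers
    # both tasks held by agents and design tasks owned by "Marcus".
    if task.assigned_to is not None:
        return False
    # In-memory tracking: already held by an agent, or mid-assignment.
    if task.id in agent_tasks.values() or task.id in tasks_being_assigned:
        return False
    return True
```

With this shape, `is_offerable(Task("d1", "TODO", assigned_to="Marcus"), {}, set())` is rejected by the board-level check alone, even though both in-memory structures are empty.
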

This is why the assignment filter must check `assigned_to` and not just the server's in-memory `agent_tasks`: the Marcus-owned design tasks don't live in `agent_tasks` at all.

## Key Architectural Decisions

### Polling over WebSocket heartbeats

Marcus is board-mediated. Every durable piece of state lives on the board. Adding a parallel heartbeat channel would introduce a second source of truth with its own failure modes. Polling the leases every 60 seconds fits the existing pattern and is cheap: it's an in-memory walk of a dict.

### Cadence-based recovery over fixed timeouts

Fixed timeouts force a choice between "fast detection" and "low false positive rate." Cadence-based recovery breaks the trade-off by adapting to each agent individually. An agent with a 20-second median update gets a 30-second silence window; an agent with a 3-minute median gets 4.5 minutes.

### Touch-on-any-tool as the liveness signal

Explicit heartbeats would require every agent to opt in and stay in sync with the protocol. Touching the lease on any MCP tool call means the heartbeat is implicit in real work. Agents that are doing things stay alive. Agents that are stuck or dead stop touching. That is exactly the signal we want.

### Lease recreation on progress report

Even a 3–5% false positive rate is unacceptable if it means the agent keeps running with no monitor watching. Recreating the lease on `report_task_progress` makes false positives **self-healing**: the system notices its mistake on the next progress update and resumes normal monitoring.

### Callback for state cleanup

`AssignmentLeaseManager` does not import server state. The server injects a callback at construction time, which the manager fires on recovery. This keeps the lease module independently testable and prevents a circular dependency.

### Lazy monitor start on the correct loop

The HTTP transport spins up its event loop per request context.
A monitor created during `__init__` is bound to a loop that no longer exists by the time a request arrives. Deferring monitor start to the first `request_next_task` call, which runs on the live request loop, pins the monitor to the right loop and keeps it alive for the server lifetime.

## Testing

Coverage for this system is split across unit, integration, and handoff tests.

| Test | Path | Covers |
|------|------|--------|
| Assignment lease unit | `tests/unit/core/test_assignment_lease.py` | Lease lifecycle, touch/renew, progressive phases. |
| Progressive timeout | `tests/unit/core/test_progressive_timeout.py` | Phase transitions and timeout calculation. |
| Gridlock detector | `tests/unit/core/test_gridlock_detector.py` | Task-state gridlock detection. |
| Recovery handoff | `tests/unit/mcp/test_recovery_handoff.py` | Layer 1.1 instructions and 24h expiry. |
| Resilience end-to-end | `tests/integration/test_resilience_e2e.py` | Full kill → recover → reassign flow. |

## Troubleshooting

### Recovered tasks are not reassigned to any agent

Check the assignment filter in `_find_optimal_task_original_logic`. A task will be skipped if any of the following are still true:

- `task.assigned_to` is not `None`
- `task.id` is still in `state.agent_tasks[some_agent]`
- `task.id` is still in `state.tasks_being_assigned`

All three must be cleared during recovery. If one is not, the `on_recovery_callback` wiring in `server.py` is broken or the lease manager was constructed without a callback set.

### Leases never expire even though agents are dead

The `LeaseMonitor` is probably bound to the wrong event loop. Confirm that `ensure_lease_monitor_running()` is being called from `request_next_task` and that the first call actually runs. You can add a debug log in the monitor's poll loop to verify it is ticking.

### Recovery fires on agents that are actually alive

Either the cadence check is too aggressive for your workload, or agents are not touching the lease frequently enough.
Options:

1. Increase `silence_multiplier` (default `1.5` → try `2.0`).
2. Increase `default_hours` or `min_lease_hours` in `TaskLeaseSettings`.
3. Confirm that the tool the agent is calling passes `agent_id` in its arguments. If it doesn't, `touch_lease` is never called.

### `recovery_info` disappears after a refresh

`refresh_project_state` in `server.py` must capture `recovery_info` before the refresh and re-apply it afterward. If this block is removed or reordered, handoff information is silently lost. The `recovery_info` field is in-memory only; it is not stored by the Kanban provider. The Kanban comment remains as an audit trail, but the next agent will not see the Layer 1.1 handoff in their task instructions.

### Design tasks are being offered to agents

Confirm the assignment filter is checking `task.assigned_to != "Marcus"`. Design tasks rely on this exact string match.

### A false-positive recovery left an orphaned agent

Normally the agent's next `report_task_progress` recreates the lease. If that is not happening, verify that the progress-report handler calls into `AssignmentLeaseManager` when the lease is missing rather than failing. Check the logs for "lease not found, recreating" or similar.

## Related Documentation

- [Assignment Lease System](35-assignment-lease-system.md): deeper dive into the lease data model and adaptive duration math.
- [Orphan Task Recovery](33-orphan-task-recovery.md): complementary recovery path for tasks left behind by non-lease mechanisms.
- [Agent Coordination](21-agent-coordination.md): how task assignment, progress reporting, and the assignment filter fit together.
- [Smart Retry Strategy](46-smart-retry-strategy.md): retry and backoff policies used alongside recovery.