# 33. Orphan Task Recovery (Safety Net) > **Primary recovery mechanism:** The main path for detecting dead agents > and recovering their tasks is the lease system. See > [Resilience and Task Recovery System](34-agent-recovery-system.md) and > [Assignment Lease System](35-assignment-lease-system.md). > > This document describes the **out-of-band safety net** that runs > alongside the lease system: startup reconciliation and the assignment > monitor. These mechanisms exist to catch assignments that slip past the > lease-based path — mismatches between in-memory state, persistence, and > the kanban board — which can happen during server restarts or when code > paths update the board without touching the lease manager. ## Where It Fits Marcus has three layers of protection against orphaned tasks: 1. **Primary — Assignment Lease System.** Every in-progress task has a short lease (seconds to minutes). Dead agents are detected within one to two monitor cycles via cadence-based recovery. This is the path that runs while the server is live and handles the common case of "agent tmux pane was killed." See [Resilience and Task Recovery System](34-agent-recovery-system.md). 2. **Safety net — Assignment Monitor.** A separate 30-second polling loop that watches for state **reversions** on the board (a task that was `IN_PROGRESS` drops back to `TODO` without going through the lease manager) and cleans stale persistence entries. This catches edge cases where the board is updated outside of Marcus or where a code path forgets to notify the lease manager. 3. **Startup — Assignment Reconciler.** A one-shot pass at server boot that validates every persisted assignment against the live kanban state, removes stale entries, and restores orphaned `IN_PROGRESS` tasks that exist on the board but are missing from persistence. Taken together, the lease system handles **live liveness detection**, while the components in this document handle **state consistency** across restarts and between Marcus's three stores of truth (in-memory, persistence, kanban board). ## Components ### 1. Assignment Monitor (`AssignmentMonitor`) **File:** `src/monitoring/assignment_monitor.py` A background task that polls every `check_interval` seconds (default 30) looking for task state reversions. It does **not** perform liveness detection — that is the lease manager's job. It specifically catches cases where: - The board shows a task flipped back to `TODO` but persistence still thinks it is assigned. - The board shows a task was reassigned to a different agent than the one persistence has recorded. - The board shows a task as `DONE` but persistence still has it assigned. - The board shows a task `BLOCKED` with no assignee, but persistence still has it assigned. ```python async def _detect_reversion(self, task: Task, worker_id: str) -> bool: # Case 1: Task went back to TODO (could be lease recovery from a # previous server instance, or a manual board edit). if task.status == TaskStatus.TODO: return True # Case 2: Task is IN_PROGRESS but assigned to a different worker. if task.status == TaskStatus.IN_PROGRESS and task.assigned_to != worker_id: return True # Case 3: Task completed by someone else. if task.status == TaskStatus.DONE and task.assigned_to != worker_id: return True # Case 4: Task blocked with no assignee. if task.status == TaskStatus.BLOCKED and not task.assigned_to: return True return False ``` When a reversion is detected, the monitor removes the stale persistence entry. It does **not** generate `RecoveryInfo` or handoff instructions — that is the lease system's responsibility on the primary path. The monitor is strictly a bookkeeping cleanup layer. #### Monitoring Loop ```python async def _monitor_loop(self) -> None: while self._running: try: await self._check_for_reversions() await asyncio.sleep(self.check_interval) # default: 30 seconds except Exception as exc: logger.error(f"Error in assignment monitor: {exc}") await asyncio.sleep(self.check_interval) ``` ### 2. Assignment Reconciler (`AssignmentReconciler`) **File:** `src/core/assignment_reconciliation.py` Runs at server startup (and on demand) to reconcile persistence with the live board. It walks every persisted assignment and every `IN_PROGRESS` task on the board, then decides what to do for each pair: | Persistence | Board Status | Board `assigned_to` | Action | |---|---|---|---| | Present | TODO | None | Remove assignment (task was reverted). | | Present | DONE | Different agent | Remove assignment (finished by someone else). | | Present | IN_PROGRESS | Same agent | Keep — valid assignment. | | Present | IN_PROGRESS | Different agent | Remove assignment (reassigned). | | Missing | IN_PROGRESS | Any agent | **Restore** — orphaned in-progress task. | | Present | Task missing | N/A | Remove assignment (task deleted). | Restored assignments do not carry `RecoveryInfo`, because the reconciler runs before any agent has asked for work and does not know whether the restored in-progress task actually needs recovery or is simply one that survived the restart. ### 3. Assignment Health Checker **File:** `src/monitoring/assignment_monitor.py` (bundled with `AssignmentMonitor`) Exposes `check_assignment_health()` for the `ping` health endpoint. It reports counts of persisted assignments, kanban-assigned tasks, and any mismatches between the two — a useful operational check because those mismatches are exactly what the monitor and reconciler exist to fix. ## When the Safety Net Fires In steady state, the lease system handles everything and the safety net does almost nothing. These components earn their keep at the seams: - **Server restart.** While Marcus is down, a lease monitor cannot run. On boot, the reconciler checks every persisted assignment against the live board and fixes whatever has drifted. - **External board edits.** A human (or another tool) resets a task to `TODO` on the board. The lease manager has no event for this. On the next 30-second tick the assignment monitor notices and clears the stale persistence entry. - **Non-lease code paths.** A code path updates the board status without touching the lease manager (e.g. a legacy integration or a bulk operation). The monitor catches the mismatch and cleans up. - **Persistence/board split-brain.** Whatever the board says wins. The reconciler treats the kanban board as the source of truth and updates persistence to match. ## Relationship to the Lease System | Concern | Primary Path | Safety Net | |---|---|---| | Detect dead agents | Lease monitor + cadence check | — | | Touch-on-any-tool liveness | Lease manager | — | | Reset task to TODO on board | Lease manager | — | | Build `RecoveryInfo` + handoff | Lease manager | — | | Clean in-memory `agent_tasks` | `on_recovery_callback` | — | | Clean stale persistence | Lease manager on recovery | Assignment monitor (reversions) | | Reconcile across server restart | — | Assignment reconciler | | Restore orphaned IN_PROGRESS | — | Assignment reconciler | | Detect board/persistence drift | — | Assignment health checker | The lease system owns the **live** path. The safety net owns the **consistency** path. They do not duplicate work, and the monitor does not try to generate recovery handoffs — if handoff context is needed, the task should go through a lease recovery, not a reversion cleanup. ## Configuration ```python monitor = AssignmentMonitor( persistence=assignment_persistence, kanban_client=kanban_client, check_interval=30, # seconds ) await monitor.start() ``` The `force_reconciliation()` method is on `AssignmentMonitor` (in `src/monitoring/assignment_monitor.py`), not on `AssignmentReconciler`. `AssignmentReconciler` only exposes `reconcile_assignments()`; `AssignmentMonitor.force_reconciliation()` calls that internally. Startup reconciliation is **not** automatic — it only runs when `force_reconciliation()` is explicitly called. The server initialization method is `_initialize_monitoring_systems` (not `_initialize_persistence`) and it sets up the monitor but does not trigger an immediate reconciliation pass. ## Limitations 1. The assignment monitor relies on kanban board polling. If the board provider is temporarily unavailable, the monitor logs the error and keeps retrying on the next interval. 2. The monitor performs **cleanup only**, not handoff. A task cleaned up by the monitor will be re-offered to agents without the Layer 1.1 recovery handoff in its instructions. If you need a handoff with `git merge` instructions for the next agent, the task should be recovered through the lease system while the server is live, not picked up by the reconciler after a restart. 3. The reconciler prefers the board as the source of truth. If the board state itself is wrong, the reconciler will propagate that wrongness into persistence. ## Related Documentation - [Resilience and Task Recovery System](34-agent-recovery-system.md) — the primary spec for agent liveness detection, cadence-based recovery, touch-on-any-tool-call, lease recreation, the recovery callback, the lazy monitor start, and the worktree-aware handoff flow. - [Assignment Lease System](35-assignment-lease-system.md) — lease data model, progressive timeout phases, and aggressive defaults (90s lease, 30s grace, 60s min, 5min max). - [Agent Coordination](21-agent-coordination.md) — task assignment flow and the assignment filter.