Smart Retry Strategy#

Overview#

Marcus uses an intelligent retry calculation system that prevents agents from sleeping through available work while minimizing unnecessary polling. The system prioritizes tasks that unlock parallel work and detects early task completions.

Problem Statement#

In multi-agent systems, idle agents need to know when to check back for new work. Traditional approaches have two failure modes:

Sleeping through tasks: Agent waits for full estimated completion time, but task finishes early and another agent takes the newly available work
Excessive polling: Agent checks too frequently, wasting resources and API calls

Additionally, agents may wake up for sequential work that current workers could easily handle, missing opportunities to wait for tasks that unlock parallel work better suited for multiple idle agents.

Solution: Smart Retry with Parallel Work Prioritization#

Marcus implements a two-part strategy:

1. Parallel Work Prioritization#

The system analyzes the dependency graph to determine which in-progress tasks will unlock the most parallel work:

# Count idle agents waiting for work
idle_agents = total_agents - busy_agents

# Prioritize tasks that unlock enough parallel work for idle agents
high_value_tasks = [
    task for task in in_progress_tasks
    if task.unlocks_count >= idle_agents
]

Benefits:

Agents wake up when parallel work becomes available
Prevents waking for sequential work current workers can handle
Maximizes utilization of idle agent capacity

2. Early Completion Detection#

Instead of waiting for the full estimated completion time, agents check back at 60% of the ETA with a 5-minute maximum:

retry_after = int(target_task["eta_seconds"] * 0.6)
retry_after = max(30, retry_after)  # Minimum 30 seconds
retry_after = min(retry_after, 300)  # Maximum 5 minutes

Benefits:

Catches tasks that finish faster than estimated
Regular re-polling for long-running tasks
Avoids excessive polling with minimum 30-second interval

How It Works#

Step 1: Calculate ETAs for In-Progress Tasks#

For each in-progress task, Marcus calculates estimated time to completion based on progress:

if progress > 0 and progress < 100:
    # Use actual progress to estimate
    estimated_total_seconds = (elapsed_seconds / progress) * 100
    remaining_seconds = estimated_total_seconds - elapsed_seconds
else:
    # Fall back to historical median
    remaining_seconds = global_median_hours * 3600

Example:

Task at 25% progress after 100 seconds
Estimated total: (100 / 25) × 100 = 400 seconds
Remaining: 400 - 100 = 300 seconds ETA

Step 2: Analyze Dependency Graph#

For each task, count how many tasks it will unlock:

dependent_task_ids = [
    t.id for t in project_tasks
    if task.id in (t.dependencies or [])
]
unlocks_count = len(dependent_task_ids)

Step 3: Prioritize High-Value Tasks#

Select tasks that unlock enough parallel work for idle agents:

# If we have tasks that unlock parallel work, prioritize those
# Otherwise fall back to any task completion
candidate_tasks = high_value_tasks if high_value_tasks else all_tasks

Step 4: Calculate Retry Time#

Apply the 60% rule with bounds:

retry_after = int(target_task["eta_seconds"] * 0.6)
retry_after = max(30, retry_after)    # Min 30s
retry_after = min(retry_after, 300)   # Max 5min

Example Scenarios#

Scenario 1: Prioritizing Parallel Work#

Setup:

2 agents total
1 agent busy
1 agent idle

In-Progress Tasks:

Task A: ETA 300s, unlocks 1 task (sequential)
Task B: ETA 400s, unlocks 2 tasks (parallel)

Decision:

Old logic: Wait for Task A (~330s with buffer)
New logic: Wait for Task B at 240s (60% of 400s)
Rationale: Task B unlocks enough work for the idle agent

Scenario 2: Early Completion Detection#

Setup:

Task ETA: 500 seconds (8.3 minutes)
Actual completion: 350 seconds (5.8 minutes)

Timeline:

Old logic: Wait 550s, miss the completion at 350s
New logic: Check at 300s (60%), catch completion at 350s
Benefit: Agent discovers work 200+ seconds earlier

Scenario 3: Long-Running Tasks#

Setup:

Task ETA: 1200 seconds (20 minutes)
Progress updates every few minutes

Behavior:

Calculated retry: 1200 × 0.6 = 720 seconds
Actual retry: 300 seconds (5-minute cap)
Result: Agent re-polls every 5 minutes to catch early completion or updated progress

Scenario 4: Fast Sequential Work#

Setup:

3 agents: 2 busy, 1 idle
Task A: ETA 60s, unlocks 1 task
Task B: ETA 90s, unlocks 1 task

Behavior:

No tasks unlock >= 1 parallel work slots
Falls back to soonest completion (Task A)
Retry: 60 × 0.6 = 36 seconds
Actual: 36 seconds (above 30s minimum)

Configuration#

The retry strategy uses these constants (in src/marcus_mcp/tools/task.py):

Parameter	Value	Purpose
`retry_percentage`	0.6 (60%)	Check at 60% of ETA for early completion
`min_retry_seconds`	30	Prevent excessive polling
`max_retry_seconds`	300 (5 min)	Regular re-polling for long tasks
`no_work_retry`	300 (5 min)	Default when no tasks in progress

Benefits#

Reduced Idle Time: Agents wake up at optimal times for available work
Better Resource Utilization: Prioritizes parallel work over sequential work
Early Detection: Catches tasks completing faster than estimated
Scalability: Adapts to varying numbers of agents and task patterns
Cost Efficiency: Avoids unnecessary polling while staying responsive

Implementation Details#

Location#

The smart retry logic is implemented in:

File: src/marcus_mcp/tools/task.py
Function: calculate_retry_after_seconds(state: Any) -> Dict[str, Any]
Lines: ~453

Return Value#

{
    "retry_after_seconds": 180,  # Time to wait
    "reason": "Waiting for 'Setup Database' to complete (~6 min, 40% done) (unlocks 2 tasks)",
    "blocking_task": {
        "id": "task-123",
        "name": "Setup Database",
        "progress": 40,
        "eta_seconds": 300
    }
}

Integration Points#

The retry calculation is called by:

request_next_task MCP tool when no suitable tasks are available
Task assignment logic when all agents are busy

Monitoring#

Logs#

Watch for retry decisions in Marcus logs:

[INFO] Agent agent-2 requesting next task
[INFO] No suitable tasks - retry in 120 seconds
[INFO] Reason: Waiting for 'API Implementation' to complete (~8 min, 30% done) (unlocks 3 tasks)

Metrics#

Track these metrics to evaluate retry effectiveness:

Average idle time: Time agents spend waiting for work
Missed opportunities: Tasks completed while agents slept
Polling frequency: How often agents check for work
Parallel utilization: Percentage of time multiple agents work simultaneously

Future Enhancements#

Potential improvements to the retry strategy:

Machine learning: Learn actual completion time patterns per task type
Agent skill matching: Factor in which agents can handle unlocked tasks
Priority weighting: Consider task priority in addition to parallelism
Dynamic percentage: Adjust retry percentage based on historical accuracy
Network awareness: Account for distributed agents with varying latency

References#

Resilience and Recovery: See Resilience and Task Recovery System
Task Assignment: See Assignment Lease System
Dependency Management: See Task Dependency System
Agent Coordination: See Agent Coordination
Optimal Agent Scheduling: See Optimal Agent Scheduling System