Marcus Error Framework System#
Executive Summary#
The Marcus Error Framework is an error handling system built for autonomous agents. It provides structured exception hierarchies, recovery strategies, circuit breakers, and real-time monitoring. Unlike traditional error handling designed for human-operated applications, it targets autonomous agent environments, where errors must be self-diagnosing and self-recovering and must carry actionable information for both automated retry logic and human escalation.
System Architecture#
Core Components#
The Error Framework consists of four primary modules operating in concert:
Marcus Error Framework Architecture
├── error_framework.py (Core Exception System)
│   ├── MarcusBaseError (Base Exception Class)
│   ├── ErrorContext (Rich Contextual Information)
│   ├── RemediationSuggestion (Recovery Guidance)
│   └── Error Type Hierarchy (Domain-Specific Exceptions)
├── error_strategies.py (Recovery Strategies)
│   ├── RetryHandler (Exponential Backoff & Jitter)
│   ├── CircuitBreaker (Cascade Failure Prevention)
│   ├── FallbackHandler (Graceful Degradation)
│   └── ErrorAggregator (Batch Operation Handling)
├── error_responses.py (Format Adapters)
│   ├── MCP Protocol Responses
│   ├── JSON API Responses
│   ├── User-Friendly Messages
│   └── Logging & Monitoring Formats
└── error_monitoring.py (Intelligence & Analytics)
    ├── Pattern Detection Engine
    ├── Correlation Analysis
    ├── Health Scoring Algorithm
    └── Alert Management System
Error Type Taxonomy#
The framework implements a sophisticated six-tier error classification system:
Tier 1: Transient Errors (Auto-recoverable)
NetworkTimeoutError: Network operations exceeding time limits
ServiceUnavailableError: Temporary external service outages
RateLimitError: API rate limit violations with retry timing
TemporaryResourceError: Temporary system resource exhaustion
Tier 2: Configuration Errors (User-resolvable)
MissingCredentialsError: Absent authentication credentials
InvalidConfigurationError: Malformed configuration values
MissingDependencyError: Required dependencies not installed
EnvironmentError: Incorrect environment setup
Tier 3: Business Logic Errors (Logic violations)
TaskAssignmentError: Task allocation conflicts or impossibilities
WorkflowViolationError: Workflow state machine violations
ValidationError: Data validation failures
StateConflictError: System state inconsistencies
TaskValidationError: Task-specific validation failures (e.g., invalid task fields)
ProjectRootNotFoundError: Unable to locate the project root directory
Tier 4: Integration Errors (External service issues)
KanbanIntegrationError: Kanban board connectivity/operation failures
AIProviderError: AI service integration failures
AuthenticationError: External service authentication failures
ExternalServiceError: Generic external service errors
Tier 5: Security Errors (Critical security events)
AuthorizationError: Permission/authorization violations
WorkspaceSecurityError: Workspace isolation breaches
PermissionError: File/resource permission violations
Tier 6: System Errors (Critical infrastructure failures)
ResourceExhaustionError: System resource depletion
CorruptedStateError: Data corruption detection
DatabaseError: Database operation failures
CriticalDependencyError: Essential system component failures
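One way to picture how the tiers could drive automated decisions is a small class hierarchy in which each tier sets a default retry flag. This is a minimal sketch, not the framework's code: the stand-in MarcusBaseError, the intermediate TransientError and ConfigurationError classes, the class-level retryable and tier attributes, and should_retry are all assumptions made for illustration.

```python
class MarcusBaseError(Exception):
    """Stand-in base class; the real one lives in error_framework.py."""

    retryable = False      # assumption: tiers override this default
    tier = "unclassified"  # hypothetical label used only in this sketch


class TransientError(MarcusBaseError):
    """Tier 1: auto-recoverable, so retrying is reasonable."""

    retryable = True
    tier = "transient"


class ConfigurationError(MarcusBaseError):
    """Tier 2: needs user action, so retrying does not help."""

    tier = "configuration"


class NetworkTimeoutError(TransientError):
    pass


class MissingCredentialsError(ConfigurationError):
    pass


def should_retry(error: Exception) -> bool:
    """Retry only framework errors that declare themselves retryable."""
    return isinstance(error, MarcusBaseError) and error.retryable


print(should_retry(NetworkTimeoutError("request timed out")))
print(should_retry(MissingCredentialsError("no API key configured")))
```

Keeping the flag on the tier class means a new error type inherits sensible retry behavior simply by choosing its parent.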
Marcus Ecosystem Integration#
Position in System Architecture#
The Error Framework operates as a cross-cutting concern throughout the Marcus ecosystem:
MCP Server Layer: All MCP tool calls are wrapped with error handling for consistent response formatting
Agent Workflow Layer: Task assignment, progress reporting, and blocker handling utilize error context and retry strategies
Integration Layer: External service calls (Kanban, AI providers) are protected by circuit breakers and fallback mechanisms
Core Processing Layer: Business logic operations leverage validation and state conflict detection
Monitoring Layer: All errors feed into the monitoring system for pattern analysis and health scoring
Workflow Integration Points#
The Error Framework intercepts and enhances error handling at key workflow stages:
Typical Agent Workflow Error Integration:
create_project → register_agent → request_next_task → report_progress → report_blocker → finish_task

create_project: Configuration Errors → No Retry
register_agent: Business Logic Errors → Validation & State Checking
request_next_task: Integration Errors → Circuit Breaker & Retry Logic
report_progress: Business Logic Errors → Context Logging & Monitoring
report_blocker: Integration Errors → AI-Powered Suggestions
finish_task: Integration Errors → Final Cleanup & Reporting
Error Context System#
Rich Contextual Information#
Every Marcus error carries comprehensive context through the ErrorContext class:
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional
from uuid import uuid4

@dataclass
class ErrorContext:
    # default_factory gives every instance its own ID and timestamp
    # Operation identification
    operation: str = ""  # What was being attempted
    operation_id: str = field(default_factory=lambda: str(uuid4()))  # Unique operation identifier
    # Agent context
    agent_id: Optional[str] = None  # Which agent encountered the error
    task_id: Optional[str] = None  # Current task being processed
    agent_state: Optional[Dict] = None  # Agent's current state snapshot
    # System context
    timestamp: datetime = field(default_factory=datetime.now)  # When error occurred
    correlation_id: str = field(default_factory=lambda: str(uuid4()))  # For tracing related operations
    system_state: Optional[Dict] = None  # System resource state
    # Integration context
    integration_name: Optional[str] = None  # External service involved
    integration_state: Optional[Dict] = None  # Service-specific state
    # Extensible context
    user_context: Optional[Dict] = None  # User-specific information
    custom_context: Optional[Dict] = None  # Operation-specific data
Context Automation#
The framework provides automatic context injection through the error_context context manager:
with error_context("kanban_sync", agent_id="agent_123", task_id="task_456"):
    # Any MarcusBaseError raised here automatically includes context
    sync_task_with_kanban()
Intelligence & Recovery Strategies#
Retry Logic with Exponential Backoff#
The retry system implements sophisticated backoff strategies.
Production implementation is in src/core/resilience.py:
@dataclass
class RetryConfig:
    max_attempts: int = 3  # Maximum retry attempts
    base_delay: float = 1.0  # Initial delay in seconds
    max_delay: float = 60.0  # Maximum delay cap
    exponential_base: float = 2.0  # Backoff multiplier
    jitter: bool = True  # Add randomization
Note:
src/core/error_strategies.py contains a more fully-featured RetryConfig with additional fields (retry_on, stop_on, multiplier), but this module is NOT integrated into the production codebase. The resilience.py version above is what is actually used.
Jitter Algorithm (actual implementation using secrets.SystemRandom()):
delay *= 0.5 + secure_random.random() → gives 50%-150% of the calculated delay,
where secure_random = secrets.SystemRandom() (cryptographically secure).
Benefits:
Prevents thundering herd problems
Configurable per operation type
Cryptographically secure jitter
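The backoff and jitter described above can be sketched as a single delay function. This is an illustration under stated assumptions, not the code in src/core/resilience.py: the compute_delay name and its 0-based attempt argument are hypothetical, and applying the max_delay cap before the jitter multiplier is a guess, since the source does not state the order.

```python
import secrets
from dataclasses import dataclass


@dataclass
class RetryConfig:
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 60.0
    exponential_base: float = 2.0
    jitter: bool = True


_secure_random = secrets.SystemRandom()


def compute_delay(attempt: int, config: RetryConfig) -> float:
    """Seconds to wait before retry number `attempt` (0-based, hypothetical)."""
    # Exponential growth, capped at max_delay (cap-before-jitter is an assumption).
    delay = min(config.base_delay * config.exponential_base ** attempt,
                config.max_delay)
    if config.jitter:
        # Multiplier in [0.5, 1.5), matching the jitter line quoted above.
        delay *= 0.5 + _secure_random.random()
    return delay


no_jitter = RetryConfig(jitter=False)
print([compute_delay(a, no_jitter) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
```

With jitter enabled, two agents that fail at the same moment are unlikely to retry at the same moment, which is what avoids the thundering herd.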
Circuit Breaker Pattern#
Prevents cascading failures through intelligent service protection.
Production implementation is the CircuitBreaker class in src/core/resilience.py.
State is tracked as a plain string attribute ("closed", "open", "half-open").
Note:
src/core/error_strategies.py has a CircuitBreakerState enum (singular), but that module is NOT integrated into production. There is no class called CircuitBreakerStates (plural) anywhere in the codebase.
State Transitions:
"closed" → "open": After failure_threshold consecutive failures
"open" → "half-open": After recovery_timeout duration expires
"half-open" → "closed": On successful test call
"half-open" → "open": On any failure during testing
Autonomous Benefits:
Prevents agents from hammering failing services
Automatic recovery testing
Service health awareness
Resource conservation
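The state machine above can be sketched in a few lines. This is a simplified illustration, not the CircuitBreaker class from src/core/resilience.py: SimpleCircuitBreaker, CircuitOpenError, the injectable clock, and the default recovery_timeout of 30 seconds are assumptions. It does keep state as a plain string, as the text describes.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Hypothetical error raised when an open circuit rejects a call."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests need not sleep
        self.state = "closed"
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # timeout expired: allow one test call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller wraps each external request as breaker.call(fetch_board, board_id), where fetch_board is whatever function talks to the service; while the circuit is open the request fails fast without touching the network.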
Fallback Mechanisms#
Graceful degradation through priority-ordered fallback functions:
fallback_handler = FallbackHandler("task_creation")
fallback_handler.add_fallback(create_task_locally, priority=1) # Try first
fallback_handler.add_fallback(queue_for_later, priority=2) # Then this
fallback_handler.add_fallback(use_cached_template, priority=3) # Finally this
Fallback Strategy Selection:
Primary function attempted
On failure, fallbacks tried in priority order
First successful fallback result returned
If all fail, cached results used if available
If no cache, enhanced error with exhausted fallback information
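The selection order above can be sketched as follows. This is not the framework's FallbackHandler: SimpleFallbackHandler, its execute method, and FallbacksExhaustedError are hypothetical names, and treating "cached results" as the last successful result of this handler is an assumption.

```python
from typing import Any, Callable, List, Tuple


class FallbacksExhaustedError(Exception):
    """Hypothetical error carrying every failure seen along the chain."""

    def __init__(self, operation: str, failures: List[Exception]):
        super().__init__(f"{operation}: all {len(failures)} attempts failed")
        self.failures = failures


class SimpleFallbackHandler:
    _MISSING = object()  # sentinel so a cached None is still a valid result

    def __init__(self, operation: str):
        self.operation = operation
        self._fallbacks: List[Tuple[int, Callable[[], Any]]] = []
        self._cached: Any = self._MISSING

    def add_fallback(self, func: Callable[[], Any], priority: int) -> None:
        self._fallbacks.append((priority, func))
        self._fallbacks.sort(key=lambda item: item[0])  # lowest number first

    def execute(self, primary: Callable[[], Any]) -> Any:
        failures: List[Exception] = []
        for func in [primary] + [f for _, f in self._fallbacks]:
            try:
                result = func()
            except Exception as exc:
                failures.append(exc)
                continue
            self._cached = result  # assumption: remember the last success
            return result
        if self._cached is not self._MISSING:
            return self._cached
        raise FallbacksExhaustedError(self.operation, failures)
```

Collecting every exception in the final error keeps the full failure chain available for the monitoring side, rather than only the last fallback's error.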
Real-Time Monitoring & Pattern Detection#
Error Pattern Detection#
The monitoring system identifies four categories of error patterns:
1. Frequency Patterns: Same error type occurring repeatedly
Threshold: 5+ occurrences within 10 minutes
Detection: Error type fingerprinting
Action: Pattern alert with error type analysis
2. Burst Patterns: High error volume in short timeframe
Threshold: 10+ errors within 5 minutes (any type)
Detection: Time-window error counting
Action: System stability alert
3. Agent-Specific Patterns: High error rate from individual agents
Threshold: 20+ errors from single agent within 30 minutes
Detection: Agent ID error aggregation
Action: Agent health check recommendation
4. Cascade Patterns: Related errors occurring in sequence
Threshold: 3+ similar errors with 70%+ similarity within 5 minutes
Detection: Multi-dimensional error similarity scoring
Action: Root cause investigation trigger
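As an illustration of the first category, a frequency pattern can be detected with a sliding time window per error type. This is a minimal sketch: FrequencyDetector and its record method are hypothetical, only the 5-in-10-minutes threshold comes from the text, and the sketch assumes errors are recorded in timestamp order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict


class FrequencyDetector:
    def __init__(self, threshold: int = 5,
                 window: timedelta = timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self._seen: Dict[str, Deque[datetime]] = defaultdict(deque)

    def record(self, error_type: str, when: datetime) -> bool:
        """Record one occurrence; return True once the threshold is reached."""
        times = self._seen[error_type]
        times.append(when)
        # Drop occurrences that have fallen out of the window.
        while times and when - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


detector = FrequencyDetector()
start = datetime(2025, 7, 14, 15, 0, 0)
flags = [detector.record("NetworkTimeoutError", start + timedelta(minutes=i))
         for i in range(5)]
print(flags)  # [False, False, False, False, True]
```

The burst and agent-specific patterns follow the same shape with a different key (none, or the agent ID) and different thresholds.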
Error Similarity Algorithm#
The framework calculates error similarity using weighted factors:
def calculate_similarity(error1, error2) -> float:
    factors = []
    if error1.error_type == error2.error_type:
        factors.append(0.4)  # 40% weight
    if error1.operation == error2.operation:
        factors.append(0.3)  # 30% weight
    if error1.integration == error2.integration:
        factors.append(0.2)  # 20% weight
    # Timestamps are datetimes; compare the gap in seconds
    if abs((error1.timestamp - error2.timestamp).total_seconds()) < 60:
        factors.append(0.1)  # 10% weight
    return sum(factors)  # 0.0 to 1.0 similarity score
Health Scoring Algorithm#
System health is a score from 0 to 100, computed by starting at 100 and subtracting penalties:
health_score = 100
if error_rate_per_minute > 10: health_score -= 30
elif error_rate_per_minute > 5: health_score -= 15
elif error_rate_per_minute > 2: health_score -= 5
if critical_errors > 0: health_score -= 25
health_score -= active_patterns * 10 # 10 points per active pattern
health_score = max(0, health_score)
Health Status Mapping:
90-100: Excellent
75-89: Good
50-74: Fair
25-49: Poor
0-24: Critical
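Putting the penalty rules and the status bands together, a worked example helps check the arithmetic. The function names compute_health_score and health_status are hypothetical; the thresholds and bands are the ones listed above.

```python
def compute_health_score(error_rate_per_minute: float,
                         critical_errors: int,
                         active_patterns: int) -> int:
    """Start at 100 and subtract the penalties described above."""
    score = 100
    if error_rate_per_minute > 10:
        score -= 30
    elif error_rate_per_minute > 5:
        score -= 15
    elif error_rate_per_minute > 2:
        score -= 5
    if critical_errors > 0:
        score -= 25
    score -= active_patterns * 10  # 10 points per active pattern
    return max(0, score)


def health_status(score: int) -> str:
    """Map a 0-100 score onto the status bands."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    if score >= 25:
        return "Poor"
    return "Critical"


# 6 errors/min (-15), one critical error (-25), two patterns (-20): 100 - 60 = 40
score = compute_health_score(6, 1, 2)
print(score, health_status(score))  # 40 Poor
```

Note that the critical-error penalty is flat: one critical error costs the same 25 points as ten.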
Response Format Adapters#
MCP Protocol Format#
Optimized for Claude Code agent consumption:
{
  "success": false,
  "error": {
    "code": "KANBAN_INTEGRATION_ERROR",
    "message": "Failed to create task on board 'Development'",
    "type": "KanbanIntegrationError",
    "severity": "medium",
    "retryable": true,
    "context": {
      "operation": "create_task",
      "correlation_id": "corr_abc123",
      "agent_id": "agent_dev_001",
      "task_id": "task_456"
    },
    "remediation": {
      "immediate": "Retry task creation with exponential backoff",
      "fallback": "Create task locally and sync when service recovers",
      "retry": "Automatic retry in 2.5 seconds (attempt 2/3)"
    }
  }
}
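A response of this shape could be assembled from an exception roughly as follows. This is a sketch only: to_mcp_response and its parameters are hypothetical and do not describe the adapter in error_responses.py, and the remediation block is left out for brevity.

```python
import json
from typing import Any, Dict, Optional


def to_mcp_response(error: Exception, code: str, severity: str = "medium",
                    retryable: bool = False,
                    context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an MCP-style error payload (hypothetical helper)."""
    return {
        "success": False,
        "error": {
            "code": code,
            "message": str(error),
            "type": type(error).__name__,
            "severity": severity,
            "retryable": retryable,
            "context": dict(context or {}),
        },
    }


response = to_mcp_response(
    TimeoutError("Failed to create task on board 'Development'"),
    code="KANBAN_INTEGRATION_ERROR",
    retryable=True,
    context={"operation": "create_task", "task_id": "task_456"},
)
print(json.dumps(response, indent=2))
```

Because the payload is built from plain dicts, strings, and booleans, it serializes with json.dumps without a custom encoder.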
User-Friendly Format#
Human-readable error presentation:
Unable to create task on Kanban board due to service timeout.
What to do: The system will retry automatically in 30 seconds
Alternative: Task has been created locally and will sync when the service recovers
Retry: This is attempt 2 of 3 - if all attempts fail, the task will remain in local queue
Logging Format#
Structured for log analysis and debugging:
{
  "level": "error",
  "timestamp": "2025-07-14T15:30:45.123Z",
  "error_code": "KANBAN_INTEGRATION_ERROR",
  "error_type": "KanbanIntegrationError",
  "correlation_id": "corr_abc123",
  "operation": "create_task",
  "agent_id": "agent_dev_001",
  "task_id": "task_456",
  "integration": "planka_board",
  "retryable": true,
  "severity": "medium",
  "caused_by": "requests.exceptions.Timeout",
  "custom_context": {
    "board_id": "board_789",
    "task_title": "Implement user authentication"
  }
}
Workflow Stage Integration#
create_project Stage#
Error Types: Configuration, Validation
Handling:
Immediate validation of project parameters
Configuration error detection and user guidance
No retries (user input required)
Clear remediation instructions
register_agent Stage#
Error Types: Business Logic, System
Handling:
Agent ID validation and conflict detection
Capability matching verification
State initialization error recovery
Agent registry consistency checking
request_next_task Stage#
Error Types: Integration, Business Logic, Transient
Handling:
Kanban integration with circuit breaker protection
Task assignment algorithm error recovery
AI-powered task matching with fallbacks
Dependency conflict resolution
report_progress Stage#
Error Types: Integration, Validation, Transient
Handling:
Progress validation against task constraints
Kanban sync with retry logic
State consistency verification
Context preservation for correlation
report_blocker Stage#
Error Types: Integration, AI Provider, Business Logic
Handling:
AI-powered suggestion generation with fallbacks
Blocker classification and severity assessment
Escalation path determination
Context aggregation for pattern analysis
finish_task Stage#
Error Types: Integration, System, Validation
Handling:
Task completion validation
Final state synchronization
Cleanup operation error handling
Correlation group closure
Simple vs Complex Task Handling#
Simple Tasks (< 3 dependencies, single agent)#
Error Strategy:
Basic retry logic (3 attempts, 1s base delay)
Simple circuit breaker (5 failure threshold)
Minimal context collection
Standard monitoring
@with_retry(RetryConfig(max_attempts=3, base_delay=1.0))
async def handle_simple_task():
    # Lightweight error handling
    pass
Complex Tasks (> 3 dependencies, multi-agent coordination)#
Error Strategy:
Enhanced retry logic (5 attempts, 2s base delay)
Sensitive circuit breaker (3 failure threshold)
Rich context collection including dependency state
Enhanced monitoring with pattern correlation
@with_retry(RetryConfig(max_attempts=5, base_delay=2.0))
async def handle_complex_task():
    with error_context("complex_task",
                       agent_id=agent_id,
                       task_id=task_id,
                       custom_context={"dependencies": dep_list}):
        # Enhanced error handling with dependency awareness
        pass
Complex Task Enhancements:
Dependency state tracking in error context
Multi-agent error correlation
Cascade failure prevention
Enhanced pattern detection sensitivity
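The two profiles can be captured as data and chosen per task. This is a hedged sketch: ErrorStrategy and select_strategy are hypothetical names, and because the headings above leave a task with exactly three dependencies unclassified, treating it as complex is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorStrategy:
    max_attempts: int
    base_delay: float       # seconds
    failure_threshold: int  # circuit breaker trips after this many failures


SIMPLE = ErrorStrategy(max_attempts=3, base_delay=1.0, failure_threshold=5)
COMPLEX = ErrorStrategy(max_attempts=5, base_delay=2.0, failure_threshold=3)


def select_strategy(dependency_count: int, multi_agent: bool) -> ErrorStrategy:
    """Pick a profile; the boundary at exactly 3 dependencies is an assumption."""
    if multi_agent or dependency_count >= 3:
        return COMPLEX
    return SIMPLE


print(select_strategy(dependency_count=1, multi_agent=False))
print(select_strategy(dependency_count=4, multi_agent=True))
```

Keeping the profiles as frozen dataclasses makes them safe to share between agents and easy to log alongside an error.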
Board-Specific Considerations#
Kanban Provider Abstraction#
The Error Framework integrates with Marcus's Kanban provider abstraction layer:
Planka Provider Errors:
Connection timeouts: 30s timeout with 3 retries
Authentication failures: No retry, immediate credential refresh
Rate limiting: Exponential backoff with provider-specific limits
Board access errors: Permission validation and fallback board selection
Generic Kanban Errors:
Provider detection and capability matching
Failover between multiple configured providers
Provider-specific error code translation
Board synchronization conflict resolution
Board State Consistency#
Error Scenarios:
Task creation conflicts during multi-agent operations
Board state drift during network partitions
Concurrent modification conflicts
Board access permission changes
Resolution Strategies:
Optimistic locking with conflict detection
Last-writer-wins with conflict notification
Manual merge conflict resolution
Fallback to local state with delayed sync
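The first strategy, optimistic locking with conflict detection, can be illustrated with a version counter on each task. This is a toy sketch: InMemoryBoard and BoardTask are hypothetical, StateConflictError here is a local stand-in for the framework's class, and a real Kanban provider would need the version check to happen on the server side.

```python
from dataclasses import dataclass
from typing import Dict


class StateConflictError(Exception):
    """Stand-in for the framework's StateConflictError."""


@dataclass(frozen=True)
class BoardTask:
    title: str
    version: int = 0


class InMemoryBoard:
    """Hypothetical board store used only to illustrate version checks."""

    def __init__(self) -> None:
        self._tasks: Dict[str, BoardTask] = {}

    def create(self, task_id: str, title: str) -> BoardTask:
        self._tasks[task_id] = BoardTask(title)
        return self._tasks[task_id]

    def read(self, task_id: str) -> BoardTask:
        return self._tasks[task_id]

    def update(self, task_id: str, title: str, expected_version: int) -> BoardTask:
        current = self._tasks[task_id]
        # Reject the write if someone else updated the task since it was read.
        if current.version != expected_version:
            raise StateConflictError(
                f"{task_id}: expected version {expected_version}, "
                f"found {current.version}"
            )
        updated = BoardTask(title, current.version + 1)
        self._tasks[task_id] = updated
        return updated
```

When two agents read the same task and both try to write, the second update raises StateConflictError instead of silently overwriting the first, which is the point where one of the other strategies (notification, manual merge, or local fallback) can take over.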
Technical Implementation Details#
Error Context Propagation#
The error_context context manager is implemented in src/core/error_framework.py as a plain
@contextmanager. It modifies a MarcusBaseError in-place if one is raised within the block.
There is no ContextVar, no current_error_context variable, and no token-based reset.
from contextlib import contextmanager

@contextmanager
def error_context(operation: str, **context_kwargs):
    try:
        yield
    except MarcusBaseError as e:
        # Inject context fields into the error in-place
        if not e.context.operation:
            e.context.operation = operation
        for key, value in context_kwargs.items():
            setattr(e.context, key, value)
        raise
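To see the in-place injection at work, the following self-contained sketch repeats the context manager with minimal stand-ins for MarcusBaseError and ErrorContext (both simplified here; the real classes carry more fields). The capture_context and capture_nested_context helpers are hypothetical and exist only to show the result.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorContext:  # simplified stand-in
    operation: str = ""
    agent_id: Optional[str] = None
    task_id: Optional[str] = None


class MarcusBaseError(Exception):  # simplified stand-in
    def __init__(self, message: str, context: Optional[ErrorContext] = None):
        super().__init__(message)
        self.context = context or ErrorContext()


@contextmanager
def error_context(operation: str, **context_kwargs):
    try:
        yield
    except MarcusBaseError as e:
        if not e.context.operation:
            e.context.operation = operation
        for key, value in context_kwargs.items():
            setattr(e.context, key, value)
        raise


def capture_context() -> ErrorContext:
    try:
        with error_context("kanban_sync", agent_id="agent_123", task_id="task_456"):
            raise MarcusBaseError("board unreachable")
    except MarcusBaseError as err:
        return err.context


def capture_nested_context() -> ErrorContext:
    try:
        with error_context("outer", agent_id="outer_agent"):
            with error_context("inner", agent_id="inner_agent"):
                raise MarcusBaseError("board unreachable")
    except MarcusBaseError as err:
        return err.context


print(capture_context())
print(capture_nested_context())
```

The nested case shows a consequence of the logic above: the innermost operation name is kept because it is only set when empty, while keyword fields are overwritten by each enclosing block, so the outermost agent_id wins.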
Memory Management#
The monitoring system implements intelligent memory management:
from collections import deque
from datetime import timedelta

class ErrorMonitor:
    def __init__(self):
        self.error_history: deque = deque(maxlen=10000)  # Ring buffer
        self.pattern_cleanup_threshold = timedelta(days=7)
        self.correlation_timeout = timedelta(hours=24)

    def _cleanup_old_data(self):
        # Automatic cleanup of old patterns and correlations
        # Prevents memory leaks in long-running agents
        pass
Async/Sync Compatibility#
The framework provides seamless async/sync compatibility:
import asyncio
from functools import wraps
from typing import Optional

def with_retry(config: Optional[RetryConfig] = None):
    def decorator(func):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            # Async implementation
            pass

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            # Sync implementation using asyncio.run()
            pass

        # Pick the wrapper that matches the decorated function
        return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper
    return decorator
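A filled-in version of this skeleton might look like the sketch below. It is an illustration under assumptions, not the framework's decorator: it retries on every Exception, uses a trimmed RetryConfig without jitter or a delay cap, and its sync path calls the function directly with time.sleep rather than going through asyncio.run().

```python
import asyncio
import functools
import inspect
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryConfig:  # trimmed to the fields this sketch uses
    max_attempts: int = 3
    base_delay: float = 1.0


def with_retry(config: Optional[RetryConfig] = None):
    cfg = config or RetryConfig()

    def decorator(func):
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                for attempt in range(1, cfg.max_attempts + 1):
                    try:
                        return await func(*args, **kwargs)
                    except Exception:
                        if attempt == cfg.max_attempts:
                            raise  # out of attempts: surface the last error
                        await asyncio.sleep(cfg.base_delay * 2 ** (attempt - 1))
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            for attempt in range(1, cfg.max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == cfg.max_attempts:
                        raise
                    time.sleep(cfg.base_delay * 2 ** (attempt - 1))
        return sync_wrapper

    return decorator
```

Choosing the wrapper once at decoration time means an async function stays awaitable and a sync function stays callable, so callers do not need to know the decorator is there.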
Serialization Safety#
All error data structures are designed for safe JSON serialization:
@dataclass
class ErrorContext:
    # (fields omitted; see the ErrorContext definition above)
    def to_dict(self) -> Dict[str, Any]:
        return {
            'operation': self.operation,
            'timestamp': self.timestamp.isoformat(),
            'custom_context': self.custom_context or {}
        }
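Since custom_context can hold arbitrary values, a to_dict result is only JSON-safe if those values are. One possible guard is a recursive converter like the sketch below; json_safe is a hypothetical helper, and falling back to repr() for unknown objects is an assumption rather than documented framework behavior.

```python
import json
from datetime import datetime
from typing import Any


def json_safe(value: Any) -> Any:
    """Recursively convert a value into something json.dumps accepts."""
    if isinstance(value, dict):
        return {str(key): json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [json_safe(item) for item in value]
    if isinstance(value, datetime):
        return value.isoformat()
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return repr(value)  # assumption: unknown objects degrade to a string


payload = json_safe({
    "operation": "create_task",
    "timestamp": datetime(2025, 7, 14, 15, 30, 45),
    "custom_context": {"board_ids": ("board_789",), "client": object()},
})
print(json.dumps(payload))
```

Degrading to a string loses structure, but it guarantees that building the error response can never itself raise a serialization error.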
Pros and Cons Analysis#
Advantages#
1. Autonomous Agent Optimization
Self-diagnosing errors with actionable remediation
Automatic retry and recovery strategies
Context-aware error handling
Pattern detection for proactive issue identification
2. Comprehensive Error Intelligence
Rich contextual information for debugging
Multi-dimensional error correlation
Real-time health monitoring and scoring
Predictive pattern analysis
3. Integration-Friendly Design
Multiple response format adapters
Seamless legacy code integration
Configurable retry and circuit breaker policies
Extensible error type hierarchy
4. Production-Ready Features
Memory-efficient monitoring with cleanup
Thread-safe and async-compatible
Structured logging integration
Security-conscious sensitive data handling
5. Developer Experience
Decorator-based easy integration
Context manager automatic error enhancement
Clear error type classification
Comprehensive remediation suggestions
Disadvantages#
1. Complexity Overhead
Substantial codebase complexity for simple applications
Learning curve for new developers
Additional memory footprint for monitoring
Configuration complexity for advanced features
2. Performance Considerations
Context collection overhead on every error
Monitoring system background processing
Pattern detection computational cost
Serialization overhead for error responses
3. Framework Lock-in
Marcus-specific error types create vendor lock-in
Migration complexity from existing error handling
Dependency on Marcus ecosystem components
Framework-specific debugging knowledge required
4. Configuration Complexity
Multiple configuration layers (retry, circuit breaker, monitoring)
Environment-specific tuning requirements
Provider-specific error mapping complexity
Fine-tuning required for optimal performance
Design Rationale#
Why This Approach Was Chosen#
1. Autonomous Agent Requirements
Traditional error handling systems assume human operators who can interpret error messages and take corrective action. Marcus agents require:
Machine-interpretable error classifications
Automatic recovery strategies
Rich context for correlation across operations
Predictive pattern analysis for proactive issue resolution
2. Microservices-Style Error Handling
The framework treats each component (Kanban integration, AI providers, task assignment) as an independent service requiring:
Circuit breaker protection against cascade failures
Service-specific retry strategies
Fallback mechanisms for graceful degradation
Health monitoring and automatic service discovery
3. Observable System Design
Error handling is treated as a first-class observability concern:
Every error contributes to system health understanding
Pattern detection enables proactive issue resolution
Error correlation provides root cause analysis
Health scoring guides system optimization
4. Developer Experience Priority
Balancing power with usability:
Decorator-based integration for minimal code changes
Context managers for automatic error enhancement
Clear error type hierarchy for easy classification
Multiple response formats for different consumption patterns
Alternative Approaches Considered#
1. Simple Exception Hierarchy
Rejected: Insufficient for autonomous agent needs
No automatic retry logic
No context preservation
No pattern detection capabilities
No service protection mechanisms
2. External Error Management Service
Rejected: Added complexity and latency
Network dependency for error handling
Additional service to maintain and monitor
Latency impact on error processing
Single point of failure
3. Framework-Agnostic Error Handling
Rejected: Generic solutions lack domain specificity
No Marcus-specific error types
No integration with agent workflow
No Kanban provider awareness
No AI provider error handling
Evolution and Future Roadmap#
Short-term Evolution (3-6 months)#
1. Enhanced AI Integration
GPT-4 powered error analysis and remediation suggestions
Automatic root cause analysis using error correlations
Predictive error modeling based on historical patterns
Context-aware error severity adjustment
2. Advanced Pattern Detection
Machine learning-based pattern recognition
Seasonal and cyclical error pattern detection
Cross-agent error correlation analysis
Predictive failure forecasting
3. Performance Optimizations
Streaming error data processing
Compressed error history storage
Lazy error context evaluation
Background pattern analysis
Medium-term Evolution (6-12 months)#
1. Distributed Error Management
Multi-instance error correlation
Distributed circuit breaker coordination
Global system health aggregation
Cross-deployment error pattern sharing
2. Self-Healing Capabilities
Automatic configuration adjustment based on error patterns
Dynamic retry strategy optimization
Self-tuning circuit breaker thresholds
Autonomous remediation action execution
3. Advanced Monitoring Integration
Prometheus metrics export
Grafana dashboard templates
AlertManager integration
Custom metric collection and analysis
Long-term Evolution (1+ years)#
1. Predictive Error Prevention
Pre-error condition detection
Proactive remediation action triggering
Resource usage prediction and scaling
Failure cascade prevention
2. Cross-System Error Learning
Error pattern sharing between Marcus instances
Community-driven error knowledge base
Automated error handling best practice evolution
Cross-domain error pattern recognition
3. Advanced Recovery Strategies
AI-powered custom recovery strategy generation
Dynamic fallback chain optimization
Context-aware recovery strategy selection
Self-evolving error handling policies
Integration Examples#
MCP Tool Integration#
async def mcp_create_task(arguments: Dict[str, Any]) -> Dict[str, Any]:
    try:
        with error_context("mcp_create_task",
                           custom_context={"tool": "create_task", "args": arguments}):
            result = await task_service.create_task(arguments)
            return {"success": True, "result": result}
    except Exception as e:
        return handle_mcp_tool_error(e, "create_task", arguments)
Agent Workflow Integration#
@with_retry(RetryConfig(max_attempts=3))
@with_circuit_breaker("kanban_service")
async def sync_agent_progress(agent_id: str, task_id: str, progress: int):
    with error_context("progress_sync", agent_id=agent_id, task_id=task_id):
        await kanban_provider.update_task_progress(task_id, progress)
        record_agent_event("progress_updated", agent_id, {"task_id": task_id, "progress": progress})
Legacy Code Migration#
# Before: Basic error handling
try:
    result = external_service_call()
    return {"success": True, "data": result}
except Exception as e:
    logger.error(f"Service call failed: {e}")
    return {"success": False, "error": str(e)}

# After: Marcus Error Framework
@with_retry()
@with_circuit_breaker("external_service")
async def safe_external_service_call():
    with error_context("external_service_call"):
        return await external_service_call()

try:
    result = await safe_external_service_call()
    return {"success": True, "data": result}
except Exception as e:
    return create_error_response(e, ResponseFormat.MCP)
Conclusion#
The Marcus Error Framework represents a paradigm shift from reactive error handling to proactive error intelligence. By treating errors as valuable system intelligence rather than mere exceptions, the framework enables autonomous agents to operate more reliably, recover more intelligently, and provide better visibility into system health.
The framework's multi-tiered approach, from simple retry logic to sophisticated pattern detection, allows it to scale from basic error recovery to advanced system intelligence. Its integration with the broader Marcus ecosystem ensures that error handling is not an afterthought but a core system capability that enhances every aspect of autonomous agent operation.
As Marcus continues to evolve toward more sophisticated autonomous operation, the Error Framework provides the foundation for self-healing, self-monitoring, and self-optimizing agent systems that can operate reliably in complex, distributed environments with minimal human intervention.