Marcus Testing Framework#
Overview#
The Marcus Testing Framework is a comprehensive, multi-layered testing system designed specifically for autonomous agent environments. It provides structured test organization, rich error handling, and intelligent test automation to ensure reliable operation of the Marcus ecosystem.
What the System Does#
The Testing Framework provides:
Structured Test Organization: Clear separation of unit, integration, performance, and future feature tests
Autonomous Agent Testing: Specialized fixtures and utilities for testing agent interactions
Rich Error Context: Integration with Marcus Error Framework for detailed failure analysis
Intelligent Test Discovery: Automated test placement based on dependency analysis
Performance Benchmarking: Comprehensive performance testing with scaling metrics
TDD Support: Future features testing for test-driven development workflows
System Architecture#
Marcus Testing Framework
├── Test Organization Layer
│   ├── Unit Tests (Isolated, < 100ms)
│   ├── Integration Tests (Real services)
│   ├── Performance Tests (Benchmarking)
│   └── Future Features (TDD)
├── Test Infrastructure Layer
│   ├── Fixtures & Factories
│   ├── Mock Systems
│   └── Base Test Classes
├── Automation Layer
│   ├── Test Discovery
│   ├── Async Test Management
│   └── Coverage Analysis
└── Integration Layer
    ├── MCP Protocol Testing
    ├── Kanban Integration Testing
    └── Error Framework Testing
Ecosystem Integration#
Core Marcus Systems Integration#
The Testing Framework deeply integrates with all Marcus core systems:
Error Framework Integration:
# tests/unit/core/test_error_framework.py
from src.core.error_framework import (
    MarcusBaseError, ErrorContext, RemediationSuggestion,
    KanbanIntegrationError, AIProviderError
)
class TestErrorFramework:
    def test_error_context_creation(self):
        """Test creating error context with agent tracking"""
        context = ErrorContext(
            operation="task_assignment",
            agent_id="agent-001",
            task_id="TASK-123"
        )
        assert context.correlation_id is not None
MCP Server Integration:
# tests/unit/mcp/test_marcus_server_complete.py
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_tool_execution():
    """Test MCP tool execution with proper context"""
    server = MarcusServer()
    result = await server.call_tool("register_agent", {
        "agent_id": "test-001",
        "name": "Test Agent",
        "role": "Developer"
    })
    assert result["success"] is True
AI Engine Integration:
# tests/integration/ai/test_prd_parser_real_ai.py
@pytest.mark.integration
async def test_ai_prd_analysis():
    """Test AI engine with real provider integration"""
    engine = AIAnalysisEngine()
    result = await engine.analyze_prd(sample_prd_text)
    assert result.confidence > 0.8
External System Integration#
Kanban Provider Testing:
# Shared fixture in conftest.py
@pytest.fixture
async def mcp_session() -> AsyncGenerator[ClientSession, None]:
    """MCP session connected to Kanban server"""
    server_params = StdioServerParameters(
        command="/opt/homebrew/bin/node",
        args=["/Users/lwgray/dev/kanban-mcp/dist/index.js"],
        env={
            "PLANKA_BASE_URL": "http://localhost:3333",
            "PLANKA_AGENT_EMAIL": "demo@demo.demo",
            "PLANKA_AGENT_PASSWORD": "demo"  # pragma: allowlist secret
        }
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session
Test Board Management:
import json
@pytest.fixture
async def test_board(mcp_session, test_project_id, test_board_name):
    """Auto-managed test board with cleanup"""
    result = await mcp_session.call_tool("mcp_kanban_project_board_manager", {
        "action": "create_board",
        "projectId": test_project_id,
        "name": test_board_name
    })
    # Assumption: the tool returns the new board as a JSON text payload
    board_data = json.loads(result.content[0].text)
    yield board_data
    # Automatic cleanup after test
Workflow Integration#
The Testing Framework integrates into the Marcus workflow at multiple points:
Development Workflow Integration#
create_project → register_agent → request_next_task → report_progress → report_blocker → finish_task
       ↓                ↓                 ↓                  ↓                 ↓              ↓
  Unit Tests       Integration    Performance Tests    Error Testing    Recovery Tests    E2E Tests
Pre-Development: Future feature tests guide TDD implementation
During Development: Unit tests provide rapid feedback
Integration Phase: Integration tests verify component interactions
Performance Validation: Benchmarking ensures scalability
Error Handling: Error framework tests validate recovery paths
Deployment: E2E tests verify complete workflows
Test Decision Flowchart Integration#
The framework uses an intelligent test placement system:
┌─────────────────────────────────────────────────────────────┐
│                   START: New Test Needed                    │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
        ┌──────────────────────────────────────────┐
        │ Q1: Does it need external services?      │
        │ (Database, API, Network, File System)    │
        └────────────┬─────────────────┬───────────┘
                     │ NO              │ YES
                     ▼                 ▼
        ┌───────────────────┐    ┌────────────────────────┐
        │ Q2: Single unit?  │    │ Q3: Future feature?    │
        │ (class/function)  │    │ (TDD/unimplemented)    │
        └──┬──────────┬─────┘    └──┬───────────────┬─────┘
           │ YES      │ NO          │ YES           │ NO
           ▼          ▼             ▼               ▼
      tests/unit/  tests/unit/  tests/future_  tests/integration/
      [component]/ test_*.py    features/      [type]/
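The three questions in the flowchart can be read as a small pure function. The sketch below is only an illustration of that decision logic: the function name `suggest_test_dir`, its parameters, and the default directory names are assumptions made for this example and are not part of the Marcus codebase.

```python
from pathlib import PurePosixPath


def suggest_test_dir(
    needs_external_services: bool,
    single_unit: bool = True,
    future_feature: bool = False,
    component: str = "core",
    integration_type: str = "e2e",
) -> PurePosixPath:
    """Walk Q1 to Q3 of the flowchart and return the suggested directory."""
    if not needs_external_services:
        # Q2: a single class/function goes under its component directory;
        # anything broader stays at the top level of tests/unit/.
        if single_unit:
            return PurePosixPath("tests/unit") / component
        return PurePosixPath("tests/unit")
    # Q3: unimplemented (TDD) features are kept apart from real-service tests.
    if future_feature:
        return PurePosixPath("tests/future_features")
    return PurePosixPath("tests/integration") / integration_type
```

For example, `suggest_test_dir(False, component="mcp")` yields `tests/unit/mcp`, while `suggest_test_dir(True, integration_type="api")` yields `tests/integration/api`.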
What Makes This System Special#
1. Autonomous Agent-Aware Testing#
Unlike traditional testing frameworks, Marcus's system is designed for autonomous agents:
# tests/fixtures/factories.py
class AgentFactory:
    """Factory for creating WorkerStatus objects for testing"""
    _counter = 0  # incremented on every create() call
    @classmethod
    def create(cls, **kwargs) -> WorkerStatus:
        """Create agent with realistic autonomous behavior patterns"""
        cls._counter += 1
        defaults = {
            'worker_id': f'agent-{cls._counter:04d}',
            'current_tasks': [],
            'performance_score': 1.0,
            'skills': ['python', 'autonomous-execution'],
            'capacity': 40
        }
        defaults.update(kwargs)  # caller-supplied values override the defaults
        return WorkerStatus(**defaults)
2. Rich Error Context Testing#
Deep integration with the Error Framework provides unprecedented debugging capability:
def test_error_context_propagation(self):
    """Test error context flows through autonomous systems"""
    with error_context("task_assignment", agent_id="agent-001"):
        try:
            raise KanbanIntegrationError(
                board_name="test_board",
                operation="create_task"
            )
        except MarcusBaseError as e:
            assert e.context.agent_id == "agent-001"
            assert e.context.operation == "task_assignment"
3. Real-World Integration Testing#
Tests operate against real external services with automatic cleanup:
@pytest.mark.integration
async def test_real_kanban_workflow(test_board):
    """Test complete workflow with real Kanban backend"""
    # Test operates on real board, auto-cleaned after test
    task_result = await create_task(test_board["id"], task_data)
    progress_result = await update_progress(task_result["id"], 50)
    assert progress_result["status"] == "in_progress"
4. Performance-Aware Testing#
Built-in performance benchmarking with scaling analysis:
# tests/performance/benchmark_scaling.py
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
    scenario: str
    total_connections: int
    successful_requests: int
    avg_response_time: float
    p95_response_time: float
    requests_per_second: float
    memory_usage_mb: float
    cpu_usage_percent: float
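The timing fields of such a result are typically derived from raw per-request latencies. The helper below is a sketch under that assumption: `summarize_latencies` is a hypothetical name, the nearest-rank definition of the 95th percentile is one common choice rather than necessarily the one Marcus uses, and the unit of the returned times is whatever unit the input samples use.

```python
import statistics
from typing import Dict, Sequence


def summarize_latencies(
    latencies: Sequence[float], wall_clock_s: float
) -> Dict[str, float]:
    """Derive average, p95, and throughput from raw request latencies."""
    if not latencies or wall_clock_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ordered = sorted(latencies)
    n = len(ordered)
    # Nearest-rank p95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids float rounding in ceil().
    rank = (95 * n + 99) // 100
    return {
        "avg_response_time": statistics.fmean(ordered),
        "p95_response_time": ordered[rank - 1],
        "requests_per_second": n / wall_clock_s,
    }
```

The returned keys match the corresponding `BenchmarkResult` field names, so the dictionary can be unpacked into the dataclass alongside the remaining fields.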
5. Future-Driven Development#
TDD support for unimplemented features guides development:
# tests/future_features/ai/core/test_ai_engine.py
class TestMarcusAIEngine:
    """Test the core AI engine with hybrid intelligence"""
    async def test_hybrid_decision_making(self, ai_engine):
        """Test AI+rule hybrid decision framework"""
        # This test drives implementation of hybrid AI system
        decision = await ai_engine.make_hybrid_decision(context)
        assert decision.confidence > 0.9
        assert decision.reasoning is not None
Technical Implementation#
Test Organization Structure#
tests/
├── unit/                    # 181 tests - 100% passing
│   ├── ai/                  # AI component tests
│   ├── core/                # Core logic tests
│   ├── mcp/                 # MCP protocol tests
│   └── visualization/       # UI component tests
├── integration/             # Real service tests
│   ├── e2e/                 # End-to-end workflows
│   ├── api/                 # API integrations
│   ├── external/            # 3rd party services
│   └── diagnostics/         # Connection/debug tests
├── performance/             # Benchmarks and load tests
│   ├── benchmarks/          # Speed benchmarks
│   └── load/                # Concurrent load tests
├── future_features/         # TDD for unimplemented
│   └── [mirrors src structure]
└── fixtures/                # Shared test data
    ├── factories.py         # Object factories
    └── __init__.py
Async Test Management#
Due to MCP protocol requirements, the framework uses sophisticated async handling:
# conftest.py
import asyncio
from typing import Iterator
import pytest
@pytest.fixture(scope="session")
def event_loop() -> Iterator[asyncio.AbstractEventLoop]:
    """Session-scoped event loop for MCP connections"""
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()
# Test implementation
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_workflow():
    """Async test with proper MCP protocol handling"""
    result = await mcp_client.call_tool("register_agent", params)
    assert result is not None
Factory Pattern for Test Data#
Consistent, realistic test data generation:
class TaskFactory:
    """Factory for creating Task objects for testing"""
    _counter = 0  # class-level counter used for auto-incrementing IDs
    @classmethod
    def create(cls, **kwargs) -> Task:
        """Create task with auto-incrementing ID and realistic defaults"""
        cls._counter += 1
        defaults = {
            'id': f"TASK-{cls._counter:04d}",
            'name': f"Task {cls._counter}",
            'status': TaskStatus.TODO,
            'priority': Priority.MEDIUM,
            'estimated_hours': 4.0,
            'dependencies': [],
            'labels': []
        }
        defaults.update(kwargs)
        return Task(**defaults)
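Other examples in this document call `TaskFactory.create_batch(...)` and `TaskFactory.create_dependency_chain(...)` without showing them. The sketch below illustrates how such helpers could be layered on `create()`. It uses a simplified stand-in model (`SketchTask`) so that it runs on its own; the helper bodies and the `reset()` method are assumptions, not the actual Marcus implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SketchTask:
    """Simplified stand-in for the real Task model."""
    id: str
    name: str
    dependencies: List[str] = field(default_factory=list)


class SketchTaskFactory:
    _counter = 0

    @classmethod
    def create(cls, **kwargs: Any) -> SketchTask:
        cls._counter += 1
        defaults = {
            "id": f"TASK-{cls._counter:04d}",
            "name": f"Task {cls._counter}",
            "dependencies": [],
        }
        defaults.update(kwargs)
        return SketchTask(**defaults)

    @classmethod
    def create_batch(cls, count: int, **kwargs: Any) -> List[SketchTask]:
        """Create `count` tasks that share the same overrides."""
        return [cls.create(**kwargs) for _ in range(count)]

    @classmethod
    def create_dependency_chain(cls, length: int) -> List[SketchTask]:
        """Create tasks where each one depends on the task before it."""
        chain: List[SketchTask] = []
        for _ in range(length):
            deps = [chain[-1].id] if chain else []
            chain.append(cls.create(dependencies=deps))
        return chain

    @classmethod
    def reset(cls) -> None:
        """Reset the counter so IDs are predictable within a test."""
        cls._counter = 0
```

Calling `reset()` in a fixture before each test keeps generated IDs deterministic, which avoids assertions that depend on test execution order.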
Mock System Architecture#
Comprehensive mocking for external dependencies:
# tests/utils/base.py
from unittest.mock import AsyncMock
class BaseTestCase:
    def create_mock_kanban_client(self) -> AsyncMock:
        """Create properly configured mock kanban client"""
        client = AsyncMock()
        client.get_available_tasks = AsyncMock(return_value=[])
        client.update_task_progress = AsyncMock()
        client.get_board_summary = AsyncMock(return_value={
            'totalCards': 0,
            'doneCount': 0,
            'inProgressCount': 0
        })
        return client
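As a usage illustration, the sketch below shows how a mock client of this kind can stand in for the real Kanban client in a unit test. The function `count_available_tasks` is a hypothetical piece of code under test, invented for this example; only `AsyncMock` and `asyncio.run` are standard-library calls.

```python
import asyncio
from unittest.mock import AsyncMock


async def count_available_tasks(client) -> int:
    """Hypothetical code under test: fetch tasks and count them."""
    tasks = await client.get_available_tasks()
    return len(tasks)


def run_example() -> int:
    # Configure the mock the same way create_mock_kanban_client() does,
    # but return one task so the result is easy to check.
    client = AsyncMock()
    client.get_available_tasks = AsyncMock(return_value=[{"id": "TASK-0001"}])
    count = asyncio.run(count_available_tasks(client))
    # AsyncMock records awaits, so the interaction itself can be verified.
    client.get_available_tasks.assert_awaited_once()
    return count
```

No network or Kanban server is involved, which is what keeps tests of this shape within the unit-test time budget.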
Coverage and Quality Assurance#
# Requirements enforced by framework
- Minimum coverage: 80%
- Unit tests must run in < 100ms
- All external dependencies mocked in unit tests
- Integration tests use real services
- Future feature tests guide TDD implementation
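The document does not say how the 100 ms limit is enforced. One possible approach, sketched here purely as an assumption, is a decorator that times the test body and fails when the budget is exceeded. The name `within_budget` is hypothetical.

```python
import functools
import time
from typing import Any, Callable


def within_budget(max_ms: float) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Fail the wrapped test if its body takes max_ms or longer."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            assert elapsed_ms < max_ms, (
                f"{func.__name__} took {elapsed_ms:.1f} ms (budget {max_ms} ms)"
            )
            return result

        return wrapper

    return decorator


@within_budget(100)
def test_fast_unit() -> None:
    assert sum(range(10)) == 45
```

Wall-clock checks like this can be noisy on a loaded CI machine, so a project might prefer to report slow tests (for example with pytest's `--durations` option) rather than fail on them.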
Pros and Cons#
Pros#
Autonomous Agent Specialization:
Purpose-built for agent testing scenarios
Rich context tracking for debugging agent interactions
Real-world integration testing with automatic cleanup
Comprehensive Coverage:
Four-layer test organization (unit/integration/performance/future)
181 unit tests with 100% pass rate
Performance benchmarking with scaling analysis
Developer Experience:
Intelligent test placement guidance
Rich factory pattern for test data
Comprehensive fixture ecosystem
Quality Assurance:
80% minimum coverage requirement
Automatic async handling for MCP protocol
Integration with Marcus Error Framework
Cons#
Complexity:
Four-layer organization can be overwhelming for simple projects
Async test management requires understanding of event loops
Factory pattern adds indirection for simple test cases
External Dependencies:
Integration tests require running Kanban MCP server
Performance tests need stable external services
Real-world testing can be flaky due to network issues
Learning Curve:
Test placement decision tree requires understanding
MCP protocol testing has specific requirements
Future feature testing paradigm is non-standard
Maintenance Overhead:
Test board cleanup requires careful lifecycle management
Factory patterns need maintenance as models evolve
Async fixtures are complex to debug
Design Rationale#
Why This Approach Was Chosen#
Autonomous Agent Requirements: Traditional testing frameworks don't account for autonomous agents that make independent decisions and interact with external services. Marcus needed testing that could handle:
Agent state tracking across operations
Error context propagation through autonomous systems
Real-world integration testing with external boards
MCP Protocol Complexity: The Model Context Protocol requires sophisticated async handling and session management that standard pytest async plugins couldn't handle reliably:
# Standard pytest-asyncio had introspection issues
# Solution: pytest-anyio with explicit backend selection
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_operation():
    # Reliable async test execution
    ...
TDD for AI Systems: AI system development benefits enormously from writing tests first, as it forces clarification of expected behavior before implementation:
# tests/future_features/ai/core/test_ai_engine.py
async def test_hybrid_decision_making(self, ai_engine):
    """This test drives AI engine development"""
    # Define expected AI behavior before implementation
    decision = await ai_engine.make_hybrid_decision(context)
    assert decision.confidence > 0.9
Four-Layer Organization: The separation into unit/integration/performance/future reflects the realities of autonomous agent development:
Unit: Fast feedback for individual components
Integration: Real-world behavior validation
Performance: Scaling requirements for multi-agent systems
Future: TDD guidance for complex AI features
Future Evolution#
Planned Enhancements#
AI-Powered Test Generation:
# Future: AI generates tests from code changes
class AITestGenerator:
    async def generate_tests_for_commit(self, commit_diff):
        """Generate comprehensive tests for code changes"""
        analysis = await self.analyze_code_changes(commit_diff)
        return await self.generate_test_suite(analysis)
Distributed Testing:
# Future: Multi-node test execution
class DistributedTestRunner:
    async def run_tests_across_agents(self, test_suite):
        """Distribute tests across multiple Marcus agents"""
        results = await self.coordinate_test_execution(test_suite)
        return self.aggregate_results(results)
Real-Time Test Feedback:
# Future: Live test results during development
class LiveTestRunner:
    async def watch_code_changes(self):
        """Run relevant tests on code changes"""
        async for change in self.watch_filesystem():
            relevant_tests = await self.find_affected_tests(change)
            await self.run_tests_live(relevant_tests)
Predictive Test Selection:
# Future: ML-powered test selection
class PredictiveTestSelector:
    async def select_optimal_tests(self, change_context):
        """Use ML to select tests most likely to catch issues"""
        risk_analysis = await self.analyze_change_risk(change_context)
        return await self.select_high_value_tests(risk_analysis)
Architecture Evolution#
Microservice Test Architecture: As Marcus scales, testing will evolve to support distributed microservice architectures with service-specific test suites and cross-service integration testing.
Continuous Testing Pipeline: Integration with CI/CD for continuous test execution, automated test generation, and intelligent test result analysis.
Performance Prediction: ML models to predict performance impacts from code changes and automatically trigger appropriate performance test suites.
Task Complexity Handling#
Simple Tasks#
For simple tasks like configuration updates or single-function implementations:
# Simple task testing approach
def test_config_update():
    """Test simple configuration change"""
    config = load_config()
    config.update_setting("max_agents", 10)
    assert config.max_agents == 10
Characteristics:
Single unit test
Minimal mocking
Fast execution (< 50ms)
Direct assertion
Complex Tasks#
For complex tasks like multi-agent coordination or AI system integration:
# Complex task testing approach
@pytest.mark.integration
@pytest.mark.e2e
class TestMultiAgentCoordination:
    async def test_coordinated_task_execution(self):
        """Test complex multi-agent task coordination"""
        # Setup multiple agents
        agents = [AgentFactory.create() for _ in range(3)]
        # Create interdependent tasks
        tasks = TaskFactory.create_dependency_chain(3)
        # Test coordination
        coordinator = TaskCoordinator()
        results = await coordinator.execute_coordinated_tasks(agents, tasks)
        # Verify coordination behavior
        assert all(r.success for r in results)
        assert results[0].completion_time < results[1].start_time
Characteristics:
Multiple test layers (unit + integration + e2e)
Complex setup with factories
Real service integration
Multi-step verification
AI-Driven Tasks#
For AI-powered tasks requiring intelligence and decision-making:
# AI task testing approach
@pytest.mark.integration
@pytest.mark.ai
class TestAITaskAnalysis:
    async def test_intelligent_task_breakdown(self, ai_engine):
        """Test AI-powered task analysis and breakdown"""
        # Provide complex project description
        project_description = load_test_prd("complex_ecommerce.md")
        # Test AI analysis
        analysis = await ai_engine.analyze_and_breakdown(project_description)
        # Verify AI reasoning
        assert analysis.confidence > 0.8
        assert len(analysis.subtasks) >= 5
        assert analysis.estimated_complexity == "high"
        # Test dependency inference
        dependencies = analysis.inferred_dependencies
        assert any(d.relationship == "blocks" for d in dependencies)
Board-Specific Considerations#
Kanban Board Integration#
The testing framework has special handling for different Kanban board configurations:
# Board-specific test configuration
@pytest.fixture
def kanban_board_config():
    """Configuration for different board types"""
    return {
        "simple_board": {
            "columns": ["To Do", "In Progress", "Done"],
            "complexity": "low"
        },
        "advanced_board": {
            "columns": ["Backlog", "Analysis", "Development", "Testing", "Review", "Done"],
            "complexity": "high"
        }
    }
@pytest.mark.parametrize("board_type", ["simple_board", "advanced_board"])
async def test_board_specific_behavior(board_type, kanban_board_config):
    """Test behavior adapts to board configuration"""
    config = kanban_board_config[board_type]
    board = await create_test_board(config)
    # Test that Marcus adapts to board structure
    task_flow = await analyze_board_task_flow(board)
    assert len(task_flow.stages) == len(config["columns"])
Board Quality Testing#
Special tests for board quality and structure validation:
# tests/unit/detection/test_board_analyzer.py
class TestBoardAnalyzer:
    def test_board_quality_assessment(self):
        """Test board quality analysis"""
        analyzer = BoardAnalyzer()
        # Test high-quality board
        good_board = create_well_structured_board()
        quality = analyzer.assess_board_quality(good_board)
        assert quality.score > 0.8
        assert quality.issues == []
        # Test low-quality board
        poor_board = create_poorly_structured_board()
        quality = analyzer.assess_board_quality(poor_board)
        assert quality.score < 0.4
        assert len(quality.issues) > 0
Cato Integration#
Currently, the Marcus Testing Framework doesn't have direct Cato integration, but it's designed to support it:
Planned Cato Integration#
# Future Cato integration
class CatoTestIntegration:
    """Integration layer for Cato testing"""
    async def test_cato_decision_quality(self, decision_context):
        """Test Cato's decision-making quality"""
        cato = CatoEngine()
        decision = await cato.make_decision(decision_context)
        # Test decision quality metrics
        assert decision.confidence > 0.8
        assert decision.reasoning_steps >= 3
        assert decision.considers_alternatives
    async def test_cato_marcus_collaboration(self, marcus_context):
        """Test collaboration between Cato and Marcus"""
        collaboration = CatoMarcusCollaboration()
        result = await collaboration.coordinate_decision(marcus_context)
        assert result.marcus_execution_plan is not None
        assert result.cato_oversight_active is True
Integration Architecture#
Marcus Testing Framework
├── Core Testing (Current)
├── MCP Integration (Current)
├── Kanban Integration (Current)
└── Cato Integration (Planned)
    ├── Decision Quality Tests
    ├── Collaboration Tests
    └── Override Scenario Tests
Typical Scenario Integration#
The Testing Framework integrates into the standard Marcus workflow at each phase:
1. create_project Phase#
# tests/integration/project_creation/test_create_project_workflow.py
@pytest.mark.integration
async def test_complete_project_creation():
    """Test project creation end-to-end"""
    project_spec = {
        "name": "E-commerce Platform",
        "description": "Full-featured online store",
        "complexity": "high"
    }
    result = await create_project(project_spec)
    assert result.project_id is not None
    assert len(result.initial_tasks) >= 10
    assert result.board_created is True
Testing Focus: Project initialization, task generation, board setup
2. register_agent Phase#
# tests/integration/e2e/test_marcus_workflows.py
@pytest.mark.integration
async def test_agent_registration_workflow():
    """Test agent registration and capability matching"""
    agent_spec = {
        "agent_id": "agent-001",
        "name": "Senior Developer",
        "role": "Full Stack Developer",
        "skills": ["python", "react", "postgresql"]
    }
    result = await register_agent(agent_spec)
    assert result.registered is True
    assert result.capability_score > 0.7
    assert result.initial_assignment is not None
Testing Focus: Agent registration, skill matching, initial assignment
3. request_next_task Phase#
async def test_intelligent_task_assignment():
    """Test AI-powered task assignment logic"""
    agent = AgentFactory.create(skills=["python", "testing"])
    available_tasks = TaskFactory.create_batch(5, priority="high")
    assignment = await request_next_task(agent.worker_id)
    assert assignment.task_id is not None
    assert assignment.skill_match_score > 0.8
    assert assignment.estimated_completion_time is not None
Testing Focus: Task matching, AI assignment logic, dependency resolution
4. report_progress Phase#
async def test_progress_reporting_workflow():
    """Test progress reporting and tracking"""
    task = TaskFactory.create(status=TaskStatus.IN_PROGRESS)
    progress_data = {
        "task_id": task.id,
        "progress": 75,
        "status": "in_progress",
        "message": "API endpoints implemented, working on frontend"
    }
    result = await report_progress(progress_data)
    assert result.progress_recorded is True
    assert result.next_milestone is not None
    assert result.risk_assessment.level == "low"
Testing Focus: Progress tracking, milestone detection, risk assessment
5. report_blocker Phase#
async def test_blocker_reporting_and_resolution():
    """Test blocker reporting and AI-powered resolution"""
    blocker_data = {
        "task_id": "TASK-001",
        "agent_id": "agent-001",
        "blocker_description": "Database connection failing in test environment",
        "severity": "medium"
    }
    result = await report_blocker(blocker_data)
    assert result.blocker_recorded is True
    assert len(result.ai_suggestions) >= 3
    assert result.escalation_needed is False
Testing Focus: Blocker analysis, AI suggestions, escalation logic
6. finish_task Phase#
async def test_task_completion_workflow():
    """Test task completion and knowledge capture"""
    completion_data = {
        "task_id": "TASK-001",
        "agent_id": "agent-001",
        "status": "completed",
        "completion_notes": "Feature implemented with comprehensive tests",
        "artifacts": ["src/api/users.py", "tests/test_users.py"]
    }
    result = await finish_task(completion_data)
    assert result.task_completed is True
    assert result.knowledge_captured is True
    assert result.next_task_suggested is not None
Testing Focus: Completion validation, knowledge capture, workflow continuation
End-to-End Workflow Testing#
@pytest.mark.integration
@pytest.mark.e2e
@pytest.mark.slow
async def test_complete_marcus_workflow():
    """Test complete workflow from project creation to task completion"""
    # 1. Create project
    project = await create_project(sample_project_spec)
    # 2. Register agent
    agent_result = await register_agent(sample_agent_spec)
    # 3. Request first task
    task_assignment = await request_next_task(agent_result.agent_id)
    # 4. Report progress
    await report_progress({
        "task_id": task_assignment.task_id,
        "progress": 50,
        "status": "in_progress"
    })
    # 5. Complete task
    completion = await finish_task({
        "task_id": task_assignment.task_id,
        "status": "completed"
    })
    # Verify complete workflow
    assert project.created is True
    assert agent_result.registered is True
    assert task_assignment.assigned is True
    assert completion.completed is True
    # Verify system state
    project_state = await get_project_status(project.project_id)
    assert project_state.tasks_completed == 1
    assert project_state.agents_active == 1
This comprehensive end-to-end test validates the entire Marcus workflow, ensuring all systems work together correctly and that the testing framework provides visibility into each phase of the autonomous agent lifecycle.
The Marcus Testing Framework represents a sophisticated approach to testing autonomous agent systems, providing the structure, tools, and patterns necessary to ensure reliable operation in complex, multi-agent environments. Its deep integration with Marcus's core systems, combined with its support for TDD and real-world testing scenarios, makes it an essential component of the Marcus ecosystem.