Marcus Testing Framework#

Table of Contents#

Overview
Architecture
Ecosystem Integration
Workflow Integration
What Makes This System Special
Technical Implementation
Pros and Cons
Design Rationale
Future Evolution
Task Complexity Handling
Board-Specific Considerations
Cato Integration
Typical Scenario Integration

Overview#

The Marcus Testing Framework is a comprehensive, multi-layered testing system designed specifically for autonomous agent environments. It provides structured test organization, rich error handling, and intelligent test automation to ensure reliable operation of the Marcus ecosystem.

What the System Does#

The Testing Framework provides:

Structured Test Organization: Clear separation of unit, integration, performance, and future feature tests
Autonomous Agent Testing: Specialized fixtures and utilities for testing agent interactions
Rich Error Context: Integration with Marcus Error Framework for detailed failure analysis
Intelligent Test Discovery: Automated test placement based on dependency analysis
Performance Benchmarking: Comprehensive performance testing with scaling metrics
TDD Support: Future features testing for test-driven development workflows

System Architecture#

Marcus Testing Framework
├── Test Organization Layer
│   ├── Unit Tests (Isolated, < 100ms)
│   ├── Integration Tests (Real services)
│   ├── Performance Tests (Benchmarking)
│   └── Future Features (TDD)
├── Test Infrastructure Layer
│   ├── Fixtures & Factories
│   ├── Mock Systems
│   └── Base Test Classes
├── Automation Layer
│   ├── Test Discovery
│   ├── Async Test Management
│   └── Coverage Analysis
└── Integration Layer
    ├── MCP Protocol Testing
    ├── Kanban Integration Testing
    └── Error Framework Testing

Ecosystem Integration#

Core Marcus Systems Integration#

The Testing Framework deeply integrates with all Marcus core systems:

Error Framework Integration:

# tests/unit/core/test_error_framework.py
from src.core.error_framework import (
    MarcusBaseError, ErrorContext, RemediationSuggestion,
    KanbanIntegrationError, AIProviderError
)

class TestErrorFramework:
    def test_error_context_creation(self):
        """Test creating error context with agent tracking"""
        context = ErrorContext(
            operation="task_assignment",
            agent_id="agent-001",
            task_id="TASK-123"
        )
        assert context.correlation_id is not None

MCP Server Integration:

# tests/unit/mcp/test_marcus_server_complete.py
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_tool_execution():
    """Test MCP tool execution with proper context"""
    server = MarcusServer()
    result = await server.call_tool("register_agent", {
        "agent_id": "test-001",
        "name": "Test Agent",
        "role": "Developer"
    })
    assert result["success"] is True

AI Engine Integration:

# tests/integration/ai/test_prd_parser_real_ai.py
@pytest.mark.integration
async def test_ai_prd_analysis():
    """Test AI engine with real provider integration"""
    engine = AIAnalysisEngine()
    result = await engine.analyze_prd(sample_prd_text)
    assert result.confidence > 0.8

External System Integration#

Kanban Provider Testing:

# Shared fixture in conftest.py
@pytest.fixture
async def mcp_session() -> AsyncGenerator[ClientSession, None]:
    """MCP session connected to Kanban server"""
    server_params = StdioServerParameters(
        command="/opt/homebrew/bin/node",
        args=["/Users/lwgray/dev/kanban-mcp/dist/index.js"],
        env={
            "PLANKA_BASE_URL": "http://localhost:3333",
            "PLANKA_AGENT_EMAIL": "demo@demo.demo",
            "PLANKA_AGENT_PASSWORD": "demo"  # pragma: allowlist secret
        }
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session

Test Board Management:

@pytest.fixture
async def test_board(mcp_session, test_project_id, test_board_name):
    """Auto-managed test board with cleanup"""
    result = await mcp_session.call_tool("mcp_kanban_project_board_manager", {
        "action": "create_board",
        "projectId": test_project_id,
        "name": test_board_name
    })
    yield board_data
    # Automatic cleanup after test

Workflow Integration#

The Testing Framework integrates into the Marcus workflow at multiple points:

Development Workflow Integration#

create_project → register_agent → request_next_task → report_progress → report_blocker → finish_task
      ↓               ↓                ↓                    ↓              ↓             ↓
   Unit Tests    Integration     Performance Tests    Error Testing   Recovery Tests  E2E Tests

Pre-Development: Future feature tests guide TDD implementation During Development: Unit tests provide rapid feedback Integration Phase: Integration tests verify component interactions Performance Validation: Benchmarking ensures scalability Error Handling: Error framework tests validate recovery paths Deployment: E2E tests verify complete workflows

Test Decision Flowchart Integration#

The framework uses an intelligent test placement system:

┌─────────────────────────────────────────────────────────────┐
│                    START: New Test Needed                    │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
        ┌──────────────────────────────────────────┐
        │  Q1: Does it need external services?     │
        │  (Database, API, Network, File System)   │
        └────────────┬─────────────────┬───────────┘
                     │ NO              │ YES
                     ▼                 ▼
         ┌───────────────────┐   ┌────────────────────────┐
         │ Q2: Single unit?  │   │ Q3: Future feature?    │
         │ (class/function)  │   │ (TDD/unimplemented)    │
         └──┬──────────┬─────┘   └──┬───────────────┬────┘
            │ YES      │ NO         │ YES           │ NO
            ▼          ▼            ▼               ▼
     tests/unit/    tests/unit/  tests/future_   tests/integration/
     [component]/   test_*.py   features/       [type]/

What Makes This System Special#

1. Autonomous Agent-Aware Testing#

Unlike traditional testing frameworks, Marcus’s system is designed for autonomous agents:

# tests/fixtures/factories.py
class AgentFactory:
    """Factory for creating WorkerStatus objects for testing"""

    @classmethod
    def create(cls, **kwargs) -> WorkerStatus:
        """Create agent with realistic autonomous behavior patterns"""
        defaults = {
            'worker_id': f'agent-{cls._counter:04d}',
            'current_tasks': [],
            'performance_score': 1.0,
            'skills': ['python', 'autonomous-execution'],
            'capacity': 40
        }
        return WorkerStatus(**defaults)

2. Rich Error Context Testing#

Deep integration with the Error Framework provides unprecedented debugging capability:

def test_error_context_propagation(self):
    """Test error context flows through autonomous systems"""
    with error_context("task_assignment", agent_id="agent-001"):
        try:
            raise KanbanIntegrationError(
                board_name="test_board",
                operation="create_task"
            )
        except MarcusBaseError as e:
            assert e.context.agent_id == "agent-001"
            assert e.context.operation == "task_assignment"

3. Real-World Integration Testing#

Tests operate against real external services with automatic cleanup:

@pytest.mark.integration
async def test_real_kanban_workflow(test_board):
    """Test complete workflow with real Kanban backend"""
    # Test operates on real board, auto-cleaned after test
    task_result = await create_task(test_board["id"], task_data)
    progress_result = await update_progress(task_result["id"], 50)
    assert progress_result["status"] == "in_progress"

4. Performance-Aware Testing#

Built-in performance benchmarking with scaling analysis:

# tests/performance/benchmark_scaling.py
@dataclass
class BenchmarkResult:
    scenario: str
    total_connections: int
    successful_requests: int
    avg_response_time: float
    p95_response_time: float
    requests_per_second: float
    memory_usage_mb: float
    cpu_usage_percent: float

5. Future-Driven Development#

TDD support for unimplemented features guides development:

# tests/future_features/ai/core/test_ai_engine.py
class TestMarcusAIEngine:
    """Test the core AI engine with hybrid intelligence"""

    async def test_hybrid_decision_making(self, ai_engine):
        """Test AI+rule hybrid decision framework"""
        # This test drives implementation of hybrid AI system
        decision = await ai_engine.make_hybrid_decision(context)
        assert decision.confidence > 0.9
        assert decision.reasoning is not None

Technical Implementation#

Test Organization Structure#

tests/
├── unit/                        # 181 tests - 100% passing
│   ├── ai/                     # AI component tests
│   ├── core/                   # Core logic tests
│   ├── mcp/                    # MCP protocol tests
│   └── visualization/          # UI component tests
├── integration/                 # Real service tests
│   ├── e2e/                   # End-to-end workflows
│   ├── api/                   # API integrations
│   ├── external/              # 3rd party services
│   └── diagnostics/           # Connection/debug tests
├── performance/                 # Benchmarks and load tests
│   ├── benchmarks/            # Speed benchmarks
│   └── load/                  # Concurrent load tests
├── future_features/            # TDD for unimplemented
│   └── [mirrors src structure]
└── fixtures/                   # Shared test data
    ├── factories.py           # Object factories
    └── __init__.py

Async Test Management#

Due to MCP protocol requirements, the framework uses sophisticated async handling:

# conftest.py
@pytest.fixture(scope="session")
def event_loop() -> asyncio.AbstractEventLoop:
    """Session-scoped event loop for MCP connections"""
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()

# Test implementation
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_workflow():
    """Async test with proper MCP protocol handling"""
    result = await mcp_client.call_tool("register_agent", params)
    assert result is not None

Factory Pattern for Test Data#

Consistent, realistic test data generation:

class TaskFactory:
    """Factory for creating Task objects for testing"""

    @classmethod
    def create(cls, **kwargs) -> Task:
        """Create task with auto-incrementing ID and realistic defaults"""
        cls._counter += 1
        defaults = {
            'id': f"TASK-{cls._counter:04d}",
            'name': f"Task {cls._counter}",
            'status': TaskStatus.TODO,
            'priority': Priority.MEDIUM,
            'estimated_hours': 4.0,
            'dependencies': [],
            'labels': []
        }
        defaults.update(kwargs)
        return Task(**defaults)

Mock System Architecture#

Comprehensive mocking for external dependencies:

# tests/utils/base.py
class BaseTestCase:
    def create_mock_kanban_client(self) -> AsyncMock:
        """Create properly configured mock kanban client"""
        client = AsyncMock()
        client.get_available_tasks = AsyncMock(return_value=[])
        client.update_task_progress = AsyncMock()
        client.get_board_summary = AsyncMock(return_value={
            'totalCards': 0,
            'doneCount': 0,
            'inProgressCount': 0
        })
        return client

Coverage and Quality Assurance#

# Requirements enforced by framework
- Minimum coverage: 80%
- Unit tests must run in < 100ms
- All external dependencies mocked in unit tests
- Integration tests use real services
- Future feature tests guide TDD implementation

Pros and Cons#

Pros#

Autonomous Agent Specialization:

Purpose-built for agent testing scenarios
Rich context tracking for debugging agent interactions
Real-world integration testing with automatic cleanup

Comprehensive Coverage:

Four-layer test organization (unit/integration/performance/future)
181 unit tests with 100% pass rate
Performance benchmarking with scaling analysis

Developer Experience:

Intelligent test placement guidance
Rich factory pattern for test data
Comprehensive fixture ecosystem

Quality Assurance:

80% minimum coverage requirement
Automatic async handling for MCP protocol
Integration with Marcus Error Framework

Cons#

Complexity:

Four-layer organization can be overwhelming for simple projects
Async test management requires understanding of event loops
Factory pattern adds indirection for simple test cases

External Dependencies:

Integration tests require running Kanban MCP server
Performance tests need stable external services
Real-world testing can be flaky due to network issues

Learning Curve:

Test placement decision tree requires understanding
MCP protocol testing has specific requirements
Future feature testing paradigm is non-standard

Maintenance Overhead:

Test board cleanup requires careful lifecycle management
Factory patterns need maintenance as models evolve
Async fixtures are complex to debug

Design Rationale#

Why This Approach Was Chosen#

Autonomous Agent Requirements: Traditional testing frameworks don’t account for autonomous agents that make independent decisions and interact with external services. Marcus needed testing that could handle:

Agent state tracking across operations
Error context propagation through autonomous systems
Real-world integration testing with external boards

MCP Protocol Complexity: The Model Context Protocol requires sophisticated async handling and session management that standard pytest async plugins couldn’t handle reliably:

# Standard pytest-asyncio had introspection issues
# Solution: pytest-anyio with explicit backend selection
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_operation():
    # Reliable async test execution

TDD for AI Systems: AI system development benefits enormously from writing tests first, as it forces clarification of expected behavior before implementation:

# tests/future_features/ai/core/test_ai_engine.py
async def test_hybrid_decision_making(self, ai_engine):
    """This test drives AI engine development"""
    # Define expected AI behavior before implementation
    decision = await ai_engine.make_hybrid_decision(context)
    assert decision.confidence > 0.9

Four-Layer Organization: The separation into unit/integration/performance/future reflects the realities of autonomous agent development:

Unit: Fast feedback for individual components
Integration: Real-world behavior validation
Performance: Scaling requirements for multi-agent systems
Future: TDD guidance for complex AI features

Future Evolution#

Planned Enhancements#

AI-Powered Test Generation:

# Future: AI generates tests from code changes
class AITestGenerator:
    async def generate_tests_for_commit(self, commit_diff):
        """Generate comprehensive tests for code changes"""
        analysis = await self.analyze_code_changes(commit_diff)
        return await self.generate_test_suite(analysis)

Distributed Testing:

# Future: Multi-node test execution
class DistributedTestRunner:
    async def run_tests_across_agents(self, test_suite):
        """Distribute tests across multiple Marcus agents"""
        results = await self.coordinate_test_execution(test_suite)
        return self.aggregate_results(results)

Real-Time Test Feedback:

# Future: Live test results during development
class LiveTestRunner:
    async def watch_code_changes(self):
        """Run relevant tests on code changes"""
        async for change in self.watch_filesystem():
            relevant_tests = await self.find_affected_tests(change)
            await self.run_tests_live(relevant_tests)

Predictive Test Selection:

# Future: ML-powered test selection
class PredictiveTestSelector:
    async def select_optimal_tests(self, change_context):
        """Use ML to select tests most likely to catch issues"""
        risk_analysis = await self.analyze_change_risk(change_context)
        return await self.select_high_value_tests(risk_analysis)

Architecture Evolution#

Microservice Test Architecture: As Marcus scales, testing will evolve to support distributed microservice architectures with service-specific test suites and cross-service integration testing.

Continuous Testing Pipeline: Integration with CI/CD for continuous test execution, automated test generation, and intelligent test result analysis.

Performance Prediction: ML models to predict performance impacts from code changes and automatically trigger appropriate performance test suites.

Task Complexity Handling#

Simple Tasks#

For simple tasks like configuration updates or single-function implementations:

# Simple task testing approach
def test_config_update():
    """Test simple configuration change"""
    config = load_config()
    config.update_setting("max_agents", 10)
    assert config.max_agents == 10

Characteristics:

Single unit test
Minimal mocking
Fast execution (< 50ms)
Direct assertion

Complex Tasks#

For complex tasks like multi-agent coordination or AI system integration:

# Complex task testing approach
@pytest.mark.integration
@pytest.mark.e2e
class TestMultiAgentCoordination:
    async def test_coordinated_task_execution(self):
        """Test complex multi-agent task coordination"""
        # Setup multiple agents
        agents = [AgentFactory.create() for _ in range(3)]

        # Create interdependent tasks
        tasks = TaskFactory.create_dependency_chain(3)

        # Test coordination
        coordinator = TaskCoordinator()
        results = await coordinator.execute_coordinated_tasks(agents, tasks)

        # Verify coordination behavior
        assert all(r.success for r in results)
        assert results[0].completion_time < results[1].start_time

Characteristics:

Multiple test layers (unit + integration + e2e)
Complex setup with factories
Real service integration
Multi-step verification

AI-Driven Tasks#

For AI-powered tasks requiring intelligence and decision-making:

# AI task testing approach
@pytest.mark.integration
@pytest.mark.ai
class TestAITaskAnalysis:
    async def test_intelligent_task_breakdown(self, ai_engine):
        """Test AI-powered task analysis and breakdown"""
        # Provide complex project description
        project_description = load_test_prd("complex_ecommerce.md")

        # Test AI analysis
        analysis = await ai_engine.analyze_and_breakdown(project_description)

        # Verify AI reasoning
        assert analysis.confidence > 0.8
        assert len(analysis.subtasks) >= 5
        assert analysis.estimated_complexity == "high"

        # Test dependency inference
        dependencies = analysis.inferred_dependencies
        assert any(d.relationship == "blocks" for d in dependencies)

Board-Specific Considerations#

Kanban Board Integration#

The testing framework has special handling for different Kanban board configurations:

# Board-specific test configuration
@pytest.fixture
def kanban_board_config():
    """Configuration for different board types"""
    return {
        "simple_board": {
            "columns": ["To Do", "In Progress", "Done"],
            "complexity": "low"
        },
        "advanced_board": {
            "columns": ["Backlog", "Analysis", "Development", "Testing", "Review", "Done"],
            "complexity": "high"
        }
    }

@pytest.mark.parametrize("board_type", ["simple_board", "advanced_board"])
async def test_board_specific_behavior(board_type, kanban_board_config):
    """Test behavior adapts to board configuration"""
    config = kanban_board_config[board_type]
    board = await create_test_board(config)

    # Test that Marcus adapts to board structure
    task_flow = await analyze_board_task_flow(board)
    assert len(task_flow.stages) == len(config["columns"])

Board Quality Testing#

Special tests for board quality and structure validation:

# tests/unit/detection/test_board_analyzer.py
class TestBoardAnalyzer:
    def test_board_quality_assessment(self):
        """Test board quality analysis"""
        analyzer = BoardAnalyzer()

        # Test high-quality board
        good_board = create_well_structured_board()
        quality = analyzer.assess_board_quality(good_board)
        assert quality.score > 0.8
        assert quality.issues == []

        # Test low-quality board
        poor_board = create_poorly_structured_board()
        quality = analyzer.assess_board_quality(poor_board)
        assert quality.score < 0.4
        assert len(quality.issues) > 0

Cato Integration#

Currently, the Marcus Testing Framework doesn’t have direct Cato integration, but it’s designed to support it:

Planned Cato Integration#

# Future Cato integration
class CatoTestIntegration:
    """Integration layer for Cato testing"""

    async def test_cato_decision_quality(self, decision_context):
        """Test Cato's decision-making quality"""
        cato = CatoEngine()
        decision = await cato.make_decision(decision_context)

        # Test decision quality metrics
        assert decision.confidence > 0.8
        assert decision.reasoning_steps >= 3
        assert decision.considers_alternatives

    async def test_cato_marcus_collaboration(self, marcus_context):
        """Test collaboration between Cato and Marcus"""
        collaboration = CatoMarcusCollaboration()
        result = await collaboration.coordinate_decision(marcus_context)

        assert result.marcus_execution_plan is not None
        assert result.cato_oversight_active is True

Integration Architecture#

Marcus Testing Framework
├── Core Testing (Current)
├── MCP Integration (Current)
├── Kanban Integration (Current)
└── Cato Integration (Planned)
    ├── Decision Quality Tests
    ├── Collaboration Tests
    └── Override Scenario Tests

Typical Scenario Integration#

The Testing Framework integrates into the standard Marcus workflow at each phase:

1. create_project Phase#

# tests/integration/project_creation/test_create_project_workflow.py
@pytest.mark.integration
async def test_complete_project_creation():
    """Test project creation end-to-end"""
    project_spec = {
        "name": "E-commerce Platform",
        "description": "Full-featured online store",
        "complexity": "high"
    }

    result = await create_project(project_spec)

    assert result.project_id is not None
    assert len(result.initial_tasks) >= 10
    assert result.board_created is True

Testing Focus: Project initialization, task generation, board setup

2. register_agent Phase#

# tests/integration/e2e/test_marcus_workflows.py
@pytest.mark.integration
async def test_agent_registration_workflow():
    """Test agent registration and capability matching"""
    agent_spec = {
        "agent_id": "agent-001",
        "name": "Senior Developer",
        "role": "Full Stack Developer",
        "skills": ["python", "react", "postgresql"]
    }

    result = await register_agent(agent_spec)

    assert result.registered is True
    assert result.capability_score > 0.7
    assert result.initial_assignment is not None

Testing Focus: Agent registration, skill matching, initial assignment

3. request_next_task Phase#

async def test_intelligent_task_assignment():
    """Test AI-powered task assignment logic"""
    agent = AgentFactory.create(skills=["python", "testing"])
    available_tasks = TaskFactory.create_batch(5, priority="high")

    assignment = await request_next_task(agent.worker_id)

    assert assignment.task_id is not None
    assert assignment.skill_match_score > 0.8
    assert assignment.estimated_completion_time is not None

Testing Focus: Task matching, AI assignment logic, dependency resolution

4. report_progress Phase#

async def test_progress_reporting_workflow():
    """Test progress reporting and tracking"""
    task = TaskFactory.create(status=TaskStatus.IN_PROGRESS)
    progress_data = {
        "task_id": task.id,
        "progress": 75,
        "status": "in_progress",
        "message": "API endpoints implemented, working on frontend"
    }

    result = await report_progress(progress_data)

    assert result.progress_recorded is True
    assert result.next_milestone is not None
    assert result.risk_assessment.level == "low"

Testing Focus: Progress tracking, milestone detection, risk assessment

5. report_blocker Phase#

async def test_blocker_reporting_and_resolution():
    """Test blocker reporting and AI-powered resolution"""
    blocker_data = {
        "task_id": "TASK-001",
        "agent_id": "agent-001",
        "blocker_description": "Database connection failing in test environment",
        "severity": "medium"
    }

    result = await report_blocker(blocker_data)

    assert result.blocker_recorded is True
    assert len(result.ai_suggestions) >= 3
    assert result.escalation_needed is False

Testing Focus: Blocker analysis, AI suggestions, escalation logic

6. finish_task Phase#

async def test_task_completion_workflow():
    """Test task completion and knowledge capture"""
    completion_data = {
        "task_id": "TASK-001",
        "agent_id": "agent-001",
        "status": "completed",
        "completion_notes": "Feature implemented with comprehensive tests",
        "artifacts": ["src/api/users.py", "tests/test_users.py"]
    }

    result = await finish_task(completion_data)

    assert result.task_completed is True
    assert result.knowledge_captured is True
    assert result.next_task_suggested is not None

Testing Focus: Completion validation, knowledge capture, workflow continuation

End-to-End Workflow Testing#

@pytest.mark.integration
@pytest.mark.e2e
@pytest.mark.slow
async def test_complete_marcus_workflow():
    """Test complete workflow from project creation to task completion"""

    # 1. Create project
    project = await create_project(sample_project_spec)

    # 2. Register agent
    agent_result = await register_agent(sample_agent_spec)

    # 3. Request first task
    task_assignment = await request_next_task(agent_result.agent_id)

    # 4. Report progress
    await report_progress({
        "task_id": task_assignment.task_id,
        "progress": 50,
        "status": "in_progress"
    })

    # 5. Complete task
    completion = await finish_task({
        "task_id": task_assignment.task_id,
        "status": "completed"
    })

    # Verify complete workflow
    assert project.created is True
    assert agent_result.registered is True
    assert task_assignment.assigned is True
    assert completion.completed is True

    # Verify system state
    project_state = await get_project_status(project.project_id)
    assert project_state.tasks_completed == 1
    assert project_state.agents_active == 1

This comprehensive end-to-end test validates the entire Marcus workflow, ensuring all systems work together correctly and that the testing framework provides visibility into each phase of the autonomous agent lifecycle.

The Marcus Testing Framework represents a sophisticated approach to testing autonomous agent systems, providing the structure, tools, and patterns necessary to ensure reliable operation in complex, multi-agent environments. Its deep integration with Marcus’s core systems, combined with its support for TDD and real-world testing scenarios, makes it an essential component of the Marcus ecosystem.