Setting Up Local LLMs with Marcus#

Run Marcus end-to-end on your own hardware โ€” no API keys, no usage costs. This guide covers picking models, configuring Ollama, and running enough capacity to keep multiple agents busy in parallel.

What you need to set up#

Marcus is multi-agent. Two distinct LLM roles must both work:

Role

What it does

Hard requirement

Planner

Decomposes a project description into a task graph on the board. Marcus calls it once at create_project time.

Strong instruction-following + structured-output reliability

Workers

Actual coding agents (Claude Code, Codex, Aider, custom). Each pulls tasks from the board and writes code.

Must support tool / function calling โ€” Marcus and MCP both depend on it

โš ๏ธ Worker models without tool-calling will silently fail. They canโ€™t invoke request_next_task, report_task_progress, log_artifact, etc. If you pick a worker model, verify it advertises tool-calling support on its model card.

Quick start#

1. Install Ollama#

curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download

2. Pull a model#

# Apple Silicon โ€” best dual-role pick (planner + workers)
ollama pull qwen3.5:35b-a3b-coding-nvfp4

# Or, the lowest known-working planner for modest hardware
ollama pull qwen2.5-coder:7b

# Or, a larger planner option
ollama pull ministral:14b

3. Point Marcus at it#

Edit config_marcus.json:

{
  "ai": {
    "provider": "local",
    "enabled": true,
    "local_model": "qwen3.5:35b-a3b-coding-nvfp4",
    "local_url": "http://localhost:11434/v1",
    "local_key": "none"
  }
}

Or override with environment variables (these win over config_marcus.json):

export MARCUS_LLM_PROVIDER=local
export MARCUS_LOCAL_LLM_PATH=qwen3.5:35b-a3b-coding-nvfp4
export MARCUS_LOCAL_LLM_URL=http://localhost:11434/v1

4. Start Marcus#

./marcus start
./marcus board   # check tasks land on the board

5. Wire your workers#

Each worker is a coding agent โ€” most commonly Claude Code, but any MCP-compatible agent works. Point each worker at its own Ollama endpoint (see โ€œRunning multiple workers in parallelโ€ above) and confirm the model supports tool calling.

Complete configuration example#

{
  "auto_find_board": false,
  "kanban": {
    "provider": "sqlite",
    "sqlite_db_path": "./data/kanban.db",
    "sqlite_attachments_dir": "./data/attachments"
  },
  "ai": {
    "provider": "local",
    "enabled": true,
    "local_model": "qwen2.5-coder:7b",
    "local_url": "http://localhost:11434/v1",
    "local_key": "none",
    "anthropic_api_key": "",
    "openai_api_key": ""
  },
  "features": {
    "events": true,
    "context": true,
    "memory": false,
    "visibility": false
  }
}

Advanced#

Non-Ollama OpenAI-compatible servers#

Anything that speaks the OpenAI API works (llama.cpp server, LocalAI, text-generation-webui, vLLM):

{
  "ai": {
    "provider": "local",
    "local_model": "your-model",
    "local_url": "http://localhost:8080/v1",
    "local_key": "your-api-key-if-needed"
  }
}

Configuration priority#

  1. Environment variables (MARCUS_*)

  2. config_marcus.json

  3. Built-in defaults

Ollama performance knobs#

export OLLAMA_NUM_CTX=8192        # bigger context window
export OLLAMA_NUM_PARALLEL=4      # concurrent requests per instance
export OLLAMA_KEEP_ALIVE=30m      # keep model resident between calls

Local-provider request timeout is 120s by default.

Switching back to cloud#

export MARCUS_LLM_PROVIDER=anthropic   # or openai

Or set "ai.provider" in config_marcus.json.

Troubleshooting#

Failed to connect to local LLM server

  • ollama list โ€” is Ollama actually running?

  • curl http://localhost:11434/api/tags โ€” does it answer?

  • Did you pull the model? ollama pull <model>

Worker silently does nothing / never calls request_next_task

  • The model likely lacks tool-calling support. Switch to a model whose card lists Tools as a capability.

Plans come back malformed / empty

  • Your planner model is too small or too quantized. Try qwen2.5-coder:7b at Q5 minimum.

Second worker stalls when first is busy

  • One Ollama instance, no parallelism. Set OLLAMA_NUM_PARALLEL or run a second ollama serve on a different port.

Slow responses

  • Smaller model, GPU acceleration, lower max_tokens, or reduce OLLAMA_NUM_CTX.

Why local#

  • Privacy โ€” code never leaves your machine

  • Cost โ€” zero per-token charges, run as many experiments as you want

  • Offline โ€” works on a plane

  • Reproducibility โ€” pin a quantization, get the same outputs

Next#

  • Browse good first issue and try a contribution end-to-end on local models.

  • See Configuration Reference for every option.

  • See PROTOCOL.md if youโ€™re building a worker runner for a non-Claude agent.