# Setting Up Local LLMs with Marcus Run Marcus end-to-end on your own hardware — no API keys, no usage costs. This guide covers picking models, configuring Ollama, and running enough capacity to keep multiple agents busy in parallel. ## What you need to set up Marcus is multi-agent. Two distinct LLM roles must both work: | Role | What it does | Hard requirement | |------|--------------|------------------| | **Planner** | Decomposes a project description into a task graph on the board. Marcus calls it once at `create_project` time. | Strong instruction-following + structured-output reliability | | **Workers** | Actual coding agents (Claude Code, Codex, Aider, custom). Each pulls tasks from the board and writes code. | **Must support tool / function calling** — Marcus and MCP both depend on it | > ⚠️ **Worker models without tool-calling will silently fail.** They can't invoke `request_next_task`, `report_task_progress`, `log_artifact`, etc. If you pick a worker model, verify it advertises tool-calling support on its model card. ## Recommended models ### 🏆 Top pick for Apple Silicon — one model, both roles **`qwen3.5:35b-a3b-coding-nvfp4`** runs comfortably on a 16GB+ M-series Mac and serves as **both planner and worker**. NVFP4 quantization is tuned for Apple Silicon — strong code generation, reliable structured output, and tool-calling support. If you're on a Mac, start here and skip the rest of the matrix. ```bash ollama pull qwen3.5:35b-a3b-coding-nvfp4 ``` Capacity on 16GB unified memory: 1 planner + ~2 workers concurrently. ### Planner — verified working | Model | Quantization | Notes | |-------|--------------|-------| | `qwen3.5:35b-a3b-coding-nvfp4` | NVFP4 | **Best on Apple Silicon.** Doubles as worker. | | `qwen2.5-coder:7b` | **Q4 or Q5** | **Lowest known-working planner.** Reliable on modest hardware. | | `ministral:14b` (Ministral-3-14B) | Q4+ | Larger planner option — better task decomposition on complex projects. | | `qwen2.5-coder:14b` | Q4+ | Higher-quality plans when you have RAM to spare. | Anything below 7B has not produced reliable plans in our testing. ### Workers — must support tool calling | Model | Notes | |-------|-------| | `qwen3.5:35b-a3b-coding-nvfp4` | **Best on Apple Silicon.** Same model can serve the planner. | | `qwen2.5-coder:7b` / `:14b` / `:32b` | Tool-calling supported, strong code generation. | | `deepseek-coder` (instruct variants) | Tool-calling supported. | | Hosted Claude / GPT via the worker agent itself | The easiest path — let Claude Code or Codex use their normal models. | If you're unsure whether a model supports tool calling, check the Ollama model page for "Tools" in the capabilities list. ### Running multiple workers in parallel **One Ollama process serves requests serially per model.** If two workers ask the same `ollama` instance for completions at the same time, the second request waits. To get real parallelism: - **Option A — multiple Ollama instances.** Launch additional `ollama serve` processes on different ports (`OLLAMA_HOST=127.0.0.1:11435 ollama serve`, then point a worker at `:11435`). One instance per concurrent worker. - **Option B — `OLLAMA_NUM_PARALLEL`.** Set `export OLLAMA_NUM_PARALLEL=4` before starting Ollama to let a single instance handle multiple requests concurrently. Each parallel slot uses additional VRAM — verify you have headroom. - **Option C — fewer workers.** If hardware is tight, run 1 planner + 2 workers. Most coordination value shows up before you saturate the box. Rule of thumb: 16GB unified memory → 1 planner + 2 workers. 32GB+ → 4+ workers comfortably. ## Quick start ### 1. Install Ollama ```bash curl -fsSL https://ollama.com/install.sh | sh # Or download from https://ollama.com/download ``` ### 2. Pull a model ```bash # Apple Silicon — best dual-role pick (planner + workers) ollama pull qwen3.5:35b-a3b-coding-nvfp4 # Or, the lowest known-working planner for modest hardware ollama pull qwen2.5-coder:7b # Or, a larger planner option ollama pull ministral:14b ``` ### 3. Point Marcus at it Edit `config_marcus.json`: ```json { "ai": { "provider": "local", "enabled": true, "local_model": "qwen3.5:35b-a3b-coding-nvfp4", "local_url": "http://localhost:11434/v1", "local_key": "none" } } ``` Or override with environment variables (these win over `config_marcus.json`): ```bash export MARCUS_LLM_PROVIDER=local export MARCUS_LOCAL_LLM_PATH=qwen3.5:35b-a3b-coding-nvfp4 export MARCUS_LOCAL_LLM_URL=http://localhost:11434/v1 ``` ### 4. Start Marcus ```bash ./marcus start ./marcus board # check tasks land on the board ``` ### 5. Wire your workers Each worker is a coding agent — most commonly Claude Code, but any MCP-compatible agent works. Point each worker at its own Ollama endpoint (see "Running multiple workers in parallel" above) and confirm the model supports tool calling. ## Complete configuration example ```json { "auto_find_board": false, "kanban": { "provider": "sqlite", "sqlite_db_path": "./data/kanban.db", "sqlite_attachments_dir": "./data/attachments" }, "ai": { "provider": "local", "enabled": true, "local_model": "qwen2.5-coder:7b", "local_url": "http://localhost:11434/v1", "local_key": "none", "anthropic_api_key": "", "openai_api_key": "" }, "features": { "events": true, "context": true, "memory": false, "visibility": false } } ``` ## Advanced ### Non-Ollama OpenAI-compatible servers Anything that speaks the OpenAI API works (`llama.cpp` server, LocalAI, text-generation-webui, vLLM): ```json { "ai": { "provider": "local", "local_model": "your-model", "local_url": "http://localhost:8080/v1", "local_key": "your-api-key-if-needed" } } ``` ### Configuration priority 1. Environment variables (`MARCUS_*`) 2. `config_marcus.json` 3. Built-in defaults ### Ollama performance knobs ```bash export OLLAMA_NUM_CTX=8192 # bigger context window export OLLAMA_NUM_PARALLEL=4 # concurrent requests per instance export OLLAMA_KEEP_ALIVE=30m # keep model resident between calls ``` Local-provider request timeout is 120s by default. ### Switching back to cloud ```bash export MARCUS_LLM_PROVIDER=anthropic # or openai ``` Or set `"ai.provider"` in `config_marcus.json`. ## Troubleshooting **`Failed to connect to local LLM server`** - `ollama list` — is Ollama actually running? - `curl http://localhost:11434/api/tags` — does it answer? - Did you pull the model? `ollama pull ` **Worker silently does nothing / never calls `request_next_task`** - The model likely lacks tool-calling support. Switch to a model whose card lists Tools as a capability. **Plans come back malformed / empty** - Your planner model is too small or too quantized. Try `qwen2.5-coder:7b` at Q5 minimum. **Second worker stalls when first is busy** - One Ollama instance, no parallelism. Set `OLLAMA_NUM_PARALLEL` or run a second `ollama serve` on a different port. **Slow responses** - Smaller model, GPU acceleration, lower `max_tokens`, or reduce `OLLAMA_NUM_CTX`. ## Why local - **Privacy** — code never leaves your machine - **Cost** — zero per-token charges, run as many experiments as you want - **Offline** — works on a plane - **Reproducibility** — pin a quantization, get the same outputs ## Next - Browse [`good first issue`](https://github.com/lwgray/marcus/labels/good%20first%20issue) and try a contribution end-to-end on local models. - See [Configuration Reference](../developer/configuration.md) for every option. - See [PROTOCOL.md](../../../PROTOCOL.md) if you're building a worker runner for a non-Claude agent.