Setting Up Local LLMs with Marcus#
Run Marcus end-to-end on your own hardware โ no API keys, no usage costs. This guide covers picking models, configuring Ollama, and running enough capacity to keep multiple agents busy in parallel.
What you need to set up#
Marcus is multi-agent. Two distinct LLM roles must both work:
Role |
What it does |
Hard requirement |
|---|---|---|
Planner |
Decomposes a project description into a task graph on the board. Marcus calls it once at |
Strong instruction-following + structured-output reliability |
Workers |
Actual coding agents (Claude Code, Codex, Aider, custom). Each pulls tasks from the board and writes code. |
Must support tool / function calling โ Marcus and MCP both depend on it |
โ ๏ธ Worker models without tool-calling will silently fail. They canโt invoke
request_next_task,report_task_progress,log_artifact, etc. If you pick a worker model, verify it advertises tool-calling support on its model card.
Recommended models#
๐ Top pick for Apple Silicon โ one model, both roles#
qwen3.5:35b-a3b-coding-nvfp4 runs comfortably on a 16GB+ M-series Mac and serves as both planner and worker. NVFP4 quantization is tuned for Apple Silicon โ strong code generation, reliable structured output, and tool-calling support. If youโre on a Mac, start here and skip the rest of the matrix.
ollama pull qwen3.5:35b-a3b-coding-nvfp4
Capacity on 16GB unified memory: 1 planner + ~2 workers concurrently.
Planner โ verified working#
Model |
Quantization |
Notes |
|---|---|---|
|
NVFP4 |
Best on Apple Silicon. Doubles as worker. |
|
Q4 or Q5 |
Lowest known-working planner. Reliable on modest hardware. |
|
Q4+ |
Larger planner option โ better task decomposition on complex projects. |
|
Q4+ |
Higher-quality plans when you have RAM to spare. |
Anything below 7B has not produced reliable plans in our testing.
Workers โ must support tool calling#
Model |
Notes |
|---|---|
|
Best on Apple Silicon. Same model can serve the planner. |
|
Tool-calling supported, strong code generation. |
|
Tool-calling supported. |
Hosted Claude / GPT via the worker agent itself |
The easiest path โ let Claude Code or Codex use their normal models. |
If youโre unsure whether a model supports tool calling, check the Ollama model page for โToolsโ in the capabilities list.
Running multiple workers in parallel#
One Ollama process serves requests serially per model. If two workers ask the same ollama instance for completions at the same time, the second request waits. To get real parallelism:
Option A โ multiple Ollama instances. Launch additional
ollama serveprocesses on different ports (OLLAMA_HOST=127.0.0.1:11435 ollama serve, then point a worker at:11435). One instance per concurrent worker.Option B โ
OLLAMA_NUM_PARALLEL. Setexport OLLAMA_NUM_PARALLEL=4before starting Ollama to let a single instance handle multiple requests concurrently. Each parallel slot uses additional VRAM โ verify you have headroom.Option C โ fewer workers. If hardware is tight, run 1 planner + 2 workers. Most coordination value shows up before you saturate the box.
Rule of thumb: 16GB unified memory โ 1 planner + 2 workers. 32GB+ โ 4+ workers comfortably.
Quick start#
1. Install Ollama#
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
2. Pull a model#
# Apple Silicon โ best dual-role pick (planner + workers)
ollama pull qwen3.5:35b-a3b-coding-nvfp4
# Or, the lowest known-working planner for modest hardware
ollama pull qwen2.5-coder:7b
# Or, a larger planner option
ollama pull ministral:14b
3. Point Marcus at it#
Edit config_marcus.json:
{
"ai": {
"provider": "local",
"enabled": true,
"local_model": "qwen3.5:35b-a3b-coding-nvfp4",
"local_url": "http://localhost:11434/v1",
"local_key": "none"
}
}
Or override with environment variables (these win over config_marcus.json):
export MARCUS_LLM_PROVIDER=local
export MARCUS_LOCAL_LLM_PATH=qwen3.5:35b-a3b-coding-nvfp4
export MARCUS_LOCAL_LLM_URL=http://localhost:11434/v1
4. Start Marcus#
./marcus start
./marcus board # check tasks land on the board
5. Wire your workers#
Each worker is a coding agent โ most commonly Claude Code, but any MCP-compatible agent works. Point each worker at its own Ollama endpoint (see โRunning multiple workers in parallelโ above) and confirm the model supports tool calling.
Complete configuration example#
{
"auto_find_board": false,
"kanban": {
"provider": "sqlite",
"sqlite_db_path": "./data/kanban.db",
"sqlite_attachments_dir": "./data/attachments"
},
"ai": {
"provider": "local",
"enabled": true,
"local_model": "qwen2.5-coder:7b",
"local_url": "http://localhost:11434/v1",
"local_key": "none",
"anthropic_api_key": "",
"openai_api_key": ""
},
"features": {
"events": true,
"context": true,
"memory": false,
"visibility": false
}
}
Advanced#
Non-Ollama OpenAI-compatible servers#
Anything that speaks the OpenAI API works (llama.cpp server, LocalAI, text-generation-webui, vLLM):
{
"ai": {
"provider": "local",
"local_model": "your-model",
"local_url": "http://localhost:8080/v1",
"local_key": "your-api-key-if-needed"
}
}
Configuration priority#
Environment variables (
MARCUS_*)config_marcus.jsonBuilt-in defaults
Ollama performance knobs#
export OLLAMA_NUM_CTX=8192 # bigger context window
export OLLAMA_NUM_PARALLEL=4 # concurrent requests per instance
export OLLAMA_KEEP_ALIVE=30m # keep model resident between calls
Local-provider request timeout is 120s by default.
Switching back to cloud#
export MARCUS_LLM_PROVIDER=anthropic # or openai
Or set "ai.provider" in config_marcus.json.
Troubleshooting#
Failed to connect to local LLM server
ollama listโ is Ollama actually running?curl http://localhost:11434/api/tagsโ does it answer?Did you pull the model?
ollama pull <model>
Worker silently does nothing / never calls request_next_task
The model likely lacks tool-calling support. Switch to a model whose card lists Tools as a capability.
Plans come back malformed / empty
Your planner model is too small or too quantized. Try
qwen2.5-coder:7bat Q5 minimum.
Second worker stalls when first is busy
One Ollama instance, no parallelism. Set
OLLAMA_NUM_PARALLELor run a secondollama serveon a different port.
Slow responses
Smaller model, GPU acceleration, lower
max_tokens, or reduceOLLAMA_NUM_CTX.
Why local#
Privacy โ code never leaves your machine
Cost โ zero per-token charges, run as many experiments as you want
Offline โ works on a plane
Reproducibility โ pin a quantization, get the same outputs
Next#
Browse
good first issueand try a contribution end-to-end on local models.See Configuration Reference for every option.
See PROTOCOL.md if youโre building a worker runner for a non-Claude agent.