What actually works when you wire MCP tools into local LLMs.
We run a multi-agent production stack — Mnemo Cortex for memory, FrankenClaw-style MCP servers for tools, agents on Claude Desktop, Claude Code, OpenClaw, and Ollama hosts. Every line below is something we've actually run on our own machines and verified the result of. No marketing benchmarks. No "works on the author's laptop" tutorials.
Updated 2026-04-27. We add to this when we test something new. If a host or model is missing, we haven't run it yet — we won't fake the row.
"PASS" means we sent a real mnemo_save through the host, then verified the row landed in our memory store. "FAKE" means the model narrated a convincing answer but never actually called the tool — the most expensive failure mode in this whole space.
| Host | Status | Setup notes |
|---|---|---|
| Claude Desktop | PASS | Native MCP. Drag-and-drop .mcpb bundle is the gold standard for non-developers. Our bundle is submitted to Anthropic's Connectors Directory. |
| LM Studio | PASS | Native MCP since v0.3.17. Edit %USERPROFILE%\.lmstudio\mcp.json (entry shape sketched below the table), restart, done. With Qwen3-8B it just works — tools fire when the model decides they're useful, no prefix or toggle. |
| AnythingLLM | PASS w/ Automatic mode | Two steps: drop the config in anythingllm_mcp_servers.json, then flip the workspace to Automatic chat mode (Settings → Chat Settings). Without that flip, every tool call requires an @agent prefix — users don't type it, so memory never fires. Per-workspace setting, off by default. |
| Ollama + MCPHost | PASS (Windows: console only) | Ollama has no native MCP yet (issue #7865); MCPHost bridges (invocation below the table). Works fine in a real terminal on macOS / Linux. On Windows, mcphost.exe output buffers until exit when spawned over SSH — run it from a real PowerShell / cmd window or it appears to hang. |
| OpenClaw | PASS | What our production agent (Rocky) runs on. MCP via stdio servers in openclaw.json's mcp.servers block. Multi-server orchestration is solid; gateway respawns servers cleanly across upgrades. |
| Open WebUI | UNTESTED | Documented native MCP support. We have setup snippets in our Mnemo README; haven't run end-to-end ourselves yet. |
| llama.cpp | UNTESTED | llama-server --mcp-config is documented; we haven't run a load test. |
| LobeChat / Jan | UNTESTED | Both have MCP plugin / extension support. README has config blocks for them, but we haven't sat down and tested. |
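For the hosts above that take a JSON config file (LM Studio's mcp.json, AnythingLLM's anythingllm_mcp_servers.json, Claude Desktop's claude_desktop_config.json), the entry is the same mcpServers shape. A minimal sketch; the mnemo-cortex-mcp package name here is a placeholder, so substitute whatever command launches your stdio server:

```json
{
  "mcpServers": {
    "mnemo": {
      "command": "npx",
      "args": ["-y", "mnemo-cortex-mcp"]
    }
  }
}
```

For the Ollama row, MCPHost reads the same shape from its own config file. These flags match the MCPHost README as of our testing; check mcphost --help if they've moved:

```sh
mcphost -m ollama:qwen3:8b --config ~/.mcp.json
```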
Don't trust the model's text response. Have it save a unique marker phrase (MNEMO-PROBE-LOBSTER-7777-2026-04-27), then query your Mnemo server directly for that exact string. If it's there, the save was real. If not, the model faked it. Our verifier script does this in one line: ~/scripts/mnemo-verify-save.sh "<marker>".
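To script the probe, here is a minimal sketch of the same check. The search URL is an assumption; point it at however your memory store exposes search (HTTP endpoint, CLI, direct SQL). The load-bearing idea is grepping the store, never the chat transcript, for the marker:

```sh
#!/usr/bin/env sh
# Marker-phrase round trip: ask the model to save the exact string,
# then look for it in the store itself, never in the model's reply.
MARKER="MNEMO-PROBE-$(date +%Y-%m-%d)-$$"
echo "Tell the model: save this exact phrase -> $MARKER"
printf "Press Enter after the chat turn completes... "
read -r _
# Hypothetical search endpoint; substitute your store's real query path.
if curl -s "http://localhost:8765/search?q=$MARKER" | grep -q "$MARKER"; then
  echo "PASS: marker found in the store. The tool call was real."
else
  echo "FAKE: marker missing. The model narrated a save it never made."
fi
```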
Same bridge, same Mnemo server, same host (we used AnythingLLM in Automatic mode for the comparison). The only variable was the underlying model. The difference between PASS and FAKE here is "memory works" versus "memory looks like it works and doesn't."
| Model | Tool calls real? | Notes |
|---|---|---|
| qwen3:8b | YES | Real mnemo_save call with structured args. Saved row visible in artforge with full content. Our default recommendation for any local-LLM tool-calling setup right now. |
| llama3.1:8b | NO — fakes | Narrates "saved with id e4d3c9f3d1a2b8cd" but the tool was never invoked. The memory ID is hallucinated. Recall in a new session returns nothing matching. This is the failure mode that costs you trust without telling you it's failing. |
| qwen2.5-coder:32b · qwen2.5:32b · qwen2.5vl:7b | UNTESTED for tools | Live on our machines for other tasks. Haven't validated tool-calling round-trips on these specific quants yet. |
| gpt-oss:20b | UNTESTED | OpenAI's open-weight 20B. On disk on IGOR-2; haven't run the verifier against it. |
Smaller and older instruction-tuned models often "perform" tool use in their text output rather than emitting a real function-call structure. The host parser sees only prose and never invokes the tool. The user sees a confident "saved!" message and walks away thinking memory works. It doesn't. On a later recall, the model either makes up a result or admits it can't find anything; either way, you've lost the data you thought you saved.
The fix is the model, not the host. Pick a model with explicit native tool-calling support and verify with a marker phrase the first time you use it. If the marker's not in your memory store after the save, throw the model out.
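What the host parser actually needs is a structured tool_calls field, not prose. This is roughly the shape Ollama's /api/chat returns when a call is real (field names per Ollama's API; the argument payload is illustrative):

```json
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "mnemo_save",
          "arguments": { "content": "MNEMO-PROBE-LOBSTER-7777-2026-04-27" }
        }
      }
    ]
  }
}
```

A faking model returns the same message with no tool_calls at all and the "saved!" story in content, which is exactly why the marker check queries the store instead of the transcript.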
Rust CLI, sub-200ms latency, ~14k GitHub stars. Six granular actions cover every browser task we've thrown at it: open, snapshot, click, fill, screenshot, eval. The model picks the action directly — no keyword detector in the middle to misclassify a request. Ships native binaries for Linux x64 and macOS.
```sh
npm install -g agent-browser
agent-browser install --with-deps
```
Repo: vercel-labs/agent-browser. We integrated it into our MCP tool server; the migration replaced ~300 lines of CDP/websocket plumbing with subprocess calls.
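A typical session, assuming the six actions map to subcommands the way the repo README presents them (argument shapes may differ between releases; treat this as the flow, not the spec):

```sh
agent-browser open https://example.com      # navigate
agent-browser snapshot                      # dump the page with element refs
agent-browser click @e14                    # act on a ref from the snapshot
agent-browser fill @e7 "search query"       # type into a field
agent-browser screenshot /tmp/page.png      # capture evidence
agent-browser eval "document.title"         # run JS in the page
```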
Open-source (MIT), ~12k stars, ships its own MCP server. Pure-vision approach — screenshots only, no DOM required for actions. Model-agnostic; works with Qwen3-VL, Gemini, UI-TARS, others. We wired it into Rocky's openclaw.json as a second MCP server (midscene-web via npx -y @midscene/web-bridge-mcp) with gemini-3-flash-preview as the vision model. Different shape than agent-browser — better for tasks where the DOM is hostile (heavy SPAs, anti-bot sites that block CDP).
Repo: web-infra-dev/midscene.
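A sketch of how two stdio servers sit side by side in openclaw.json's mcp.servers block. The mnemo-cortex-mcp command is a placeholder, and MIDSCENE_MODEL_NAME follows Midscene's documented env-var convention; verify both against the respective docs:

```json
{
  "mcp": {
    "servers": {
      "mnemo": {
        "command": "npx",
        "args": ["-y", "mnemo-cortex-mcp"]
      },
      "midscene-web": {
        "command": "npx",
        "args": ["-y", "@midscene/web-bridge-mcp"],
        "env": { "MIDSCENE_MODEL_NAME": "gemini-3-flash-preview" }
      }
    }
  }
}
```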
CDP-based headless browser, fast on the wire, but our experience over months: timeouts under load, fragile session reattach, stalls on real-world pages. Worked for example.com smoke tests; broke under client-site complexity. We kept the binary on disk — agent-browser supports it via --engine lightpanda — but Chrome for Testing is the default now.
Cloud-hosted automation service. Well-built. Doesn't match self-hosted/local-first requirements; we cared about every browser session running on hardware we control.
Reddit. Hacker News (lighter anti-bot, but it still trips agents up). Anything hiding behind Cloudflare Turnstile or hCaptcha. We pulled browser_agent from our research-agent's tool list after multiple MidStreamFallbackError crashes traced back to oversized DOM blobs returned from Reddit anti-bot pages. If your task involves scraping any of these, the answer is "use the public API, the RSS feed, or a different source," not a fancier browser.
What our search tool calls. Free tier is generous, MCP-friendly response shape, ranks results by source quality rather than just keyword density. Better defaults for agent use than vanilla Google or DuckDuckGo scraping.
Installed on our IGOR machine for both Claude Code and Claude Desktop via their MCP entry. Cleaner page extraction than rolling our own parser; respects robots.txt; handles JS-rendered pages. Free tier covers most of our research workflow.
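Both ship npm-packaged MCP servers, so the Claude Desktop / Claude Code entry is two more rows in the same mcpServers block (package and env-var names as of their current READMEs; check upstream before copying):

```json
{
  "mcpServers": {
    "tavily": {
      "command": "npx",
      "args": ["-y", "tavily-mcp"],
      "env": { "TAVILY_API_KEY": "tvly-..." }
    },
    "firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": { "FIRECRAWL_API_KEY": "fc-..." }
    }
  }
}
```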
If your agent is asking "what's the news on X" or "summarize this article" — one Tavily call plus one Firecrawl call is the right shape. If you're trying to scrape a site that doesn't want to be scraped, no amount of tooling makes that a sustainable workflow. Find an API, an RSS feed, a partner relationship, or pick a different source.
This is our day job — Mnemo Cortex is the persistent memory layer behind every Sparks agent. Three properties matter for anyone wiring memory into an MCP setup:
- Local-first: embeddings run on your own hardware (nomic-embed-text by default). $0 to run. Your memory data never leaves your machine unless you tell it to.
- Automatic conversation memory: Mnemo Cortex captures what agents said, decided, and learned during sessions.
- Curated manual memory: mnemo-plan handles the other half, a folder of markdown files in Git that you write and curate yourself (project specs, active task lists, decision logs, architecture notes). Any LLM that can call the Mnemo MCP tools can read and edit them at session start, before the first message lands.
The same MCP bridge auto-detects both. Cortex tools always register; mnemo-plan tools light up only if you've set BRAIN_DIR on disk. Template repo on GitHub.
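Enabling the manual half is one env var in the same server entry. A sketch, again with a placeholder command; the only load-bearing part is BRAIN_DIR pointing at your markdown folder:

```json
{
  "mcpServers": {
    "mnemo": {
      "command": "npx",
      "args": ["-y", "mnemo-cortex-mcp"],
      "env": { "BRAIN_DIR": "/home/you/brain" }
    }
  }
}
```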
Detailed install, hosts, and setup matrix: projectsparks.ai/mnemo-cortex. Source: github.com/GuyMannDude/mnemo-cortex.
This isn't a benchmark suite. It's the evidence trail of running a multi-agent setup that does real work for us every day, on our IGOR and IGOR-2 machines.
When we say "we tested AnythingLLM with qwen3:8b on Windows" — that's IGOR-2, in front of a real keyboard, with a real chat session and a real save row landing in artforge. Not a synthetic benchmark; not someone else's reproduction.