What actually works when you wire MCP tools into local LLMs.
We run a multi-agent production stack — Mnemo Cortex for memory, FrankenClaw-style MCP servers for tools, agents on Claude Desktop, Claude Code, OpenClaw, and Ollama hosts. Every line below is something we've actually run on our own machines and verified the result of. No marketing benchmarks. No "works on the author's laptop" tutorials.
Updated 2026-04-27. We add to this when we test something new. If a host or model is missing, we haven't run it yet — we won't fake the row.
"PASS" means we sent a real mnemo_save through the host, then verified the row landed in our memory store. "FAKE" means the model narrated a convincing answer but never actually called the tool — the most expensive failure mode in this whole space.
| Host | Status | Setup notes |
|---|---|---|
| Claude Desktop | PASS | Native MCP. Drag-and-drop .mcpb bundle is the gold standard for non-developers. Our bundle is submitted to Anthropic's Connectors Directory. |
| LM Studio | PASS | Native MCP since v0.3.17. Edit %USERPROFILE%\.lmstudio\mcp.json (entry shape sketched below the table), restart, done. With Qwen3-8B it just works — tools fire when the model decides they're useful, no prefix or toggle. |
| AnythingLLM | PASS w/ Automatic mode | Two steps: drop the config in anythingllm_mcp_servers.json, then flip the workspace to Automatic chat mode (Settings → Chat Settings). Without that flip, every tool call requires an @agent prefix — users don't type it, so memory never fires. Per-workspace setting, off by default. |
| Ollama + MCPHost | PASS (Windows: console only) | Ollama has no native MCP yet (issue #7865); MCPHost bridges (invocation below the table). Works fine in a real terminal on macOS / Linux. On Windows, mcphost.exe output buffers until exit when spawned over SSH — run it from a real PowerShell / cmd window or it appears to hang. |
| OpenClaw | PASS | What our production agent (Rocky) runs on. MCP via stdio servers in openclaw.json's mcp.servers block. Multi-server orchestration is solid; gateway respawns servers cleanly across upgrades. |
| Open WebUI | UNTESTED | Documented native MCP support. We have setup snippets in our Mnemo README; haven't run end-to-end ourselves yet. |
| llama.cpp | UNTESTED | llama-server --mcp-config is documented; we haven't run a load test. |
| LobeChat / Jan | UNTESTED | Both have MCP plugin / extension support. README has config blocks for them, but we haven't sat down and tested. |
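For the hosts above that take a JSON config file (LM Studio's mcp.json, AnythingLLM's anythingllm_mcp_servers.json, Claude Desktop's claude_desktop_config.json), the entry is the same mcpServers shape. A minimal sketch; the mnemo-cortex-mcp package name here is a placeholder, so substitute whatever command launches your stdio server:

```json
{
  "mcpServers": {
    "mnemo": {
      "command": "npx",
      "args": ["-y", "mnemo-cortex-mcp"]
    }
  }
}
```

For the Ollama row, MCPHost reads the same shape from its own config file. These flags match the MCPHost README as of our testing; check mcphost --help if they've moved:

```sh
mcphost -m ollama:qwen3:8b --config ~/.mcp.json
```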
Don't trust the model's text response. Have it save a unique marker phrase (MNEMO-PROBE-LOBSTER-7777-2026-04-27), then query your Mnemo server directly for that exact string. If it's there, the save was real. If not, the model faked it. Our verifier script does this in one line: ~/scripts/mnemo-verify-save.sh "<marker>".
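To script the probe, here is a minimal sketch of the same check. The search URL is an assumption; point it at however your memory store exposes search (HTTP endpoint, CLI, direct SQL). The load-bearing idea is grepping the store, never the chat transcript, for the marker:

```sh
#!/usr/bin/env sh
# Marker-phrase round trip: ask the model to save the exact string,
# then look for it in the store itself, never in the model's reply.
MARKER="MNEMO-PROBE-$(date +%Y-%m-%d)-$$"
echo "Tell the model: save this exact phrase -> $MARKER"
printf "Press Enter after the chat turn completes... "
read -r _
# Hypothetical search endpoint; substitute your store's real query path.
if curl -s "http://localhost:8765/search?q=$MARKER" | grep -q "$MARKER"; then
  echo "PASS: marker found in the store. The tool call was real."
else
  echo "FAKE: marker missing. The model narrated a save it never made."
fi
```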
Same bridge, same Mnemo server, same host (we used AnythingLLM in Automatic mode for the comparison). The only variable was the underlying model. The difference between PASS and FAKE here is "memory works" versus "memory looks like it works and doesn't."
| Model | Tool calls real? | Notes |
|---|---|---|
| qwen3:8b | YES | Real mnemo_save call with structured args. Saved row visible in artforge with full content. Our default recommendation for any local-LLM tool-calling setup right now. |
| llama3.1:8b | NO — fakes | Narrates "saved with id e4d3c9f3d1a2b8cd" but the tool was never invoked. The memory ID is hallucinated. Recall in a new session returns nothing matching. This is the failure mode that costs you trust without telling you it's failing. |
| qwen2.5-coder:32b · qwen2.5:32b · qwen2.5vl:7b | UNTESTED for tools | Live on our machines for other tasks. Haven't validated tool-calling round-trips on these specific quants yet. |
| gpt-oss:20b | UNTESTED | OpenAI's open-weight 20B. On disk on IGOR-2; haven't run the verifier against it. |
Smaller and older instruction-tuned models often "perform" tool use in their text output rather than emitting a real function-call structure. The host parser sees only prose and never invokes the tool. The user sees a confident "saved!" message and walks away thinking memory works. It doesn't. On a later recall, the model either makes up a result or admits it can't find anything; either way, you've lost the data you thought you saved.
The fix is the model, not the host. Pick a model with explicit native tool-calling support and verify with a marker phrase the first time you use it. If the marker's not in your memory store after the save, throw the model out.
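What the host parser actually needs is a structured tool_calls field, not prose. This is roughly the shape Ollama's /api/chat returns when a call is real (field names per Ollama's API; the argument payload is illustrative):

```json
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "mnemo_save",
          "arguments": { "content": "MNEMO-PROBE-LOBSTER-7777-2026-04-27" }
        }
      }
    ]
  }
}
```

A faking model returns the same message with no tool_calls at all and the "saved!" story in content, which is exactly why the marker check queries the store instead of the transcript.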
Rust CLI, sub-200ms latency, ~14k GitHub stars. Six granular actions cover every browser task we've thrown at it: open, snapshot, click, fill, screenshot, eval. The model picks the action directly — no keyword detector in the middle to misclassify a request. Ships native binaries for Linux x64 and macOS.
```sh
npm install -g agent-browser
agent-browser install --with-deps
```
Repo: vercel-labs/agent-browser. We integrated it into our MCP tool server; the migration replaced ~300 lines of CDP/websocket plumbing with subprocess calls.
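A typical session, assuming the six actions map to subcommands the way the repo README presents them (argument shapes may differ between releases; treat this as the flow, not the spec):

```sh
agent-browser open https://example.com      # navigate
agent-browser snapshot                      # dump the page with element refs
agent-browser click @e14                    # act on a ref from the snapshot
agent-browser fill @e7 "search query"       # type into a field
agent-browser screenshot /tmp/page.png      # capture evidence
agent-browser eval "document.title"         # run JS in the page
```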
Open-source (MIT), ~12k stars, ships its own MCP server. Pure-vision approach — screenshots only, no DOM required for actions. Model-agnostic; works with Qwen3-VL, Gemini, UI-TARS, others. We wired it into Rocky's openclaw.json as a second MCP server (midscene-web via npx -y @midscene/web-bridge-mcp) with gemini-3-flash-preview as the vision model. Different shape than agent-browser — better for tasks where the DOM is hostile (heavy SPAs, anti-bot sites that block CDP).
Repo: web-infra-dev/midscene.
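A sketch of how two stdio servers sit side by side in openclaw.json's mcp.servers block. The mnemo-cortex-mcp command is a placeholder, and MIDSCENE_MODEL_NAME follows Midscene's documented env-var convention; verify both against the respective docs:

```json
{
  "mcp": {
    "servers": {
      "mnemo": {
        "command": "npx",
        "args": ["-y", "mnemo-cortex-mcp"]
      },
      "midscene-web": {
        "command": "npx",
        "args": ["-y", "@midscene/web-bridge-mcp"],
        "env": { "MIDSCENE_MODEL_NAME": "gemini-3-flash-preview" }
      }
    }
  }
}
```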
CDP-based headless browser, fast on the wire, but our experience over months: timeouts under load, fragile session reattach, stalls on real-world pages. Worked for example.com smoke tests; broke under client-site complexity. We kept the binary on disk — agent-browser supports it via --engine lightpanda — but Chrome for Testing is the default now.
Cloud-hosted automation service. Well-built. Doesn't match self-hosted/local-first requirements; we cared about every browser session running on hardware we control.
Reddit. Hacker News (lighter anti-bot, but it still trips agents up). Anything hiding behind Cloudflare Turnstile or hCaptcha. We pulled browser_agent from our research-agent's tool list after multiple MidStreamFallbackError crashes traced back to oversized DOM blobs returned from Reddit anti-bot pages. If your task involves scraping any of these, the answer is "use the public API, the RSS feed, or a different source," not a fancier browser.
What our search tool calls. Free tier is generous, MCP-friendly response shape, ranks results by source quality rather than just keyword density. Better defaults for agent use than vanilla Google or DuckDuckGo scraping.
Installed on our IGOR machine for both Claude Code and Claude Desktop via their MCP entry. Cleaner page extraction than rolling our own parser; respects robots.txt; handles JS-rendered pages. Free tier covers most of our research workflow.
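Both ship npm-packaged MCP servers, so the Claude Desktop / Claude Code entry is two more rows in the same mcpServers block (package and env-var names as of their current READMEs; check upstream before copying):

```json
{
  "mcpServers": {
    "tavily": {
      "command": "npx",
      "args": ["-y", "tavily-mcp"],
      "env": { "TAVILY_API_KEY": "tvly-..." }
    },
    "firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": { "FIRECRAWL_API_KEY": "fc-..." }
    }
  }
}
```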
If your agent is asking "what's the news on X" or "summarize this article" — one Tavily call plus one Firecrawl call is the right shape. If you're trying to scrape a site that doesn't want to be scraped, no amount of tooling makes that a sustainable workflow. Find an API, an RSS feed, a partner relationship, or pick a different source.
This is our day job — Mnemo Cortex is the persistent memory layer behind every Sparks agent. Three properties matter for anyone wiring memory into an MCP setup:
- Local-first: embeddings run on your own hardware (nomic-embed-text by default). $0 to run. Your memory data never leaves your machine unless you tell it to.
- Automatic conversation memory: Mnemo Cortex captures what agents said, decided, and learned during sessions.
- Curated manual memory: mnemo-plan handles the other half, a folder of markdown files in Git that you write and curate yourself (project specs, active task lists, decision logs, architecture notes). Any LLM that can call the Mnemo MCP tools can read and edit them at session start, before the first message lands.
The same MCP bridge auto-detects both. Cortex tools always register; mnemo-plan tools light up only if you've set BRAIN_DIR on disk. Template repo on GitHub.
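Enabling the manual half is one env var in the same server entry. A sketch, again with a placeholder command; the only load-bearing part is BRAIN_DIR pointing at your markdown folder:

```json
{
  "mcpServers": {
    "mnemo": {
      "command": "npx",
      "args": ["-y", "mnemo-cortex-mcp"],
      "env": { "BRAIN_DIR": "/home/you/brain" }
    }
  }
}
```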
Detailed install, hosts, and setup matrix: projectsparks.ai/mnemo-cortex. Source: github.com/GuyMannDude/mnemo-cortex.
This isn't a benchmark suite. It's the evidence trail of running a multi-agent setup that does real work for us every day, on our IGOR and IGOR-2 machines.
When we say "we tested AnythingLLM with qwen3:8b on Windows" — that's IGOR-2, in front of a real keyboard, with a real chat session and a real save row landing in artforge. Not a synthetic benchmark; not someone else's reproduction.