Ollama x Claude Code

#46
by josefacero01 - opened

Can we use your model inside claude code instead of sonnet or opus? and how does this model compare to the current sonnet if ever

Yes, you can wire it into Claude Code β€” I've run this end-to-end locally. Two pieces:

  1. Serve the GGUF with the native tool parser (the --jinja flag is what makes tool-calls come back as proper
    structured tool_calls instead of leaking as raw text):

llama-server -m gemma-4-12B-...-Q8_0.gguf --jinja -ngl 99 -fa on -c 8192 --port 8080

  1. Put a LiteLLM proxy in front to translate Anthropic ⇄ OpenAI, then point Claude Code at it:

config.yaml

model_list:
- model_name: "*"
litellm_params:
model: openai/local
api_base: http://localhost:8080/v1
api_key: dummy

litellm --config config.yaml --port 4000

then, for Claude Code:

ANTHROPIC_BASE_URL=http://localhost:4000
ANTHROPIC_AUTH_TOKEN=dummy
ANTHROPIC_MODEL=local
claude

(On Windows, run litellm.exe with PYTHONUTF8=1 set first β€” otherwise the startup banner crashes on cp1252.)

Now the honest part β€” how it compares to Sonnet: it doesn't, and it's not meant to. This is a 12B model; Sonnet/Opus
are vastly larger. For broad reasoning, large-context refactors, or "drive my whole repo autonomously," current Sonnet
will be clearly better. Where this model earns its place is: running fully local / offline / private, at zero API
cost, and on focused agentic/coding tasks (its tau2-bench telecom score went from ~15% to ~55%). Realistic use is as a
local worker for scoped tasks, or for experimentation β€” not as a drop-in Sonnet replacement.

Two things to watch: on trivial turns it tends to over-think (slower than you'd expect for a small model β€” I'm
addressing this in v3), and like other small local models it can sometimes claim it ran a command and report a result
it didn't actually produce, so verify its outputs. If you mainly want a Sonnet-level experience, use Sonnet; if you
want a capable local model you fully own, this is worth a try.

Sign up or log in to comment