How to use this model with Claude Code / Codex?
I tried to use this model with LM Studio and connect to it from Claude Code.
However, I get an error about the jinja prompt template.
Has anyone tested it with Claude Code and can share a good config?
Thanks a lot!
H.
Hey H., thanks for trying it! I actually set this up and tested it end-to-end just now, so here's a config I can confirm works.
First, the error itself: that jinja error is LM Studio's template engine ("minja") choking on this model's custom chat
template — it uses a thinking channel + Gemma 4's native tool-calling format, which relies on jinja features minja
doesn't fully support. So it fails before the model even runs. It's not your setup. The catch for Claude Code
specifically is that it leans entirely on tool-calling, which is exactly the part minja handles worst — so LM Studio
isn't the right server here.
What I tested and confirmed working: llama.cpp's llama-server with --jinja (its full jinja backend renders the native
template correctly) → a small proxy to translate Anthropic↔OpenAI → Claude Code CLI pointed at it.
- Serve the model:
llama-server -m gemma4-v2-Q4_K_M.gguf --jinja -ngl 99 --no-mmap -fa on --ctx-size 32768 --temp 1.0 --top-p 0.95 --top-k 64 --repeat-penalty 1.1 --host 127.0.0.1 --port 8080
- Bridge it (Claude Code speaks Anthropic's API, llama-server speaks OpenAI). I used LiteLLM with a one-line config
(model_name: "*" → api_base http://127.0.0.1:8080/v1), run it on port 4000. claude-code-router works too. (Windows
note: if LiteLLM crashes on startup with a UnicodeEncodeError, set PYTHONUTF8=1 / PYTHONIOENCODING=utf-8 first.)
- Point Claude Code at the proxy:
ANTHROPIC_BASE_URL=http://127.0.0.1:4000
ANTHROPIC_AUTH_TOKEN=dummy
ANTHROPIC_MODEL=gemma4-v2
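Before launching Claude Code, it can help to confirm the bridge answers in Anthropic's format. The following is a minimal sketch, not part of the tested setup: it assumes the proxy exposes an Anthropic-style /v1/messages route on port 4000 and accepts the dummy token as a Bearer header. The function names are hypothetical.

```python
import json
import urllib.request

PROXY_URL = "http://127.0.0.1:4000"  # proxy port from the steps above
MODEL = "gemma4-v2"                  # same value as ANTHROPIC_MODEL


def build_messages_request(prompt, base_url=PROXY_URL, model=MODEL, token="dummy"):
    """Build an Anthropic Messages API request aimed at the local proxy."""
    body = {
        "model": model,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",
            "authorization": "Bearer " + token,
        },
        method="POST",
    )


def extract_text(response):
    """Join the text blocks of an Anthropic-style response dict."""
    blocks = response.get("content", [])
    return "".join(b.get("text", "") for b in blocks if b.get("type") == "text")


def ask_proxy(prompt):
    """Send one prompt through the proxy (needs llama-server and the proxy running)."""
    with urllib.request.urlopen(build_messages_request(prompt), timeout=120) as resp:
        return extract_text(json.load(resp))
```

With both processes running, print(ask_proxy("Write is_prime in Python.")) should return model text. An HTTP error at this point suggests the problem is in the bridge rather than in Claude Code.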
Results on my machine: a plain coding question (write is_prime) came back with correct, clean code. An agentic task
("create fizzbuzz.py and run it") worked too — it drove the tools, created the file, and the code it wrote was
correct.
One honest heads-up though: Claude Code is about the heaviest harness there is (huge system prompt + a dozen tools),
which is genuinely hard for any 12B. In longer agent loops this model can get shaky — e.g. in my test it wrote a
correct file but then mis-reported its own run output, so trust-but-verify what it tells you. For smoother local
agentic coding on a 12B, a lighter agent like opencode tends to behave better. But the setup above does work — give it
a go and let me know how it lands.
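One way to do the trust-but-verify part is to re-run the generated file yourself and compare the real output with what the model reported. This is a minimal sketch under the assumption that the agent produced a standalone Python script; run_script and matches_claim are hypothetical helper names.

```python
import subprocess
import sys


def run_script(path, timeout=30):
    """Run a Python file with the current interpreter; return (exit_code, stdout)."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def matches_claim(path, claimed_output):
    """True if the script exits cleanly and prints what the model said it printed."""
    code, actual = run_script(path)
    return code == 0 and actual.strip() == claimed_output.strip()
```

For the fizzbuzz task, matches_claim("fizzbuzz.py", reported_text) returning False means the file and the model's summary disagree, and the file's real output is the one to believe.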
Hi, I noticed you mentioned that a lighter agent like opencode works better. I have been using LM-Studio as the backend and OpenCode as the agent and I keep getting this error:
"Error rendering prompt with jinja template: "Cannot call something that is not a function: got UndefinedValue".\n\nThis is usually an issue with the model's prompt template. If you are using a popular model, you can try to search the model under lmstudio-community, which will have fixed prompt templates. If you cannot find one, you are welcome to post this issue to our discord or issue tracker on GitHub. Alternatively, if you know how to write jinja templates, you can override the prompt template in My Models > model settings > Prompt Template."
I also saw you recommend llama.cpp, so I was wondering if there was gonna be any support for LM-Studio or Ollama.
Hi @TheZayBae — that UndefinedValue error is the same root cause as the earlier LM Studio jinja issue: LM Studio's
template engine (minja) doesn't fully implement the standard jinja2 features this model's template uses (the thinking
channel + Gemma 4's native tool-calling rely on .get(), namespace, etc., which minja doesn't support). It fails at
template-render time, before the model even runs — so it's not your config. It's also not something I can "fix" by
retraining: the model's template is standard for Gemma 4; the gap is on the client engine's side.
To your direct question — for an agent like OpenCode, which leans entirely on tool-calling, LM Studio is the wrong
backend, because tool rendering is exactly minja's weakest branch (even if you get past the chat error, tool calls
stay unreliable). The robust path — keep OpenCode, swap the backend to llama.cpp:
Serve with llama.cpp (full jinja backend + a real native-tool parser → clean structured tool_calls):
llama-server -m gemma4-v2-Q4_K_M.gguf --jinja -ngl 99 --no-mmap -fa on --ctx-size 65536 --temp 1.0 --top-p 0.95 --top-k 64 --repeat-penalty 1.1 --host 127.0.0.1 --port 8080
(OpenCode's system prompt is large, so give it room: 32k floor, 64k comfortable.)
Point OpenCode at it as an OpenAI-compatible provider (baseURL: http://127.0.0.1:8080/v1), and in opencode.json set
the tool parser to raw-function-call + json. That's what makes OpenCode read Gemma 4's native tool format instead of
expecting its own. Keep the tool set small and use clear parameter names; a 12B over-calls less that way.
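To check the "clean structured tool_calls" part independently of OpenCode, a small script can send one tool definition to the server and look at what comes back. This is a sketch, assuming the server from the command above is listening on 127.0.0.1:8080 and returns OpenAI-style chat completions; the read_file tool and the function names are hypothetical and exist only for this check.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/v1"  # host and port from the llama-server command

# Hypothetical tool, deliberately small with a clear parameter name.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File path"}},
            "required": ["path"],
        },
    },
}


def build_payload(prompt, model="gemma4-v2"):
    """OpenAI-style chat completion body offering a single tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
    }


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from an OpenAI-style response dict."""
    calls = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            fn = call.get("function", {})
            args = fn.get("arguments", "{}")
            # Arguments normally arrive as a JSON-encoded string.
            parsed = json.loads(args) if isinstance(args, str) else args
            calls.append((fn.get("name"), parsed))
    return calls


def ask_server(prompt, base_url=SERVER_URL):
    """Send the prompt to the running server and return any structured tool calls."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return extract_tool_calls(json.load(resp))
```

If ask_server("Read README.md") returns an empty list while the reply text contains something that looks like a tool call, the template or parser is not producing structured calls, and an agent on top of it will likely misbehave too.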
On LM Studio / Ollama support specifically:
- LM Studio: you can override the prompt template under My Models → model settings → Prompt Template (which is what
the error suggests), and I can share a minja-safe chat-only version. But I'd be upfront that it only fixes plain chat;
it won't make tool-calling reliable in minja, so it won't get OpenCode working.
- Ollama: works if you hand it the official Gemma 4 Go template (ollama pull gemma4 then ollama show --modelfile
gemma4, copy the TEMPLATE block) and pin num_ctx. But again, for a tool-driven agent, llama.cpp --jinja is the most
reliable.
Bottom line: it's a client template-engine limitation, not something living in the weights — and llama.cpp --jinja is
the one backend I've confirmed returns clean structured tool calls for this model. Give that a go with OpenCode and
let me know how it lands.
Thank you so much for responding so fast! I look forward to using this model once I get everything all set up!
@quaestor Fair challenge — and no, no fork or custom build, it's stock OpenCode. The reason you couldn't find it is on me: I gave it as shorthand ("raw-function-call + json") instead of the exact key. It's an option on the @ai-sdk/openai-compatible provider, nested under options.toolParser, and it's an array. Here's the verbatim block:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "toolParser": [
          { "type": "raw-function-call" },
          { "type": "json" }
        ]
      },
      "models": {
        "gemma4-v2": {
          "name": "Gemma 4 12B v2",
          "tool_call": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "llama/gemma4-v2"
}
What it does: raw-function-call rewrites the tools into the legacy function-call format Gemma 4 emits natively, and json recovers any tool calls that come back as plain text — together that's what makes OpenCode read Gemma 4's native tool format instead of expecting its own.
Since it's a provider option for @ai-sdk/openai-compatible it's only lightly documented, which is why it doesn't turn up easily — the clearest concrete reference is this Gemma‑4‑on‑llama.cpp gist that uses the exact block: https://gist.github.com/daniel-farina/87dc1c394b94e45bb700d27e9ea03193 (and OpenCode's config/providers docs for the surrounding structure). Sorry for the run‑around — that snippet is the precise thing.
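Because the nesting is easy to get wrong by hand, a quick structural check before starting OpenCode can save a round trip. The sketch below only checks the shape shown in the block above (baseURL and an array-valued toolParser under options, and a top-level model that points at an existing provider/model pair). It is not OpenCode's own validation, and check_opencode_config is a hypothetical name.

```python
import json


def check_opencode_config(text):
    """Return a list of problems found in an opencode.json string (empty if none)."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(cfg, dict):
        return ["top level must be a JSON object"]

    problems = []
    providers = cfg.get("provider", {})
    for name, prov in providers.items():
        options = prov.get("options", {})
        if "baseURL" not in options:
            problems.append(f"provider.{name}.options.baseURL is missing")
        parser = options.get("toolParser")
        # The post describes toolParser as an array of objects with a 'type' key.
        if parser is not None and not (
            isinstance(parser, list)
            and all(isinstance(p, dict) and "type" in p for p in parser)
        ):
            problems.append(
                f"provider.{name}.options.toolParser must be an array of objects with a 'type' key"
            )

    # "model" has the form "<provider>/<model>"; both halves must exist above.
    provider_id, _, model_id = cfg.get("model", "").partition("/")
    if model_id not in providers.get(provider_id, {}).get("models", {}):
        problems.append(f"model {cfg.get('model')!r} does not match any provider/models entry")
    return problems
```

Calling check_opencode_config(open("opencode.json").read()) on the block above should return an empty list; a string-valued toolParser or a mistyped model id shows up as a one-line message instead of a silent misconfiguration.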
TYSM! JSON is just terrible for human-generated deeply-nested docs like this. It's basically just guesswork unless one has a concrete example...