Great agentic jump, but very slow for conversational/chat use: latency data inside
Hi,
First, congratulations on v2: the tau2-bench telecom jump (~15% → ~55%) is genuinely impressive, and it's awesome to have a capable coding/agentic 12B that fits on modest hardware. I wanted to share some honest real-world feedback from running it as a personal assistant (Telegram chat bot + scheduled tasks), in case it's useful for v3.
My setup
Quant: Q4_K_M (your recommended sweet spot)
Backend: Ollama (ROCm, AMD GPU), Flash Attention on, KV cache q8_0, ctx 32K
Samplers per your model card: temp 1.0 / top_p 0.95 / top_k 64 / rep_pen 1.1
Running through an agent framework (nanobot), so the model sees tools + system prompt on every turn
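For reference, the setup above maps roughly onto an Ollama chat request like the sketch below. The option names (`temperature`, `top_p`, `top_k`, `repeat_penalty`, `num_ctx`) are Ollama's standard request options. The model tag, the reading of "32K" as 32768 tokens, and the `build_request` helper are assumptions made for illustration. Flash Attention and the KV cache type are server-side settings (the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables), so they appear only as a comment.

```python
# Sketch of the reported setup as an Ollama /api/chat request body.
# Server-side settings, set in the environment before starting Ollama:
#   OLLAMA_FLASH_ATTENTION=1
#   OLLAMA_KV_CACHE_TYPE=q8_0

# Assumption: the model was pulled from the Hub under this tag.
MODEL_TAG = (
    "hf.co/yuxinlu1/"
    "gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF:Q4_K_M"
)

# Sampler values as listed in the post.
SAMPLER_OPTIONS = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "repeat_penalty": 1.1,
    "num_ctx": 32768,  # assumption: "ctx 32K" means 32 * 1024 tokens
}


def build_request(messages, model=MODEL_TAG):
    """Return a request body; copies inputs so callers can reuse them."""
    return {
        "model": model,
        "messages": list(messages),
        "options": dict(SAMPLER_OPTIONS),
        "stream": False,
    }


request_body = build_request([{"role": "user", "content": "Hello"}])
```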
The headline
It works, but for reactive/conversational use it's dramatically slower than the base gemma-4-12B-it. I measured end-to-end latencies (latency_ms logged per assistant turn) by swapping the base model for v2 and back:
| Scenario | Base gemma-4-12B-it | v2 (this model) |
|---|---|---|
| Simple news briefing request (Telegram) | 33 s | 413 s (~7 min) |
| Memory-consolidation "nothing to do" (Dream) | 44–60 s | 45–117 s |
| Same task, but v2 generates up to 15 200 chars of reasoning | 0 | 3 700–15 200 chars |

So ~12× slower on a typical chat turn, even though the base model produced the same kind of answer.
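The headline ratio follows directly from the two news-briefing latencies. Here is a minimal sketch of how a per-turn latency and the resulting slowdown could be computed; `timed_turn` and `slowdown` are hypothetical helper names, not part of nanobot or Ollama.

```python
import time


def timed_turn(run_turn, *args, **kwargs):
    """Call run_turn and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = run_turn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


def slowdown(base_seconds, candidate_seconds):
    """Return how many times slower the candidate is than the base."""
    if base_seconds <= 0:
        raise ValueError("base latency must be positive")
    return candidate_seconds / base_seconds


# News briefing request: 33 s on the base model, 413 s on v2.
news_ratio = slowdown(33, 413)
print(f"news briefing: {news_ratio:.1f}x slower")  # news briefing: 12.5x slower
```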
What I observed
Two behaviors seem to drive the latency:
Very verbose thinking, even when nothing needs thinking. The thought channel fires on every turn, including trivial ones. A memory-consolidation run with literally nothing in the history produced a ~15K-char internal monologue ("Wait... Actually... Self-correction...") before concluding there was nothing to do. The base model just returns immediately.
"Keep going" becomes "keep trying" on failed tool calls. On the news request, v2 repeatedly called web_fetch, each one blocked ("repeated external lookup blocked"), and instead of backing off it kept reasoning over the failures. That perseverance is clearly the agentic strength you tuned for, but in a chat assistant it turns into a loop.
I did follow your pinned notes (rep_pen 1.1, temp 1.0, native tool format), so this doesn't look like a sampler misconfig; it reads more as the model wanting to think/try hard by default.
Suggestions for v3 (if useful)
An easier path to disable or cap thinking for non-agentic turns (e.g. a chat-template toggle or a "concise" mode), so the same weights can serve both agentic and chat workloads.
Early-stop on repeated tool failures: after N blocked/identical tool calls, fall back to a direct answer rather than looping.
Maybe two recipes: the full "keep going" agentic one, and a lighter "chat-friendly" fine-tune.
Bottom line
For agentic/coding loops where you want it to grind through a problem, v2 is excellent. For a snappy conversational assistant, the base model is currently a better fit. I'd love a chat-friendly variant; I'd switch back in a heartbeat.
Thanks for the work and for being so open about the trade-offs. Looking forward to v3!
Hi, thank you for this. Honestly it's one of the most useful reports I've gotten. Actual latency numbers plus a clean
A/B against the base model is exactly what helps me. And your read is right: this isn't a sampler issue, it's the
model doing what v2 was tuned to do. The agentic data taught it to think hard and keep going, and on a trivial chat
turn that turns into a 15K-char monologue or a retry loop on a blocked tool. You basically reproduced from the outside
the exact thing I'm targeting for v3, where the failure mode is over-reasoning and re-issuing the same failed tool
call, not lack of knowledge.
A few things that might help right now, before v3:
On the thinking: Gemma 4's thinking channel is meant to be gateable per turn, and Ollama exposes it as a request
parameter (think: false, a thinking_budget, or /no_think in the prompt). If nanobot can set that, you could leave
thinking on for agentic turns and turn it off or cap the budget for plain chat turns. Same weights, two modes. Two
caveats from the field: to make "off" reliable you usually have to hint it in both the system and the user message,
and full-off can be finicky on Gemma 4, so capping the budget is often the more reliable lever. Also make sure prior
thought blocks aren't being fed back into history: only the final answer should carry over, and that alone trims a lot of
the bloat. One known gotcha: some llama.cpp/Ollama builds flood tokens when thinking is disabled, so update
if you hit that.
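A minimal sketch of the "same weights, two modes" idea, combined with keeping old reasoning out of the history. It uses only the boolean `think` field of Ollama's chat API; whether this fine-tune fully honors it is not confirmed in this thread, and the budget and prompt-hint variants are left out because their exact form is not verified here. It also assumes the harness stores assistant turns as dicts with reasoning in a separate `thinking` key, as Ollama returns them. `strip_thinking`, `build_chat_request`, and the `agentic` flag are hypothetical names.

```python
def strip_thinking(history):
    """Return a copy of the history without any stored reasoning text."""
    cleaned = []
    for message in history:
        # Keep role/content/tool fields; drop only the reasoning channel.
        cleaned.append({k: v for k, v in message.items() if k != "thinking"})
    return cleaned


def build_chat_request(model, history, user_text, agentic):
    """Build a chat request body with thinking gated per turn."""
    messages = strip_thinking(history)
    messages.append({"role": "user", "content": user_text})
    return {
        "model": model,
        "messages": messages,
        "think": bool(agentic),  # on for agentic turns, off for plain chat
        "stream": False,
    }


history = [
    {"role": "user", "content": "Anything to consolidate?"},
    {"role": "assistant", "content": "Nothing to do.", "thinking": "Wait..."},
]
chat_request = build_chat_request("my-model", history, "Thanks!", agentic=False)
```

The caller's history is left untouched, so the harness can still log the full reasoning locally while sending only final answers back to the model.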
On the tool loop: that part is partly the harness. A guard that caps identical or blocked tool calls (after N repeats,
stop and answer directly) will fix the news-fetch loop today regardless of the model. But you're right that the model
should learn to back off on its own, and that's squarely on the v3 list.
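A harness-side guard of this kind could look like the sketch below. `ToolCallGuard` and its methods are hypothetical names, not part of nanobot; the sketch assumes tool arguments are JSON-serializable and that the harness reports each blocked or failed call back to the guard.

```python
import json
from collections import Counter


class ToolCallGuard:
    """Refuse a tool call once the identical call has failed too often."""

    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self._failures = Counter()

    @staticmethod
    def _key(name, arguments):
        # Sort keys so the same arguments always produce the same key.
        return name + ":" + json.dumps(arguments, sort_keys=True)

    def allow(self, name, arguments):
        """Return True while this exact call is under the failure cap."""
        return self._failures[self._key(name, arguments)] < self.max_repeats

    def record_failure(self, name, arguments):
        """Count one blocked or failed attempt for this exact call."""
        self._failures[self._key(name, arguments)] += 1


guard = ToolCallGuard(max_repeats=2)
call = ("web_fetch", {"url": "https://example.com/news"})

guard.record_failure(*call)
guard.record_failure(*call)
if not guard.allow(*call):
    # The harness would stop dispatching here and request a direct answer.
    print("tool call capped, answer directly")
```

Keying on the tool name plus its arguments means a retry with a different URL is still allowed, while re-issuing the same failed call is cut off after `max_repeats` attempts.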
For v3 I'm already prioritizing exactly this execution-robustness side: back off after repeated failures, and don't
over-reason trivial turns. That should help conversational latency a lot even without a separate model. A dedicated
chat-friendly recipe is something I'm genuinely weighing too, since the "keep going" intensity is great for coding
loops and too much for a snappy assistant. Your "I'd switch back in a heartbeat" is a strong signal, noted.
Thanks again. This is the kind of feedback that actually moves the next version.
Great!
I don't have anything useful to add here but I do want to say that this is a part of the open source community that I love. Honest feedback with iteration and openness.
Love the model and agree the speed bump would be fantastic with the reasoning enhancements. I'll try to pull some of the levers you mentioned.