---
license: apache-2.0
base_model: google/gemma-4-12B-it
library_name: gguf
pipeline_tag: text-generation
tags: [gemma4, coding, agentic, terminal, tool-use, reasoning, thinking, gguf, llama.cpp, local-llm]
---

# 💻🤖 Gemma4-12B **v2**: Coding + Agentic Edition ✨

### 🐣 Tiny footprint, big brain: a local **coding & tool-using agent** for *everyone*

> **No matter your GPU. No matter your RAM.** With **~4.5 GB** of VRAM *or* unified memory free, you can run your own
> private, offline coding **agent** right now. 🚀 v2 is the big **agentic** upgrade: it reads, reasons, *uses tools*,
> and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.

---

## 📊 The headline: it works as an agent (tau2-bench)

v2 is built for **coding + agentic** work: writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is **tau2-bench `telecom`**, an agentic tool-use benchmark whose *diagnose → fix → verify* loop mirrors real terminal/debugging work:

| tau2-bench **telecom** · 20 tasks · local, same harness, **all Q8_0** | score |
|---|---|
| official `gemma-4-12B-it` (base) | **~15%** |
| 🟢 **Gemma4-12B v2 (this model)** | **~55%** |

→ Roughly **3.5× higher** than the base model on technical-agentic tasks. 🎯

**Want the full story**: *why* telecom, *how* the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? **It's all broken down further below. 👇**

---

## 🚀 Announcements

**📌 Hitting a problem? Please check my pinned discussion first.** **~99% of issues are a client/sampler config, not the weights**, and they have a quick fix there.
For example: garbled or **repeating `0000…`** output almost always means **no repetition penalty** (set `rep_pen 1.1`, `temp 1.0`); and leaked `<|tool_call>` / `<|channel>` tokens mean your front-end isn't parsing Gemma 4's **native tool format** (use llama.cpp `--jinja`). If your question isn't covered, **don't hesitate to open a discussion**; I read them and reply as fast as I can. 💬

**📦 No Q2_K this release.** I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back. **I only ship a quant once I'm confident it's genuinely good.** Smallest reliable option is **Q3_K_M**; **Q4_K_M** is the recommended sweet spot. 🙏

**🔮 v3 is already on the way.** Honestly? Even *I* didn't expect the post-training jump to be **this** large, so I'm pushing further. v3 keeps the **coding + agentic** focus and aims higher still. Stay tuned! 🎉

**🐘 And a bigger sibling is coming: Qwen3.6-27B.** I've also started fine-tuning **Qwen3.6-27B** with the same **coding + agentic** recipe, for those of you who *do* have the headroom and want more raw capability. But I haven't forgotten what this project is about: a **27B may be too heavy** for some of your GPUs / RAM. So this is **not** a replacement. I'm pushing **v3 (this 12B line) in parallel, at the same time**, and it will **only get stronger**. 💪 **No matter your hardware, you'll have a model that fits.** 💚

---

## 💚 A personal note: thank you, and a few honest words (please read)

**First, a huge thank-you for all the data and help you've shared.** 🙏 The bittersweet part: none of us saw it coming that **Fable 5 would be retired**, and only my *own* dataset holds Fable 5's **genuine, self-authored** chain-of-thought. So for every dataset the community contributed, I **rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh)**.
It may diverge from the original Fable 5 traces, but it was the only workable path, and the **improvement turned out really, *really* huge** (it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆

**Second**: I've tried to **reply to every community comment**, and I've openly **owned v1's training problems**. Truly, thank you: your feedback is what lets me improve. 💚

Because v1 hit **#1 trending**, it also attracted some **insults and trolling**. I'll say this gently but firmly: **real criticism is always welcome here; pure insults are not.** This is a **local** model that lets anyone run a capable AI on tiny RAM/VRAM, at **zero API cost** and fully **private**; I even open-sourced the **full safetensors master** to study and build on. If something's off, **open a discussion about the actual problem**. I genuinely want to hear it and I'll act on it. But comments that are *only* insults help no one, and I'll remove them without hesitation. 🙏

Please remember: **I'm one person**, not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on **my own time and my own money**: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a **dynamic context-window** pass to keep the agent's *read-before-act* steps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through an **entire Claude Max 20× plan** (I keep a separate Pro for my own work), and **v2 alone cost 40+ hours**: even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾

---

## 🔬 The benchmarks, in detail (tau2-bench)

I evaluated v2 on **tau2-bench** (an agentic tool-use benchmark). I did **not** run the whole suite, because it's very time-consuming, so I focused on the single domain that best matches what v2 is for.


**Why tau2-bench `telecom`?** Telecom troubleshooting makes the agent **diagnose with read/inspect tools → pinpoint the issue → apply a fix → verify it**, structurally the *same loop* as real terminal/debugging work (*check state → diagnose → fix → confirm*). That's exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain).

| tau2-bench **telecom** · 20 tasks · local, same harness, **all Q8_0** | score |
|---|---|
| official `gemma-4-12B-it` (base) | **~15%** |
| 🟢 **Gemma4-12B v2 (this model)** | **~55%** |

→ Roughly **3.5× higher** than the base model on technical-agentic tasks. 🎯

**Grounded, not made-up.** Independently, a coding/terminal *fabrication probe* (tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2 **grounds before it acts** just like the base: it `grep`/`read`/`ls` first, and **doesn't make things up** (0% fabrication, on par with the base model).

**The interesting part: *how* they fail.** The **base model gives up early**: on this run it bailed to a human agent **10 times** (`transfer_to_human`) instead of finishing the fix. **v2 keeps going**: it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still **flails a little sometimes** (over-trying, retrying). And some of the remaining misses are actually a **bug in the benchmark's own APN tool** (it throws on inputs it should handle gracefully), not the model. To be clear: **I will not patch the benchmark's tools or leak its test questions just to inflate my score**. I'd rather report an honest number and improve the *model* itself. **More training is coming in v3.** 🔧

**About `retail` (customer-service shopping):** on tau2-bench *retail*, the base model scores a bit higher than v2.
**This is fully expected and by design.** Retail is pure customer-service (look up a user, process an order), *not* what this model is for. v2 is specialized for **coding / terminal / technical-agentic** work, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a **local coding/agentic** model? It is. 💚

**Let's keep it honest about scale.** Today's *frontier* models (think **mimo-v2.5-pro** or **Opus 4.8**) all land **90%+** on this telecom benchmark. They're also *enormous*. For a **12B** model, my rough *guess* is that v3 might top out somewhere around **60–70%** (emphasis on *guess*; I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind: **this is a 12B model running on your own machine**, and narrowing that gap as much as possible *at this size* is the whole point. 💪

**And the trade-off: there's no free lunch.** I also ran a general-knowledge benchmark (**MMLU-Pro**), and v2 lands **a little below the base model** there. That's **completely normal and expected** for a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose **[Claude Opus 4.6/4.8 distillation](https://huggingface.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF)**, or the original **`google/gemma-4-12B-it`** base. Need a **local coding/agentic** worker? That's what v2 is tuned for.

> 🔬 *Methodology, honestly:* these are **local, same-harness, relative** numbers (**all models tested at Q8_0**, greedy
> decoding, self-simulated user, 20 tasks). They are **not** directly comparable to published tau2-bench leaderboard
> figures (different user-simulator, full task sets, full precision); local self-eval runs *systematically lower* than
> published scores.
Read them as **"v2 vs the base model under identical conditions"**, which is the comparison that actually matters here.

---

## 📚 What's new in v2 (training)

v2 continues from the v1 coder and adds a big **agentic** push, the piece v1 was missing:

- **🛠️ Agentic / terminal**: real **multi-step tool-use** trajectories (*read → reason → act → verify*), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
- **💻 Coding**: verified chain-of-thought over Python tasks (**real CoT, gated on passing tests**) plus the Fable-5-redo set for the hard cases.
- **📚 General**: a curated slice of reasoning/instruction data to keep broad competence.

All reasoning is **distilled CoT** (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).

---

## 📦 Pick your size (GGUF quants)

| Quant | Size | Vibe |
|------|------|------|
| 🟡 **Q3_K_M** | **5.7 GB** | great for 8 GB VRAM |
| 🔵 **Q4_K_M** | **6.87 GB** | the sweet spot 👌 (recommended) |
| 🟣 **Q6_K** | **9.11 GB** | near-lossless |
| ⚪ **Q8_0** | **11.8 GB** | basically full quality |

> ℹ️ *No **Q2_K** this release: it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = **Q3_K_M**.*

---

## 🚀 How to run it

### Option A: llama.cpp (recommended) 🦙

> ⚠️ Needs a **recent llama.cpp** (this is the `gemma4_unified` architecture; older builds won't load it).

```bat
@echo off
rem Launch llama-server with the Q4_K_M quant; adjust both paths to your setup.
rem --jinja applies the model's built-in chat template (needed for native tool calls).
rem --host 0.0.0.0 listens on every interface; use 127.0.0.1 to keep it local-only.
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause
```

- **🛠️ Agentic use:** pass your tools via the OpenAI `tools` field (works with `--jinja`). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify).
- **🖱️ One-click apps:** LM Studio / Jan / Ollama: import the GGUF, pick a quant, go.

### 🧠 Thinking mode

v2 thinks in Gemma's native thought channel before answering (keep `enable_thinking=true`; the default chat template handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`; for coding you can also go greedy (`temp 0`).

---

## ⚠️ Good to know

- **Specialized for coding / terminal / agentic.** General-knowledge facts/numbers should still be double-checked.
- **Reduced refusals:** task-focused training, not safety-aligned. Add your own guardrails for production. Use responsibly. 🙏
- English-centric.

---

## 📚 Base & License

- **License: Apache 2.0.** Gemma 4 is released by Google under **[Apache 2.0](https://ai.google.dev/gemma/apache_2)** (unlike the older Gemma 1/2/3 terms), so this fine-tune is **Apache 2.0** too: free to use, modify, and redistribute. 🎉
- **Base model:** [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it).
- Personal/hobby project, shared as-is, no warranty.

Built with time, care, and a lot of coffee. Have fun! 🐾✨

---

## ⚡ Speculative decoding (MTP draft): verified build

The `MTP/` folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official `gemma-4-12B-it-assistant`) for speculative decoding. Gemma 4 MTP is in **llama.cpp mainline** (PR #23398), so no fork is needed, but the `gemma4-assistant` loader is **build-sensitive right now**, so please use the exact build below:

- ✅ **Verified working: llama.cpp `b9553` (commit `9e3b928fd`).** I reproduced it with `gemma4-v2-Q8_0` + the `MTP-Q8_0` draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). **Lossless** either way.
- ⚠️ **Newer builds (e.g. b9702 / b9717) currently crash** while loading the draft with `invalid vector subscript`.
  This is an **upstream regression** in the `gemma4-assistant` loader path, *not* a problem with these GGUFs: the same files load fine on b9553. Stick with **b9553** until it's fixed upstream.

Working command on b9553 (note the older flag names: `--model-draft`, **not** `--spec-draft-model`):

```bat
llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja
```

> ℹ️ The `Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)` line is harmless. The
> draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific
> draft would give, but it is still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to
> Q6_K/Q4_K_M or a smaller `--ctx-size` if you hit OOM.
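
The agentic-use note under *How to run it* says to pass tools through the OpenAI `tools` field. Here is a minimal Python sketch of that round trip, assuming the Option A server is reachable on port 18080. The `read_file` tool, the helper names, and the URL are illustrative assumptions, not something shipped with the model or llama.cpp. Sending `top_k` in the request body is a llama-server extension rather than part of the OpenAI schema, so treat it as optional.

```python
import json
import urllib.request

# Assumption: llama-server from Option A is running locally on this port.
BASE_URL = "http://127.0.0.1:18080/v1/chat/completions"

# Hypothetical tool definition, for illustration only.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the text content of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(messages, tools):
    """Build an OpenAI-style chat payload using the card's recommended sampling."""
    return {
        "messages": messages,
        "tools": tools,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # llama-server extension; drop it for strict OpenAI clients
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        # OpenAI-style responses carry arguments as a JSON string.
        if isinstance(args, str):
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls


def send(payload, url=BASE_URL):
    """POST the payload to the server and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

To try it, start the server, call `send(build_request(messages, [READ_FILE_TOOL]))`, and pass the reply to `extract_tool_calls`. Your own agent loop still has to execute each call and feed the result back as a `tool` message before asking the model to continue.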