--- license: apache-2.0 base_model: google/gemma-4-12B-it library_name: gguf pipeline_tag: text-generation tags: [gemma4, coding, agentic, terminal, tool-use, reasoning, thinking, gguf, llama.cpp, local-llm] --- # ๐Ÿ’ป๐Ÿค– Gemma4-12B **v2** โ€” Coding + Agentic Edition โœจ ### ๐Ÿฃ Tiny footprint, big brain โ€” a local **coding & tool-using agent** for *everyone* > **No matter your GPU. No matter your RAM.** With **~4.5 GB** of VRAM *or* unified memory free, you can run your own > private, offline coding **agent** right now. ๐Ÿš€ v2 is the big **agentic** upgrade โ€” it reads, reasons, *uses tools*, > and works through multi-step technical tasks before it acts. ๐Ÿง ๐Ÿ› ๏ธ All local, all yours, no API, no cloud. --- ## ๐Ÿ“Š The headline โ€” it works as an agent (tau2-bench) v2 is built for **coding + agentic** work โ€” writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is **tau2-bench `telecom`**, an agentic tool-use benchmark whose *diagnose โ†’ fix โ†’ verify* loop mirrors real terminal/debugging work: | tau2-bench **telecom** ยท 20 tasks ยท local, same harness, **all Q8_0** | score | |---|---| | official `gemma-4-12B-it` (base) | **~15%** | | ๐ŸŸข **Gemma4-12B v2 (this model)** | **~55%** | โ†’ Roughly **3.5ร— higher** than the base model on technical-agentic tasks. ๐ŸŽฏ **Want the full story** โ€” *why* telecom, *how* the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? **It's all broken down further below. ๐Ÿ‘‡** --- ## ๐Ÿš€ Announcements **๐Ÿ”ฎ v3 is already on the way.** Honestly? Even *I* didn't expect the post-training jump to be **this** large โ€” so I'm pushing further. v3 keeps the **coding + agentic** focus and aims higher still. Stay tuned! ๐ŸŽ‰ **๐Ÿ˜ And a bigger sibling is coming โ€” Qwen3.6-27B.** I've also started fine-tuning **Qwen3.6-27B** with the same **coding + agentic** recipe, for those of you who *do* have the headroom and want more raw capability. But I haven't forgotten what this project is about: a **27B may be too heavy** for some of your GPUs / RAM. So this is **not** a replacement โ€” I'm pushing **v3 (this 12B line) in parallel, at the same time**, and it will **only get stronger**. ๐Ÿ’ช **No matter your hardware, you'll have a model that fits.** ๐Ÿ’š --- ## ๐Ÿ’š A personal note โ€” thank you, and a few honest words (please read) **First, a huge thank-you for all the data and help you've shared.** ๐Ÿ™ The bittersweet part: none of us saw it coming that **Fable 5 would be retired** โ€” and only my *own* dataset holds Fable 5's **genuine, self-authored** chain-of-thought. So for every dataset the community contributed, I **rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh)**. It may diverge from the original Fable 5 traces, but it was the only workable path โ€” and the **improvement turned out really, *really* huge** (it nearly launched me out of my chair ๐Ÿ˜„). The benchmark numbers are right above. ๐Ÿ‘† **Second** โ€” I've tried to **reply to every community comment**, and I've openly **owned v1's training problems**. Truly, thank you: your feedback is what lets me improve. ๐Ÿ’š Because v1 hit **#1 trending**, it also attracted some **bad words / trolling**. I'll say this gently but firmly: **real criticism is always welcome here โ€” pure insults are not.** This is a **local** model that lets anyone run a capable AI on tiny RAM/VRAM, at **zero API cost** and fully **private**; I even open-sourced the **full safetensors master** to study and build on. If something's off, **open a discussion about the actual problem** โ€” I genuinely want to hear it and I'll act on it. But comments that are *only* insults help no one, and I'll remove them without hesitation. ๐Ÿ™ Please remember: **I'm one person** โ€” not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on **my own time and my own money**: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a **dynamic context-window** pass to keep the agent's *read-before-act* steps intact), reading the latest papers, then training โ†’ evaluating โ†’ training โ†’ evaluating. It burned through an **entire Claude Max 20ร— plan** (I keep a separate Pro for my own work), and **v2 alone cost 40+ hours** โ€” even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. ๐Ÿพ --- ## ๐Ÿ”ฌ The benchmarks, in detail (tau2-bench) I evaluated v2 on **tau2-bench** (an agentic tool-use benchmark). I did **not** run the whole suite โ€” it's very time-consuming โ€” so I focused on the single domain that best matches what v2 is for. **Why tau2-bench `telecom`?** Telecom troubleshooting makes the agent **diagnose with read/inspect tools โ†’ pinpoint the issue โ†’ apply a fix โ†’ verify it** โ€” structurally the *same loop* as real terminal/debugging work (*check state โ†’ diagnose โ†’ fix โ†’ confirm*). That's exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain). | tau2-bench **telecom** ยท 20 tasks ยท local, same harness, **all Q8_0** | score | |---|---| | official `gemma-4-12B-it` (base) | **~15%** | | ๐ŸŸข **Gemma4-12B v2 (this model)** | **~55%** | โ†’ Roughly **3.5ร— higher** than the base model on technical-agentic tasks. ๐ŸŽฏ **Grounded, not made-up.** Independently, a coding/terminal *fabrication probe* (tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2 **grounds before it acts** just like the base โ€” it `grep`/`read`/`ls` first, and **doesn't make things up** (0% fabrication, on par with the base model). **The interesting part โ€” *how* they fail.** The **base model gives up early**: on this run it bailed to a human agent **10 times** (`transfer_to_human`) instead of finishing the fix. **v2 keeps going** โ€” it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still **flails a little sometimes** (over-trying, retrying). And some of the remaining misses are actually a **bug in the benchmark's own APN tool** (it throws on inputs it should handle gracefully), not the model. To be clear: **I will not patch the benchmark's tools or leak its test questions just to inflate my score** โ€” I'd rather report an honest number and improve the *model* itself. **More training is coming in v3.** ๐Ÿ”ง **About `retail` (customer-service shopping):** on tau2-bench *retail*, the base model scores a bit higher than v2. **This is fully expected and by design.** Retail is pure customer-service (look up a user, process an order) โ€” *not* what this model is for. v2 is specialized for **coding / terminal / technical-agentic** work, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a **local coding/agentic** model? It is. ๐Ÿ’š **Let's keep it honest about scale.** Today's *frontier* models โ€” think **mimo-v2.5-pro** or **Opus 4.8** โ€” all land **90%+** on this telecom benchmark. They're also *enormous*. For a **12B** model, my rough *guess* is that v3 might top out somewhere around **60โ€“70%** (emphasis on *guess* โ€” I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind โ€” **this is a 12B model running on your own machine**, and narrowing that gap as much as possible *at this size* is the whole point. ๐Ÿ’ช **And the trade-off โ€” there's no free lunch.** I also ran a general-knowledge benchmark (**MMLU-Pro**), and v2 lands **a little below the base model** there. That's **completely normal and expected** for a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose **[Claude Opus 4.6/4.8 distillation](https://huggingface.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF)** โ€” or the original **`google/gemma-4-12B-it`** base. Need a **local coding/agentic** worker? That's what v2 is tuned for. > ๐Ÿ”ฌ *Methodology, honestly:* these are **local, same-harness, relative** numbers (**all models tested at Q8_0**, greedy > decoding, self-simulated user, 20 tasks). They are **not** directly comparable to published tau2-bench leaderboard > figures (different user-simulator, full task sets, full precision) โ€” local self-eval runs *systematically lower* than > published scores. Read them as **"v2 vs the base model under identical conditions"**, which is the comparison that > actually matters here. --- ## ๐Ÿ“š What's new in v2 (training) v2 continues from the v1 coder and adds a big **agentic** push โ€” the piece v1 was missing: - **๐Ÿ› ๏ธ Agentic / terminal** โ€” real **multi-step tool-use** trajectories (*read โ†’ reason โ†’ act โ†’ verify*), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior. - **๐Ÿ’ป Coding** โ€” verified chain-of-thought over Python tasks (**real CoT, gated on passing tests**) plus the Fable-5-redo set for the hard cases. - **๐Ÿ“š General** โ€” a curated slice of reasoning/instruction data to keep broad competence. All reasoning is **distilled CoT** (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8). --- ## ๐Ÿ“ฆ Pick your size (GGUF quants) | Quant | Size | Vibe | |------|------|------| | ๐ŸŸข **Q2_K** | **4.5 GB** | tiniest โ€” runs almost anywhere | | ๐ŸŸก **Q3_K_M** | **5.7 GB** | great for 8 GB VRAM | | ๐Ÿ”ต **Q4_K_M** | **6.87 GB** | the sweet spot ๐Ÿ‘Œ (recommended) | | ๐ŸŸฃ **Q6_K** | **9.11 GB** | near-lossless | | โšช **Q8_0** | **11.8 GB** | basically full quality | > ๐Ÿงฐ The **full-precision `safetensors` master** is open too โ€” roll your own GGUF / MLX / AWQ quants or fine-tune on top. --- ## ๐Ÿš€ How to run it ### Option A โ€” llama.cpp (recommended) ๐Ÿฆ™ > โš ๏ธ Needs a **recent llama.cpp** (this is the `gemma4_unified` architecture โ€” older builds won't load it). ```bat @echo off cd /d C:\llama.cpp llama-server.exe ^ -m C:\models\gemma4-v2-Q4_K_M.gguf ^ --ctx-size 16384 ^ --n-gpu-layers 99 ^ --no-mmap -fa on ^ --jinja ^ --temp 1.0 --top-p 0.95 --top-k 64 ^ --host 0.0.0.0 --port 18080 pause ``` - **๐Ÿ› ๏ธ Agentic use:** pass your tools via the OpenAI `tools` field (works with `--jinja`). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify). - **๐Ÿ–ฑ๏ธ One-click apps:** LM Studio / Jan / Ollama โ€” import the GGUF, pick a quant, go. ### ๐Ÿง  Thinking mode v2 thinks in Gemma's native thought channel before answering (keep `enable_thinking=true`, the default chat template handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`; for coding you can also go greedy (`temp 0`). --- ## โš ๏ธ Good to know - **Specialized for coding / terminal / agentic.** General-knowledge facts/numbers should still be double-checked. - **Reduced refusals:** task-focused training, not safety-aligned โ€” add your own guardrails for production. Use responsibly. ๐Ÿ™ - English-centric. --- ## ๐Ÿ“š Base & License - **License: Apache 2.0.** Gemma 4 is released by Google under **[Apache 2.0](https://ai.google.dev/gemma/apache_2)** (unlike the older Gemma 1/2/3 terms), so this fine-tune is **Apache 2.0** too โ€” free to use, modify, and redistribute. ๐ŸŽ‰ - **Base model:** [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it). - Personal/hobby project โ€” shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! ๐Ÿพโœจ