wacky output with Q2_K

#4
by ehurrn - opened

I said hello
it said:

[Screenshot 2026-06-18 at 9.10.09 PM]

Thank you so much for reporting this, it's genuinely helpful. 🙏 You caught a real bug: the Q2_K quant is broken (it collapses into gibberish, exactly what you saw). I just re-tested every quant on llama.cpp, and the rest are healthy: Q3_K_M, Q4_K_M, Q6_K and Q8_0 all generate cleanly. So for now please grab Q4_K_M (the recommended one) or higher and you'll be good. I'll rebuild Q2_K properly and re-upload it shortly. Thanks again for the catch! 🛠️

yuxinlu1 pinned discussion

Update on the Q2_K issue — and thanks again for catching it 🙏
I traced it to the Q2_K quant specifically: Gemma 4's huge vocabulary makes a plain Q2_K collapse into gibberish. I rebuilt it with an imatrix, which fixed the obvious garbling… but when I stress-tested it on longer, more complex generations (a full interactive web page, etc.) it still gave me headaches — it gets shaky on the harder stuff. Q3_K_M holds up far better, so I've decided to hold Q2_K back from this release until I've actually solved every issue, rather than ship something flaky. For now: Q3_K_M is the smallest reliable option, Q4_K_M is the sweet spot I'd recommend, and Q3/Q4/Q6/Q8 all check out fine. Appreciate your patience! 🛠️
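For anyone who wants to try an imatrix rebuild themselves, here is a minimal sketch of the usual two-step llama.cpp flow (collect an importance matrix, then quantize with it). This is not necessarily the exact procedure used for this repo: the file names are hypothetical placeholders, and it assumes the `llama-imatrix` and `llama-quantize` binaries from a recent llama.cpp build. The helper only builds and prints the commands; it does not run them.

```python
import shlex


def imatrix_quant_commands(f16_model, calibration_text, out_model,
                           quant_type="Q2_K", imatrix_file="imatrix.dat"):
    """Return the two argv lists for an imatrix-guided quantization.

    All paths are hypothetical placeholders chosen for illustration.
    """
    # Step 1: run the full-precision model over calibration text and
    # record per-tensor activation statistics (the importance matrix).
    collect = ["llama-imatrix", "-m", f16_model,
               "-f", calibration_text, "-o", imatrix_file]
    # Step 2: quantize, letting the importance matrix guide rounding.
    quantize = ["llama-quantize", "--imatrix", imatrix_file,
                f16_model, out_model, quant_type]
    return collect, quantize


if __name__ == "__main__":
    for cmd in imatrix_quant_commands("model-f16.gguf", "calibration.txt",
                                      "model-Q2_K.gguf"):
        print(shlex.join(cmd))
```

Swapping `quant_type` for `Q3_K_M` or `IQ4_XS` gives the equivalent command for the other sizes mentioned in this thread.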

Makes sense, hard to pack that much into Q2

I always feel like using a Q4 / IQ4 is like using an "almost there" model. I've never tried a Q2; IQ3 is the lowest I've tried, and I found it basically unusable (not this model specifically, but other Gemma 4 models).

@bleaki Your instinct's right, and it bites harder on Gemma 4 specifically. The vocab is huge (262k tokens), so the embedding/output layers are a big slice of the weights — and those are exactly what degrades most at low bits. That's why a plain Q2_K collapsed here, and why IQ3 feels rough on Gemma 4s in general. An imatrix recovers a lot at the low end (it un-garbled my Q2), but below ~Q3 it's still too shaky for harder generations, which is why I held Q2 back.
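To put a rough number on "a big slice", here is a back-of-the-envelope sketch. Only the 262k vocabulary comes from this thread; the hidden size and total parameter count below are hypothetical placeholders, not this model's real configuration, so plug in the values from the model's config before drawing conclusions.

```python
def embedding_share(vocab_size, hidden_size, total_params, tied=True):
    """Fraction of all weights that sit in the vocab-sized matrices.

    tied=True  -> input embedding and output head share one matrix.
    tied=False -> they are two separate vocab_size x hidden_size matrices.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size / total_params


if __name__ == "__main__":
    VOCAB = 262_000          # "262k tokens", as stated above
    HIDDEN = 2560            # hypothetical hidden size
    TOTAL = 4_000_000_000    # hypothetical total parameter count
    share = embedding_share(VOCAB, HIDDEN, TOTAL)
    print(f"vocab-sized weights: {share:.1%} of the model")
```

The share grows as the rest of the model shrinks, so smaller models with the same vocabulary have proportionally more of their weights in these matrices.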

For "actually there": Q4_K_M is the sweet spot, and Q6_K is basically indistinguishable from Q8 if you want max fidelity short of full precision. If you prefer I-quants, I can add imatrix IQ4_XS / IQ3_M — better quality-per-bit at the low end than the same-size K-quant. Happy to push those if there's interest.
