My model is acting crazy in llama.cpp

#44
by Milor123 - opened

This model turns out to be strangely unusable for me. I am using llama.cpp, I have tried to follow the options you recommended and even some mentioned here in other discussions, I am using the atomic fork (https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant/)

image

I have an RTX 4070 with 12GB of VRAM; I don't have this problem with other models.

.\llama-server.exe -m "C:\Users\User\.vllm\gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2\gemma4-v2-Q4_K_M.gguf" --host 127.0.0.1 --port 10000 -ngl 99 -c 256000 -b 8192 -ub 2048 --no-mmap --direct-io --temp 1.0 -np 1 -fa on --top-k 64 --repeat-penalty 1.0 --top-p 0.95 --api-key XXXXX --cont-batching --metrics -ctv turbo4 -ctk turbo4 --jinja -tb 19 -t 19 --poll 100 --cpu-strict 1 --n-cpu-moe 4 --no-warmup --cache-reuse 512 --cache-ram -1 --checkpoint-every-n-tokens 65536

I even searched and found a chat template that other users recommended, so I added it --chat-template-file "C:\Users\User\.vllm\gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2\gemma4.jinja"

The model not only fails to use the tools correctly, but it doesn't understand anything at all; it's not doing the bare minimum, it's as if it were crazy haha, it doesn't load the skills, it doesn't use mcp, it has neither sense nor logic. I tried changing the temperature to 1.0, 0.5, 0.1, and I also defined the temperature for the model in my opencode.

I also tried changing the repetition penalty, but it doesn't yield any useful results

What could it be? What am I doing wrong?

Good news: this is almost certainly your launch config, not the weights. That word-salad output is the classic
signature of a corrupted KV cache, and your command has the culprit right in it.

The main problem is -ctk turbo4 -ctv turbo4. That's the TurboQuant fork's 4-bit KV cache quantization. Gemma 4's KV
cache is very sensitive to quantization β€” even standard Q4_0 KV noticeably degrades it, and a 4-bit experimental
format pushes it into the incoherent output you're seeing. It's also why changing temperature / repeat-penalty does
nothing: this isn't a sampling problem, the attention state itself is being corrupted, so no sampler setting can
recover it.

Three more issues in your command:

  • --n-cpu-moe 4 β€” Gemma 4 12B is a dense model, not MoE. That flag is only for MoE models and shouldn't be applied
    here.
  • -c 256000 β€” 256K context won't fit on a 12GB card with everything offloaded (-ngl 99); it forces the runtime into a
    bad state. Keep the context modest on a 4070.
  • --chat-template-file gemma4.jinja β€” drop it. With --jinja the model already uses its own embedded template, which is
    the correct one for tool calls. A third-party template will break tool parsing.

Try this minimal, known-good command first (FP16 KV cache, no MoE flag, sane context):

llama-server.exe -m "...\gemma4-v2-Q4_K_M.gguf" --host 127.0.0.1 --port 10000 -ngl 99 -fa on -c 16384 --jinja --temp
1.0 --top-k 64 --top-p 0.95 --repeat-penalty 1.0 --api-key XXXXX

Once that gives coherent output and tools work, add the speedups back one at a time so you can see which one breaks
it. If you want to keep TurboQuant for the memory savings: on Gemma 4 the safe KV setting is FP16, and if you must
quantize the KV use q8_0 β€” not turbo3/turbo4. Q4_K_M weights are perfectly fine on a 12GB card; the weight quant isn't
your problem here, the KV quant is. (If 16K context OOMs, drop to -c 8192.)

Sign up or log in to comment