Great agentic jump, but very slow for conversational/chat use β€” latency data inside

#11
by TILK - opened

Hi,

First, congratulations on v2 β€” the tau2-bench telecom jump (~15% β†’ ~55%) is genuinely impressive, and it's awesome to have a capable coding/agentic 12B that fits on modest hardware. I wanted to share some honest real-world feedback from running it as a personal assistant (Telegram chat bot + scheduled tasks), in case it's useful for v3.

My setup
Quant: Q4_K_M (your recommended sweet spot)
Backend: Ollama (ROCm, AMD GPU), Flash Attention on, KV cache q8_0, ctx 32K
Samplers per your model card: temp 1.0 / top_p 0.95 / top_k 64 / rep_pen 1.1
Running through an agent framework (nanobot), so the model sees tools + system prompt on every turn

The headline

It works, but for reactive/conversational use it's dramatically slower than the base gemma-4-12B-it. I measured end-to-end latencies (latency_ms logged per assistant turn) by swapping the base model for v2 and back:

  • Scenario Base gemma-4-12B-it v2 (this model)
  • Simple news briefing request (Telegram) 33 s 413 s (~7 min)
  • Memory-consolidation "nothing to do" (Dream) 44–60 s 45–117 s
  • Same task, but v2 generates up to 15 200 chars of reasoning 0 3 700–15 200 chars
  • So ~12Γ— slower on a typical chat turn, even though the base model produced the same kind of answer.

What I observed

Two behaviors seem to drive the latency:

Very verbose thinking, even when nothing needs thinking. The thought channel fires on every turn, including trivial ones. A memory-consolidation run with literally nothing in the history produced a ~15K-char internal monologue ("Wait... Actually... Self-correction...") before concluding there was nothing to do. The base model just returns immediately.
"Keep going" becomes "keep trying" on failed tool calls. On the news request, v2 repeatedly called web_fetch, each one blocked ("repeated external lookup blocked"), and instead of backing off it kept reasoning over the failures. That perseverance is clearly the agentic strength you tuned for β€” but in a chat assistant it turns into a loop.
I did follow your pinned notes (rep_pen 1.1, temp 1.0, native tool format), so this doesn't look like a sampler misconfig β€” it reads more as the model wanting to think/try hard by default.

Suggestions for v3 (if useful)
An easier path to disable or cap thinking for non-agentic turns (e.g. a chat-template toggle or a "concise" mode), so the same weights can serve both agentic and chat workloads.
Early-stop on repeated tool failures β€” after N blocked/identical tool calls, fall back to a direct answer rather than looping.
Maybe two recipes: the full "keep going" agentic one, and a lighter "chat-friendly" fine-tune.
Bottom line
For agentic/coding loops where you want it to grind through a problem, v2 is excellent. For a snappy conversational assistant, the base model is currently a better fit. I'd love a chat-friendly variant β€” I'd switch back in a heartbeat.

Thanks for the work and for being so open about the trade-offs. Looking forward to v3!

Hi, thank you for this. Honestly it's one of the most useful reports I've gotten. Actual latency numbers plus a clean
A/B against the base model is exactly what helps me. And your read is right: this isn't a sampler issue, it's the
model doing what v2 was tuned to do. The agentic data taught it to think hard and keep going, and on a trivial chat
turn that turns into a 15K-char monologue or a retry loop on a blocked tool. You basically reproduced from the outside
the exact thing I'm targeting for v3, where the failure mode is over-reasoning and re-issuing the same failed tool
call, not lack of knowledge.

A few things that might help right now, before v3:

On the thinking: Gemma 4's thinking channel is meant to be gateable per turn, and Ollama exposes it as a request
parameter (think: false, a thinking_budget, or /no_think in the prompt). If nanobot can set that, you could leave
thinking on for agentic turns and turn it off or cap the budget for plain chat turns. Same weights, two modes. Two
caveats from the field: to make "off" reliable you usually have to hint it in both the system and the user message,
and full-off can be finicky on Gemma 4, so capping the budget is often the more reliable lever. Also make sure prior
thought blocks aren't being fed back into history, only the final answer should carry over, that alone trims a lot of
the bloat. One known gotcha: some llama.cpp/Ollama builds flood tokens when thinking is disabled, so update
if you hit that.

On the tool loop: that part is partly the harness. A guard that caps identical or blocked tool calls (after N repeats,
stop and answer directly) will fix the news-fetch loop today regardless of the model. But you're right that the model
should learn to back off on its own, and that's squarely on the v3 list.

For v3 I'm already prioritizing exactly this execution-robustness side: back off after repeated failures, and don't
over-reason trivial turns. That should help conversational latency a lot even without a separate model. A dedicated
chat-friendly recipe is something I'm genuinely weighing too, since the "keep going" intensity is great for coding
loops and too much for a snappy assistant. Your "I'd switch back in a heartbeat" is a strong signal, noted.

Thanks again. This is the kind of feedback that actually moves the next version.

yuxinlu1 pinned discussion

Great !

I don't have anything useful to add here but I do want to say that this is a part of the open source community that I love. Honest feedback with iteration and openness.

Love the model and agree the speed bump would be fantastic with the reasoning enhancements. I'll try to pull some of the levers you mentioned.

Sign up or log in to comment