thank you

#2
by arc2745 - opened

Thank you for your contribution, in the very near future ai labs may stop releasing open weights models (we already saw it with qwen 3.6 not finishing the entire family line "qwen 3.6 9B" and 3.7 being api only)

People like you are our only hope for getting better models that can fit consumer level hardware. You deserve every bit for being on the trending page on huggingface.

I'm looking forward for your v3 drop and for more benchmarks :)

arc2745 changed discussion title from terminal bench 2.0 to thank you

@arc2745 β€” thank you, this really means a lot. πŸ’š

You're right that the open side feels shakier lately β€” 3.6 came out open, 3.7 is API-only for now.

Either way, I'm going to keep building these. It's what I do, and I really want to find out how far a small model can
go on the hardware people already own β€” I don't think we've hit that limit yet. And there are a lot of people putting
out capable models for free; as long as that keeps up, anyone can run something strong on their own machine.

v3 is underway, more benchmarks coming. Thanks again. πŸ™

Any recommendations for hermes or opencode? I haven't had much success without 64k context in hermes, for example.

Hey! Both work with v2 since it's an OpenAI-compatible endpoint β€” here's what I'd tune for each.

On context: you're not imagining it. These agent harnesses have huge system prompts, and Hermes Agent piles persistent
memory + skills on top, so it genuinely eats context fast. 32k is the realistic floor and 64k is comfortable β€” below
that the system prompt crowds out your actual prompt and the model starts flailing. Good news is you have tons of
headroom: v2 goes up to 256k context, so just start llama-server with -c 65536 (or 32768 if VRAM is tight). The only
real cost is KV cache, which on a 12B is manageable.

On tool calling β€” this is the part people get wrong most. Gemma 4 has its own native tool format, so you want a server
that actually parses it:

  • llama.cpp: llama-server --jinja is the most reliable β€” it has the proper parser for Gemma 4's native tool calls.
  • Ollama: make sure you're on v0.21.0+, that version has the critical Gemma 4 tool-call fix opencode relies on.
  • opencode with llama.cpp: in opencode.json set tool_call on and the toolParser to raw-function-call + json.

Sampling matters too β€” use temp 1.0, top_p 0.95, top_k 64, rep_pen 1.1. With no rep penalty it can fall into
repetition loops on long outputs.

One thing that really helps a 12B: keep the tool list small and give each tool explicit, exact parameter names in its
schema. Gemma 4 is sensitive to that, and a lean tool set cuts down on it inventing or mangling calls. v2's agentic
training is coding/terminal style, so it shines with a focused tool set more than a 15-tool mega-harness.

yuxinlu1 pinned discussion

Thanks for the model and your contribution for the community, it means a lot!

A big thank you too !

I was waiting for the v2 since I tried the v1. You are part of the pioneers of something big, hats off πŸͺ

Sign up or log in to comment