Thank you to a fellow open-source dev for the rigorous testing: corrected sampler guidance (long-form generation)

#25
by yuxinlu1 - opened

A community member put real, reproducible work into stress-testing v2 on long-form code generation, and I want to give credit where it's due.

First, the generous part, because I mean it: I'm always happy to help people share their work. If you've built something good, post it in its own thread. I'll actually test it myself, and if it holds up, I'll pin it for you. That offer is open to anyone. The only thing I ask is to keep promotion separate from a review billed as "independent": a community evaluation should be public and honest, not a vehicle for recommending your own model. That's the single reason I moved the self-recommendation; the testing itself, I'm genuinely grateful for.

So I ran a controlled multi-seed test myself (Q4_K_M, the exact community inference settings) to get real numbers instead of trading theories. Here's what I found, and a correction to my earlier guidance.

What I found (5 seeds per setting, same prompt):

- On a very long, dense one-shot generation (a full single-file dashboard), stacking repetition penalties hurts. At rep_pen 1.1 (and worse with DRY 0.8), the model truncated or degraded before finishing in every seed. My earlier blanket advice to "always use rep_pen 1.1 + DRY" was wrong for this kind of task, and I'm correcting it.
- At rep_pen 1.0 (no penalty), the same task completed cleanly in most seeds, and the dramatic collapse case did not reproduce across 5 seeds, so it's a rare event, not the default behavior, and not a sign of damaged weights.
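For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a multi-seed tally, not the exact script I ran. The `generate` callable is a placeholder for whatever runtime you use, and "output ends with `</html>`" is only an assumed stand-in for "finished"; swap in a check that fits your task.

```python
from typing import Callable, Dict, Iterable


def looks_complete(output: str) -> bool:
    """Crude completion check (assumption): the closing </html> tag was emitted."""
    return output.rstrip().lower().endswith("</html>")


def completion_counts(
    generate: Callable[[str, dict, int], str],  # hypothetical: (prompt, params, seed) -> text
    prompt: str,
    settings: Dict[str, dict],
    seeds: Iterable[int] = range(5),
) -> Dict[str, int]:
    """Run the same prompt once per seed for each sampler setting
    and count how many runs pass the completion check."""
    seed_list = list(seeds)
    return {
        name: sum(looks_complete(generate(prompt, params, seed)) for seed in seed_list)
        for name, params in settings.items()
    }
```

Passing the same prompt and the same seed list to every setting keeps the sampler parameters as the only thing that changes between runs.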

Corrected guidance:
- Long one-shot code/HTML generation: rep_pen 1.0 (no penalty), no DRY. Keep temp 1.0 / top_p 0.95 / top_k 64.
- Short outputs, chat, agentic tool-calling: a mild rep_pen (~1.05–1.1) is fine and helps with the occasional repetition without hurting short structured output.
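The guidance above could be written down as presets, roughly like this. It is a sketch only: the key names are illustrative and need mapping to your runtime's own option names, `dry_multiplier: 0.0` assumes a runtime where a zero multiplier turns DRY off, and reusing the same temp / top_p / top_k for short outputs is my assumption here, since the guidance only states them for the long-form case.

```python
# Base sampling values stated for long one-shot generation.
BASE_SAMPLING = {"temp": 1.0, "top_p": 0.95, "top_k": 64}

SAMPLER_PRESETS = {
    # Long one-shot code/HTML: no repetition penalty, no DRY.
    "long_one_shot": {**BASE_SAMPLING, "rep_pen": 1.0, "dry_multiplier": 0.0},
    # Short outputs, chat, agentic tool-calling: mild penalty (about 1.05 to 1.1).
    # Reusing BASE_SAMPLING here is an assumption, not stated guidance.
    "short_or_agentic": {**BASE_SAMPLING, "rep_pen": 1.05},
}


def preset_for(task: str) -> dict:
    """Return a copy of the preset for a task kind (hypothetical helper)."""
    if task not in SAMPLER_PRESETS:
        raise ValueError(f"unknown task kind: {task!r}")
    return dict(SAMPLER_PRESETS[task])
```

Returning a copy means a caller can tweak one run's settings without silently changing the shared preset.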

Honest limitation: v2 is specialized for agentic, multi-step tool-loops, not for one-shot generation of massive single-file projects, which sits at the edge of any 12B's single-pass ability. For big one-shot frontends today, base Gemma 4 12B or v1 may serve you better.

What's next: this testing was on the Q4_K_M quant. I'm going to dive into this properly: pin down exactly where the long-form degradation comes from (quant level, the penalty interaction, or the weights themselves) and fix it at the source. If it turns out to be a model-weight issue, I'll ship an updated build within two days.

Thanks again for the careful testing. This is exactly the kind of feedback that improves the next version. 🙏
