yuxinlu1 committed on
Commit 190a313 · verified · 1 Parent(s): e8a1d38

Add Speculative decoding (MTP draft) section: verified build llama.cpp b9553 (9e3b928fd), regression note for b9702/b9717

Files changed (1)
  1. README.md +29 -0
README.md CHANGED
@@ -207,3 +207,32 @@ handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`; for coding
  **Apache 2.0** too: free to use, modify, and redistribute. 🎉
  - **Base model:** [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it).
  - Personal/hobby project: shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨
+
+ ---
+
+ ## ⚡ Speculative decoding (MTP draft): verified build
+
+ The `MTP/` folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official
+ `gemma-4-12B-it-assistant`) for speculative decoding. Gemma 4 MTP is in **llama.cpp mainline** (PR #23398), with no
+ fork needed, but the `gemma4-assistant` loader is **build-sensitive right now**, so please use the exact build below:
+
+ - ✅ **Verified working: llama.cpp `b9553` (commit `9e3b928fd`).** I reproduced it with `gemma4-v2-Q8_0` + the
+ `MTP-Q8_0` draft: it loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt;
+ expect ~1.2–1.3× on real coding/thinking). **Lossless** either way.
+ - ⚠️ **Newer builds (e.g. b9702 / b9717) currently crash** while loading the draft with `invalid vector subscript`.
+ This is an **upstream regression** in the `gemma4-assistant` loader path, *not* a problem with these GGUFs: the same
+ files load fine on b9553. Stick with **b9553** until it's fixed upstream.
+
+ Working command on b9553 (note the older flag names: `--model-draft`, **not** `--spec-draft-model`):
+
+ ```bat
+ llama-server -m gemma4-v2-Q8_0.gguf ^
+   --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
+   --spec-type draft-mtp --spec-draft-n-max 4 ^
+   -ngl 99 -ngld 99 -fa on --jinja
+ ```
+
+ > ℹ️ The `Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)` log line is harmless.
+ > The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a
+ > model-specific draft would give, though output is still 100% lossless. On small-VRAM cards, Q8 main + long context +
+ > the draft can be tight; drop to Q6_K/Q4_K_M or a smaller `--ctx-size` if you hit OOM.
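
Reviewer note (not part of the commit): the build guidance and the quoted throughput figures in the added section can be sanity-checked with a small script. The following is a minimal sketch, not anything shipped with the model or with llama.cpp. The function names `build_status` and `speedup` are hypothetical, and the build numbers are only the ones this README section mentions; any other build is reported as untested rather than guessed at.

```python
import re

# Build numbers taken from the README section in this diff (assumption:
# nothing is known about builds the README does not mention).
VERIFIED_BUILDS = {9553}
KNOWN_BAD_BUILDS = {9702, 9717}


def build_status(tag: str) -> str:
    """Classify a llama.cpp build tag such as 'b9553'.

    Returns 'verified', 'known-bad', or 'untested'.
    """
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"not a build tag: {tag!r}")
    number = int(match.group(1))
    if number in VERIFIED_BUILDS:
        return "verified"
    if number in KNOWN_BAD_BUILDS:
        return "known-bad"
    return "untested"


def speedup(baseline_tps: float, draft_tps: float) -> float:
    """Throughput ratio with the draft enabled versus without it."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return draft_tps / baseline_tps


if __name__ == "__main__":
    for tag in ("b9553", "b9702", "b9600"):
        print(tag, build_status(tag))
    # Figures quoted in the README for a simple deterministic prompt.
    print(f"{speedup(88, 180):.2f}x")
```

With the quoted figures, 88 to 180 tok/s works out to roughly a 2x ratio, which applies only to the simple deterministic prompt; the README's own estimate for real coding or thinking workloads is about 1.2 to 1.3x.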