- Qwen3.6-27B-DSV4Pro-Thinking-Distill
- 🇬🇧 English
- Training details
- Attribution (the method is not original — it is a combination of published techniques)
- Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B
- Q5_K_M evaluation — distill vs base, streaming harness (re-run)
- MTP (multi-token prediction) single-stream acceleration — measured best config + lossless
- NVFP4 Quantization (vLLM / SGLang high-concurrency + MTP)
- Eval protocol
- Limitations
- Files
- Inference
- Training details
- 🇨🇳 中文版
- 训练配置(如实披露)
- 方法非自创,是公开技术的组合(如实归因)
- 评测(Q4_K_M,旧 harness):同一 harness,thinking-on,vs 原版 Qwen3.6-27B
- Q5_K_M 评测 —— 蒸馏 vs 原版,流式 harness(重测)
- MTP(多 token 预测)单流加速 — 实测最佳配置 + 无损
- NVFP4 量化(vLLM / SGLang 高并发 + MTP)/ NVFP4 (vLLM/SGLang + MTP)
- 评测口径 / Eval protocol
- 局限 / Limitations
- 文件 / Files
- 推理 / Inference
- Claude Code(实验性 / experimental)
- Claude Code (experimental)
- 训练配置(如实披露)
Qwen3.6-27B-DSV4Pro-Thinking-Distill
Lynn Agent edge runtime
This 27B distillation family is the recommended local reasoning model source for Lynn Agent. For desktop and edge use, Lynn recommends the GGUF sibling repo and defaults to Qwen3.6-27B-DSV4Pro-Distill-MTP-Q5_K_M-imatrix.gguf in v0.85.6+.
- Download Lynn Agent: GitHub Releases (current desktop/CLI release: v0.85.6)
- Recommended edge GGUF: Hugging Face 27B GGUF / ModelScope 27B GGUF
- Lynn default quant: Q5_K_M imatrix + native MTP; lower-config users can manually downgrade to 9B / 4B in Lynn settings.
🇬🇧 English · 🇨🇳 中文 ⬇️
🇬🇧 English
On Qwen3.6-27B (Dense, 64 layers, Gated DeltaNet linear/full-attention hybrid), we use LoRA to distill the way DeepSeek-V4-Pro reasons (with thinking-on) plus its agentic behavior.
This is the Dense counterpart of the 35B-A3B (MoE) sister model: same R6000 GPU, same teacher, same recipe, swapped onto a Dense architecture — proving the gains come from the distilled thinking style, not an MoE architectural bonus. A native MTP head is welded on for single-stream acceleration.
⚠️ Distilling a thinking style ≠ distilling knowledge/capability: the goal is "learn how to reason and how to converge", not to inject knowledge or raise the capability ceiling.
Training details
- Base: Qwen3.6-27B (Dense, BF16 base)
- Method: LoRA, r = 64, α = 128, dropout = 0.05, targets = all attention + MLP projections
- Optim: paged_adamw_8bit, cosine LR, warmup 0.03, ~1 epoch
- Teacher: DeepSeek-V4-Pro (thinking-on + agentic)
- Data: ~1842 distillation samples (lynn_prod spec). Trajectories = DS-V4-Pro multi-step reasoning under thinking-on (
<think>) + ReAct-style tool calls (think one step → call one tool → observe → loop).- The tool "execution results" are SIMULATED, not actually run: in the multi-turn tool calls, each "execution result" line is improvised by a small, fast model (DeepSeek-V4-Flash) role-playing the "runtime" — not obtained by actually running code in a sandbox. So it differs from real execution.
- Training masks those fabricated results — the model learns only "how to think / how to call tools", not the made-up outputs: because the results are fake, training on them would teach the model the bad habit of fabricating tool return values; so we optimize only the model's own "reasoning + tool-call" tokens.
- Artifacts: merged → BF16 safetensors →
gguf/Q4_K_M-imatrix (with native MTP)
Attribution (the method is not original — it is a combination of published techniques)
- ReAct (interleaved reasoning + acting): Yao et al., 2022, arXiv:2210.03629 (ICLR 2023)
- STaR (bootstrapping reasoning traces): Zelikman et al., 2022, arXiv:2203.14465
- Self-Instruct / Baize self-chat: Wang et al., 2022; Xu et al., 2023, arXiv:2304.01196
- AgentTuning: Zeng et al., 2023, arXiv:2310.12823
- ToolBench / ToolLLM (tool use): Qin et al., 2023, arXiv:2307.16789
- DeepSeek-R1 reasoning distillation: DeepSeek-AI, 2025, arXiv:2501.12948
Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B
Quantization parity (important — prevents misreading): this model and the base are both Q4_K_M (imatrix-corrected) GGUF + native MTP, fully same-spec — same Q4_K_M, same imatrix, same MTP. The only variable is "distilled or not". This is not distilled-Q4_K_M vs base-BF16 (which would be unfair); the Δ below is cleanly attributable to distillation itself, with no quantization difference mixed in.
| Dimension | This model (distill) | Original base | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 80.81% (160/198, 32K) | 73.7% (146/198) | +7.1 |
| MMLU-500 (5-shot) | 91.8% (459/500) | 91.6% | +0.2 |
| GPQA unconverged empty answers (parse_fail) | 0 | 14 | −14 |
| coding-100 (10 langs × 10) | 86/100 | 83/100 | +3 |
| Agentic SOLO (20 complex tasks) | 16/20 | 13/20 | +3 |
Reading: hard reasoning improves markedly (GPQA +7.1pp, 160/198 = 80.81%, 0 error / 0 parse_fail), and knowledge does not drop — it even nudges up (MMLU +0.2pp), while "finish thinking, then converge" holds — GPQA unconverged empty answers fall from 14 to 0 (the base's 14 were all cases that thought to the 32K limit without ever giving an answer; after distillation, zero). Median generation length is compressed to ~3006 tokens. Note: the 35B-A3B distill lost 1.6pp MMLU, whereas the 27B Dense has more capacity — it fits the distillation without crowding out knowledge: GPQA up, MMLU not down, a cleaner "pure gain".
coding-100: same harness, a real sandbox runs the code and checks whether the tests actually pass (objective). distill 86 ≥ base 83 — coding ability did not drop, and is slightly higher.
Agentic SOLO: the model orchestrates + executes 20 complex tasks by itself; judge = the task/harness author (who knows best whether it was "actually done"). distill 16 > base 13. ⚠️ This metric is judge-subjective (a stricter judge ties the two), so treat it as a trend — the hard numbers are GPQA / coding.
Q5_K_M evaluation — distill vs base, streaming harness (re-run)
Protocol (annotated — DIFFERENT from the Q4_K_M section above; do not cross-compare tiers): both this distill and the base are Q5_K_M-imatrix GGUF + native MTP, same-spec, served base-mode (MTP off) for a concurrency eval. Distill harness = SSE streaming (
stream=True), timeout 1800s · concurrency 4 · max_tokens 32000, thinking-on, temp 0.6 / top_p 0.95,finish_reasonlogged per question. Base GPQA result is from the original conc=4 run (non-streaming), butfinish_reasondata confirms 0 errors (zero false timeouts) — the 12lengthhits each showcompletion_tokens=32768, i.e. genuinely unconverged, not harness artifacts. Base MMLU re-run uses the same SSE streaming harness as distill.
| Dimension | Distill Q5_K_M | Base Q5_K_M | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 81.82% (162/198) | 68.69% (136/198) | +13.13pp |
| MMLU-500 (5-shot) | 90.0% (450/500) | 89.6% (448/500) | +0.4pp |
GPQA finish=stop (converged) |
198 / 198 | 186 / 198 | |
GPQA finish=length (hit 32K wall, never answered) |
0 | 12 | |
| GPQA errors (timeout/etc.) | 0 | 0 |
Reading: under the streaming harness (zero false-timeouts, confirmed by errors=0), the distill converges on every single question (198/198 stop, 0 length), while the base runs into the 32K wall on 12 questions (length, never produces an answer). This is the hardest, cleanest evidence of the distillation's "learn to converge / 收口" effect — now quantified by finish_reason, not just accuracy.
⚠️ Do NOT cross-compare quant tiers: the Q4_K_M table uses an older (non-streaming) harness; this Q5_K_M table uses the streaming harness. Comparing e.g. base-Q4 vs base-Q5 across tiers is meaningless (harness differs). Only the within-tier distill-vs-base Δ is valid.
MTP (multi-token prediction) single-stream acceleration — measured best config + lossless
Two MTP paths: GGUF via llama.cpp (
--spec-type draft-mtp, below); BF16 / FP8 safetensors via vLLM / SGLang (--speculative-config '{"method":"mtp","num_speculative_tokens":3}') — both the BF16 and FP8 repos now bundle the native nextn head (SGLang-measured accept 0.76–0.88; BF16 suits Ampere-class cards without native FP8).
This model's gguf contains a native MTP head (mainline llama.cpp --spec-type draft-mtp; no -md / external draft model needed).
Best config measured (single-stream, Q4_K_M-imatrix; tested on DGX Spark GB10, unified-memory bandwidth-bound — Mac / Blackwell RTX-50 (FP4) can be faster):
--spec-draft-n-max (p-min=0) |
single-stream TPS | draft accept rate | mean accept len |
|---|---|---|---|
| bare no-MTP | 10.4 | — | — |
| n-max=2 | 24.1 | 0.82 | 2.64 |
| n-max=3 ⭐ (recommended) | 26.8 | 0.72 | 3.16 |
| n-max=4 | 27.4 | 0.65 | 3.62 |
- 2.3–2.6× single-stream speedup (vs bare 10.4 TPS); n-max=3 is the throughput/accept-rate balance point.
- Greedy speculative decoding is lossless by construction: it only accepts the target-argmax token. Batched-verify GEMM rounding produces character-level differences on near-tie tokens — this is FP non-determinism, not quality loss (any two independent runs show it, even with MTP off).
- Speculation is a single-stream latency tool; concurrency degrades it (spec tokens take up KV/batch capacity) — for throughput scenarios use bare multi-concurrency mode.
- Recommended launch:
llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja
Note: single-stream TPS varies by content — coding prompts accept ~0.72 → 26.8 t/s, reasoning prompts ~0.58 → 24.6 t/s (n-max=3, all measured on DGX Spark GB10). The current MTP is the base's native nextn head grafted on (lossless); the base head predicts a bit weakly on the post-distillation reasoning distribution, so the reasoning accept rate is lower. A distill-specific retrained MTP head (to pull accept back to ~0.8) is on the roadmap.
📏 GB vs GiB. File sizes on the sibling GGUF repo are decimal GB (10⁹ bytes), but GPU VRAM is built in GiB (2³⁰) and merely labeled "GB" — so a "24 GB" card is really 24 GiB ≈ 25.8 GB. Rule of thumb: a file runs with no CPU offload when its GiB size < the card's nominal "GB" number (e.g. a 22.4 GB = 20.9 GiB quant fits a 24 GB card with room for KV).
NVFP4 Quantization (vLLM / SGLang high-concurrency + MTP)
Multi-tier NVFP4 quantization + the official MTP head live in a dedicated NVFP4 repo (ModelScope · HF nerkyor), for vLLM / SGLang deployment.
⚠️ Speeds @ RTX PRO 6000 Blackwell (sm120) + vLLM; the GGUF speeds above are @ Spark GB10 + llama.cpp — different hardware, don't cross-read.
| Tier | Directory | GPQA-D 198 | MMLU-500 | Size | MTP peak single-stream @R6000 | Rec. N |
|---|---|---|---|---|---|---|
| W4A16 | root | 82.83% | 87.80% | 29G | 93.7 tok/s (2.08×) | 4 |
| W4A4 | w4a4/ |
77.27% | 91.40% | 20G | 111 tok/s (1.76×) | 2 |
| W4A8 | pending engine support | ≈W4A16 (expected) | — | ~19G | — | — |
- W4A16: quality-first (only MLP weights quantized; attention / Mamba / vision / embeddings /
lm_head/ norms kept BF16). W4A4: speed/VRAM-first (MLP + attention + GDN projections quantized, conv1d kept BF16). - Both embed the official nextn MTP head; MTP stays net-positive under concurrency too (W4A4 still +44% at c16; break-even concurrency is >16 — earlier "turn off MTP at high concurrency" advice is superseded).
- W4A8 (weight FP4 + act FP8, quality expected ≈ W4A16) quantizes fine, but vLLM 0.23's loader doesn't yet accept its
quant_algo— aw4a8/tier will be added once the engine supports it. - Load commands / MTP settings / full comparison: see the NVFP4 repo card.
Eval protocol
thinking-on; temp 0.6 / top_p 0.95 (required for thinking models — greedy loops to death); max_tokens 32768; read-timeout ≥ 2400s. The same spec is applied to every compared model.
Limitations
- Distills thinking style, not capability: black-box SFT cannot raise the knowledge ceiling.
- Tool execution results are "simulated", not actually run:
- This version (compromise): each "execution result" line in the multi-turn tool calls is improvised by a small model (DeepSeek-V4-Flash) role-playing the "runtime", not obtained by actually running code in a sandbox. Chosen purely for cost and speed — real execution needs a full "generate → run in a real sandbox → feed results back to the teacher → continue" agentic harness, which is slow and heavy; one simulated pass is enough. This is an engineering trade-off, not because it is better.
- Cost (sim-to-real gap): a simulated result can be wrong (Flash may optimistically fabricate "tests passed" when that code would actually crash) → the model can learn from "fake-success" trajectories, and may even acquire the tendency to fabricate tool return values itself.
- Optimal approach (coming in the next version) = real-sandbox execution + rejection sampling: every tool call runs in a real environment to get a real result, then a judge keeps only the trajectories that genuinely solved the task and discards the failed ones — eliminating "fake success" at the root. We have already implemented this pipeline (real sandbox + DS judge), but this version's data did not use it; the next distillation will be redone with it.
- Note: simulation ≠ rejection sampling — simulation is about "how the observation is obtained" (fabricate vs. really run); rejection sampling is about "filtering out the wrong ones by real outcome". Because simulation never really runs, it leaves no ground on which rejection sampling could even operate.
Files
*.safetensors— BF16 merged weights (SGLang / vLLM / transformers)gguf/Qwen3.6-27B-DSV4Pro-Distill-MTP-Q4_K_M-imatrix.gguf— the only GGUF, native MTP version (Q4_K_M-imatrix). Add--spec-type draft-mtpfor the fastest single-stream; without that flag it is just a normal Q4_K_M model (MTP head inactive) — so no separate "non-MTP plain version" is provided, to keep anyone from downloading the wrong file and thinking it lacks MTP.- NVFP4 (W4A16 + W4A4, both +MTP) — quality-first ModelOpt NVFP4. Language MLP
gate/up/down_projcompressed to FP4; attention, Mamba, vision, embeddings,lm_head, norms kept high-precision. vLLM/SGLang. W4A16 (quality, GPQA 82.83 / MMLU 87.80, 29G) + W4A4 (speed/VRAM, 77.27 / 91.40, 20G), both ship the official MTP head — fast single-stream too (@R6000: W4A16 93.7 / W4A4 111 tok/s).
Inference
thinking-on, always use temp=0.6, top_p=0.95 (never greedy). llama.cpp: gguf + --jinja (MTP version add --spec-type draft-mtp --spec-draft-n-max 3); SGLang / vLLM: safetensors.
🇨🇳 中文版
在 Qwen3.6-27B(Dense,64 层,Gated DeltaNet 线性/全注意力混合)上,用 LoRA 蒸馏 DeepSeek-V4-Pro 在「思考开启(thinking-on)」时的思维方式 + agentic 行为。
这是 35B-A3B(MoE)姊妹版的 Dense 复现:同一台 R6000、同一 teacher、同一套配方,换到 Dense 架构——证明提升来自蒸进去的思维方式,不是 MoE 架构红利。并焊了原生 MTP 做单流加速。
⚠️ 蒸思维方式 ≠ 蒸知识/能力:目标是「学会怎么想、怎么收口」,不是蒸知识或扩能力上限。
训练配置(如实披露)
- 基座 Base:Qwen3.6-27B(Dense,BF16 基座)
- 方法 Method:LoRA,r = 64,α = 128,dropout = 0.05,target = 全部注意力 + MLP 投影
- 优化:paged_adamw_8bit,cosine LR,warmup 0.03,约 1 epoch
- Teacher:DeepSeek-V4-Pro(thinking-on + agentic)
- 数据 Data:~1842 条蒸馏样本(lynn_prod 口径)。轨迹 = DS-V4-Pro 在 thinking-on 下的多步推理(
<think>)+ ReAct 式工具调用(想一步 → 调一次工具 → 看结果,循环)。- 工具的「执行结果」是模拟的,不是真跑的:多轮工具调用里那一行行「执行结果」,是用一个又小又快的模型(DeepSeek-V4-Flash)扮演"运行环境"现编出来的,并不是真的在沙箱里跑代码得到的——所以和真实运行有差距。
- **训练时只学"怎么想、怎么调工具",不学那些编出来的"执行结果"**:因为执行结果是假的,如果让模型去学它,模型就会养成"自己瞎编工具返回值"的坏习惯;所以我们只优化模型自己产出的「思考 + 工具调用」部分。
- 产物:合并 → BF16 safetensors →
gguf/Q4_K_M-imatrix(含原生 MTP 版)
方法非自创,是公开技术的组合(如实归因)
- ReAct(推理+行动交替):Yao et al., 2022, arXiv:2210.03629(ICLR 2023)
- **STaR(reasoning trace 自举)**:Zelikman et al., 2022, arXiv:2203.14465
- Self-Instruct / Baize 自对话:Wang et al., 2022;Xu et al., 2023, arXiv:2304.01196
- AgentTuning:Zeng et al., 2023, arXiv:2310.12823
- **ToolBench / ToolLLM(工具调用)**:Qin et al., 2023, arXiv:2307.16789
- DeepSeek-R1 推理蒸馏:DeepSeek-AI, 2025, arXiv:2501.12948
评测(Q4_K_M,旧 harness):同一 harness,thinking-on,vs 原版 Qwen3.6-27B
量化口径(重要,防误读):本模型与原版 base 都是 Q4_K_M(imatrix 校正)GGUF + 原生 MTP,完全同口径 —— 同 Q4_K_M、同 imatrix、同 MTP,唯一变量是"是否蒸馏"。不是拿蒸馏-Q4_K_M 去比 base-BF16(那样不公平);下面的 Δ 干净地归因于蒸馏本身,不掺量化差异。
| 维度 | 本模型(蒸馏) | 原版 base | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 80.81%(160/198,32K) | 73.7%(146/198) | +7.1 |
| MMLU-500 (5-shot) | 91.8%(459/500) | 91.6% | +0.2 |
| GPQA 未收口空答 (parse_fail) | 0 | 14 | -14 |
| coding-100 (10 语言×10) | 86/100 | 83/100 | +3 |
| Agentic SOLO (20 复杂任务) | 16/20 | 13/20 | +3 |
解读:硬推理显著提升(GPQA +7.1pp,160/198=80.81%,0 error/0 parse_fail)、知识不降反微涨(MMLU +0.2pp),且「想完就收口」—— GPQA 未收口空答从 14 降到 0(base 那 14 个全是思考到 32K 上限还没给答案;蒸馏后彻底归零)。中位生成长度压到 ~3006 token。注:35B-A3B 蒸馏 MMLU 掉 1.6pp,27B Dense 容量更大、装得下蒸馏而不挤占知识 —— **GPQA 涨、MMLU 不降,更干净的"纯赚"**。
coding-100:同一 harness、真沙箱跑代码看测试是否真过(客观)。distill 86 ≥ base 83,coding 能力没掉、还略高。
Agentic SOLO:模型自己编排+自己执行 20 道复杂任务,判官 = 出题/harness 作者(对"做没做到"最清楚)。distill 16 > base 13。⚠️ 此项判官主观性强(换更严判官两者打平),作趋势参考,硬指标看 GPQA/coding。
Q5_K_M 评测 —— 蒸馏 vs 原版,流式 harness(重测)
口径(已标注 —— 与上方 Q4_K_M 段口径不同,禁止跨档比):本蒸馏与原版均为 Q5_K_M-imatrix GGUF + 原生 MTP、同规格、base 模式(MTP 关)跑并发评测。蒸馏 harness = SSE 流式(
stream=True—— 根除非流式客户端"干等满整段"造成的假超时,否则长思考会被误判超时),timeout 1800s · 并发 4 · max_tokens 32000,thinking-on,temp 0.6 / top_p 0.95,逐题记finish_reason→ stop(收口)/ length(撞 32K 墙、没答案)/ error。base GPQA 取自最初 conc=4(非流式)run,但finish_reason已确认 0 error(零假超时)—— 那 12 个length各completion_tokens=32768,是真没收敛、非 harness artifact;base MMLU 用与蒸馏同一套 SSE 流式 harness 重测。
| 维度 | 蒸馏 Q5_K_M | 原版 Q5_K_M | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 81.82%(162/198) | 68.69%(136/198) | +13.13pp |
| MMLU-500(5-shot) | 90.0%(450/500) | 89.6%(448/500) | +0.4pp |
GPQA finish=stop(收口) |
198 / 198 | 186 / 198 | |
GPQA finish=length(撞 32K 墙、始终没答) |
0 | 12 | |
| GPQA error(超时等) | 0 | 0 |
解读:流式 harness 下(errors=0 证明零假超时),蒸馏每题都收口(198/198 stop、0 length),而原版有 12 题撞 32K 墙(length、始终给不出答案)。这是蒸馏「学会收口」最硬、最干净的证据 —— 由 finish_reason 量化,不只看准确率。
⚠️ 禁止跨量化档比:Q4_K_M 表用旧(非流式)harness,本 Q5_K_M 表用流式 harness。跨档比(如 base-Q4 vs base-Q5)无意义(harness 不同)。只有同档内 蒸馏-vs-原版 的 Δ 有效。
MTP(多 token 预测)单流加速 — 实测最佳配置 + 无损
本模型的 gguf 含原生 MTP 头(mainline llama.cpp --spec-type draft-mtp,无需 -md / 外挂 draft 模型)。
两条 MTP 通路:GGUF 走 llama.cpp(
--spec-type draft-mtp,见下);BF16 / FP8 safetensors 走 vLLM / SGLang(--speculative-config '{"method":"mtp","num_speculative_tokens":3}')—— BF16 与 FP8 仓现均已焊原生 nextn 头(SGLang 实测 accept 0.76–0.88;BF16 在 Ampere 等无原生 FP8 的卡上更合适)。
**最佳配置实测(单流,Q4_K_M-imatrix;测于 DGX Spark GB10,统一内存带宽受限 —— Mac / Blackwell RTX-50(FP4) 可更快)**:
--spec-draft-n-max(p-min=0) |
单流 TPS | draft 接受率 | 平均接受长度 |
|---|---|---|---|
| 裸版 no-MTP | 10.4 | — | — |
| n-max=2 | 24.1 | 0.82 | 2.64 |
| n-max=3 ⭐(推荐) | 26.8 | 0.72 | 3.16 |
| n-max=4 | 27.4 | 0.65 | 3.62 |
- 2.3–2.6× 单流加速(vs 裸版 10.4 TPS);n-max=3 是吞吐/接受率平衡点。
- 贪心投机解码构造上无损:only accepts target-argmax token。批量 verify 的 GEMM 舍入会在 near-tie token 上产生字符级差异,这是 FP 非确定性、非质量损失(任意两次独立进程都会,哪怕都不开 MTP)。
- 投机=单流延迟工具,并发会退化(spec token 占 KV/batch 容量)—— 吞吐场景请用多并发裸版。
- 推荐启动:
llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja
注:单流 TPS 因内容而异——编码类 prompt 接受率 ~0.72 → 26.8 t/s,推理类 ~0.58 → 24.6 t/s(n-max=3,均测于 DGX Spark GB10)。当前 MTP 为 base 原生 nextn 头嫁接(无损);base 头在蒸馏后偏移的推理分布上预测偏弱,故推理类接受率偏低。蒸馏专属重训 MTP 头(把接受率拉回 ~0.8)在路线图上。
各量化档 MTP 速度对比 / Per-quant MTP speed
所有 gguf 均焊原生 MTP。实测 DGX Spark GB10,单流,coding prompt,thinking-on,--spec-draft-n-max 3:
| 量化档 / Quant | 体积 / Size | 裸版 base TPS | MTP TPS | 加速 / Speedup | 接受率 / Accept |
|---|---|---|---|---|---|
Q4_K_M-imatrix |
16.8 GB | 10.4 | 26.8 | 2.6× | 0.72 |
Q5_K_M-imatrix |
19.5 GB | 10.37 | 24.65 | 2.38× | 0.72 |
Q6_K-imatrix |
22.4 GB | 9.18 | 22.07 | 2.40× | 0.73 |
Q8_0 |
29 GB | 7.82 | 17.12 | 2.19× | 0.67 |
越小越快(内存带宽受限);MTP 全档 2.2–2.6× 加速,各档输出实测均正确(回文 / fibonacci 等编码题)。**Q8_0 ≈ BF16 质量**(8-bit 近无损;不带 imatrix——均匀 8-bit,重要性加权对它无意义)。
Smaller = faster (memory-bandwidth-bound); MTP gives 2.2–2.6× across all tiers; Q8_0 ≈ BF16 quality (near-lossless 8-bit).
📏 GB vs GiB:表中体积为十进制 GB(10⁹ 字节);显卡显存按 GiB(2³⁰) 造却标 "GB",故 "24 GB" 卡实为 24 GiB ≈ 25.8 GB —— Q6_K(22.4 GB = 20.9 GiB)能进 24 GB 卡不用 offload。规则:文件的 GiB 数 < 卡标称 "GB" 数即可不 offload。
NVFP4 量化(vLLM / SGLang 高并发 + MTP)/ NVFP4 (vLLM/SGLang + MTP)
NVFP4 多档量化 + 官方 MTP 头在 独立 NVFP4 仓(ModelScope · HF nerkyor),面向 vLLM / SGLang 部署。
⚠️ 速度 @ RTX PRO 6000 Blackwell (sm120) + vLLM;上方 GGUF 速度 @ Spark GB10 + llama.cpp —— 不同硬件,勿混读。
| 档 / Tier | 目录 | GPQA-D 198 | MMLU-500 | 大小 | MTP 单流峰值 @R6000 | 推荐 N |
|---|---|---|---|---|---|---|
| W4A16 | 根 / root | 82.83% | 87.80% | 29G | 93.7 tok/s (2.08×) | 4 |
| W4A4 | w4a4/ |
77.27% | 91.40% | 20G | 111 tok/s (1.76×) | 2 |
| W4A8 | 待引擎支持 | ≈W4A16(预期) | — | ~19G | — | — |
- W4A16:质量优先(仅量 MLP 权重,attention/Mamba/视觉等保 BF16)。W4A4:速度/显存优先(量 MLP+attention+GDN,conv1d 保 BF16)。
- 两档均内置官方 nextn MTP 头;MTP 并发同样正收益(W4A4 c16 仍 +44%,临界并发 >16 不转负)。
- W4A8(weight FP4 + act FP8,质量预期 ≈ W4A16)已可量化但 vLLM 0.23 loader 暂不支持其 quant_algo,待引擎支持后补
w4a8/。 - 加载命令 / MTP 设置 / 完整对比详见 **NVFP4 仓卡片**。
评测口径 / Eval protocol
thinking-on;temp 0.6 / top_p 0.95(thinking 模型必需,greedy 会重复死循环);max_tokens 32768;read-timeout ≥2400s。同口径作用于所有对比模型。
局限 / Limitations
- 蒸思维方式,非蒸能力:黑盒 SFT 抬不高知识天花板。
- 工具执行结果是"模拟"的,不是真跑出来的:
- 本版(迁就方案):多轮工具调用里那一行行「执行结果」,是用一个小模型(DeepSeek-V4-Flash)扮演"运行环境"现编的,不是真在沙箱里跑代码得到的。选它纯粹是为了省成本、快——真实执行需要一整套"边生成边在真沙箱里跑、再把结果喂回 teacher 继续"的 agentic harness,慢且重;模拟一遍过就行。这是工程上的取舍,不是因为它更好。
- 代价(sim-to-real gap):模拟结果可能是错的(flash 会乐观地编一句"测试通过",但那段代码真跑其实会挂)→ 模型可能从"假成功"的轨迹里学到东西,甚至养成"自己瞎编工具返回值"的倾向。
- 最优方案(下一版补)= 真沙箱执行 + 拒绝采样:每一步工具调用都在真实环境里跑出真结果,再用判官只保留"真正把任务做对"的轨迹、扔掉失败的,从根上消除"假成功"。这条管线我们已经实现(真沙箱 + DS 判官),但本版数据未纳入,下一版蒸馏会用它重做。
- 注:模拟 ≠ 拒绝采样——模拟是"怎么拿到 observation"(编 vs 真跑),拒绝采样是"按真实结果筛掉做错的";模拟因为没真跑,反而让拒绝采样无从谈起。
文件 / Files
*.safetensors— BF16 合并权重(SGLang / vLLM / transformers)gguf/— 4 档原生 MTP GGUF(都焊 MTP;加--spec-type draft-mtp单流最快,不加即当普通 gguf 用,MTP 头不激活):…-MTP-Q4_K_M-imatrix.gguf(~16 GB,最快)…-MTP-Q5_K_M-imatrix.gguf(19.5 GB)…-MTP-Q6_K-imatrix.gguf(22.4 GB)…-MTP-Q8_0.gguf(29 GB,≈ BF16 质量,无 imatrix)
- FP8(block-128 e4m3 + 原生 MTP,SGLang serving)在独立仓
Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 - NVFP4 多档(W4A16 + W4A4,均含 MTP)(质量优先,vLLM/SGLang 高并发)—— language MLP gate/up/down_proj NVFP4,其余模块保高精度。W4A16(质量优先,82.83/87.80,29G)+ W4A4(速度/显存,77.27/91.40,20G),两档均内置官方 MTP,单流也快(@R6000:W4A16 93.7 / W4A4 111 tok/s)。
推理 / Inference
thinking-on,务必 temp=0.6, top_p=0.95(切勿 greedy)。llama.cpp 用 gguf + --jinja(MTP 版加 --spec-type draft-mtp --spec-draft-n-max 3);SGLang/vLLM 用 safetensors。
Claude Code(实验性 / experimental)
Claude Code 支持目前属于实验性接入。Claude Code 需要兼容 Anthropic /v1/messages 的服务端和稳定的工具调用;直接使用 OpenAI-compatible chat endpoint 可能无法正常工作,需要桥接或兼容运行时。
本模型使用 Qwen3 XML 工具调用格式(<tool_call><function=name><parameter=...>)。vLLM 路线:
vllm serve /path/to/model \
--served-model-name qwen36-27b-distill \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml
Claude Code 指向 served-model-name(不要用带 / 的 HF repo id);或使用 LM Studio 0.4.1+(内置 Claude Code /v1/messages)。参考 vLLM Claude Code · LM Studio。
Claude Code (experimental)
Claude Code support is experimental. It needs an Anthropic-compatible /v1/messages endpoint and stable tool calling; a direct OpenAI-compatible chat endpoint may not work and requires a bridge or compatible runtime. The model uses the Qwen3 XML tool format — on vLLM use --tool-call-parser qwen3_xml (command above). Point Claude Code at the served-model-name (no slashes), or use LM Studio 0.4.1+ (built-in Claude Code /v1/messages).
- Downloads last month
- 20,250