GLM-5.2: Mixed-Precision GGUF (IQ2_S experts · IQ4_NL rest)

Mixed-bit GGUF re-quantization of zai-org/GLM-5.2 (256×22B Mixture-of-Experts, architecture glm-dsa), produced with llama-quantize's per-tensor mixed-precision workflow and an importance matrix.

The goal of this quant is a small memory footprint: the MoE expert tensors are dropped to IQ2_S (≈2.64 BPW overall) while the dense and sensitive tensors (attention, norms, shared components, embeddings) stay at IQ4_NL. The last MoE block (blk.78.*) is kept at IQ4_NL because llama-quantize does not find separate weights for it (it is covered by the shared/head tensors); see the mapping table below.

Model particulars (from GGUF KV metadata)

Key | Value
Architecture | glm-dsa
Name / version | GLM-5.2 / 5.2
Size label | 256x22B (256 experts, 8 active, 1 shared)
Block count | 79 (3 leading dense + 76 MoE)
Context length | 1,048,576 (1M tokens)
Embedding length | 6144
Feed-forward length (dense) | 12288
Expert FF length | 2048
Attention heads / KV heads | 64 / 1 (MLA, q_lora_rank=2048, kv_lora_rank=512, key_length_mla=256, value_length_mla=256)
RoPE base / dim | 8,000,000 / 64
Vocabulary | 154,880 (tokenizer glm4 / gpt2)
Expert gating | func=2, weights_scale=2.5, weights_norm=true
NextN predict layers | 1
License | MIT

Quantization mapping

Per-tensor type assignment passed to llama-quantize (--tensor-type overrides):

Tensor pattern | Quant
blk.78.ffn_down_exps | IQ4_NL
blk.78.ffn_gate_exps | IQ4_NL
blk.78.ffn_up_exps | IQ4_NL
ffn_gate_exps (all other blocks) | IQ2_S
ffn_up_exps (all other blocks) | IQ2_S
ffn_down_exps (all other blocks) | IQ2_S
everything else (attention, norms, embeddings, shared head, indexer) | IQ4_NL

  • Source GGUF: unsloth/GLM-5.2-GGUF → this repo's UD-IQ4_NL variant (9 shards, IQ4_NL), re-quantized here with --allow-requantize and --keep-split.
  • Importance matrix: imatrix_unsloth.gguf (sourced from Unsloth).
  • Final size: ≈232 GiB across 9 shards, ≈2.64 BPW (input ≈3.95 BPW at q8_0).
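
The per-tensor mapping above can be sketched as a dry run that only assembles and prints a llama-quantize command line. The input and output file names are hypothetical placeholders. Two details are assumptions rather than facts from this card: that --tensor-type patterns are matched against tensor names as regular expressions, and that the first matching override wins (which is why the blk.78 entries are listed first). Check llama-quantize --help on your build before running the printed command.

```shell
#!/bin/sh
# Dry run: print the command one argument per line; nothing is executed.

# Hypothetical placeholder paths (not taken from this card).
SRC="source-IQ4_NL-00001-of-00009.gguf"
DST="GLM-5.2-mixed-IQ2_S-IQ4_NL.gguf"
IMATRIX="imatrix_unsloth.gguf"

build_quantize_cmd() {
  set -- llama-quantize --allow-requantize --keep-split --imatrix "$IMATRIX"
  # Assumption: the first matching override wins, so blk.78 goes first.
  for t in ffn_down_exps ffn_gate_exps ffn_up_exps; do
    set -- "$@" --tensor-type "blk\.78\.${t}=iq4_nl"
  done
  # Expert tensors in all other blocks drop to IQ2_S.
  for t in ffn_gate_exps ffn_up_exps ffn_down_exps; do
    set -- "$@" --tensor-type "${t}=iq2_s"
  done
  # Positional arguments: input, output, default type for everything else.
  set -- "$@" "$SRC" "$DST" IQ4_NL
  printf '%s\n' "$@"
}

build_quantize_cmd
```

Printing the arguments instead of executing them makes it easy to review the override order before committing to a multi-hour quantization run.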

Files

Filenames include the IQ2_S/IQ4_NL quant tokens so Hugging Face's quantization-variant scanner recognizes the shards (single-quant names are not possible for a mixed-precision quant; both constituent quants are listed).

File | Size
GLM-5.2-mixed-IQ2_S-IQ4_NL-00001-of-00009.gguf | 9.0 MiB (headers/tokenizer)
GLM-5.2-mixed-IQ2_S-IQ4_NL-00002-of-00009.gguf | 29.9 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00003-of-00009.gguf | 31.0 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00004-of-00009.gguf | 31.1 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00005-of-00009.gguf | 31.0 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00006-of-00009.gguf | 31.0 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00007-of-00009.gguf | 31.1 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00008-of-00009.gguf | 31.0 GiB
GLM-5.2-mixed-IQ2_S-IQ4_NL-00009-of-00009.gguf | 16.2 GiB
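
As a sanity check on the stated totals, the shard sizes listed above can be summed and converted to bits per weight. This sketch assumes roughly 754 billion parameters (the figure the Hugging Face model page reports for this repo) and uses the rounded sizes from the table, so the result should land close to, but not necessarily exactly on, the stated ≈2.64 BPW.

```shell
#!/bin/sh
# Sum the rounded shard sizes from the table and estimate bits per weight.
export LC_ALL=C   # force '.' as the decimal separator in awk output

PARAMS=754000000000   # assumption: ~754B parameters, as shown on the model page
HEADER_MIB=9.0        # shard 00001 (headers/tokenizer)

# Shards 00002..00009 in GiB, plus the header shard converted from MiB.
total_gib=$(printf '%s\n' 29.9 31.0 31.1 31.0 31.0 31.1 31.0 16.2 |
  awk -v h="$HEADER_MIB" '{ s += $1 } END { printf "%.1f", s + h / 1024 }')

# GiB -> bytes -> bits, divided by the parameter count.
bpw=$(awk -v g="$total_gib" -v p="$PARAMS" \
  'BEGIN { printf "%.2f", g * 1024 * 1024 * 1024 * 8 / p }')

echo "total: ${total_gib} GiB, about ${bpw} bits per weight"
```

Rounding in the table and in the parameter count accounts for small differences from the stated figure.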

Usage

Load with any recent llama.cpp build (or a compatible runner such as LM Studio, Ollama, or koboldcpp) that supports the glm-dsa architecture, MLA attention, and IQ2_S / IQ4_NL dequantization. GPU offload is strongly recommended.

llama-server \
  -m GLM-5.2-mixed-IQ2_S-IQ4_NL-00001-of-00009.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 999 -c 8192

The first shard is treated as the entry point; llama.cpp follows the split-file links to load all 9 shards automatically. Point -m at 00001-of-00009.
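
Because loading starts from the first shard and pulls in the rest, a missing or misnamed shard only surfaces at load time. A minimal pre-flight sketch, assuming all 9 shards sit in one directory under the file names listed in this card (the missing_shards helper is hypothetical, not part of llama.cpp):

```shell
#!/bin/sh
# Print the name of every expected shard that is absent from a directory.
# The function body runs in a subshell, so its variables stay local.
missing_shards() (
  dir="${1:-.}"
  total=9
  i=1
  while [ "$i" -le "$total" ]; do
    name=$(printf 'GLM-5.2-mixed-IQ2_S-IQ4_NL-%05d-of-%05d.gguf' "$i" "$total")
    [ -f "$dir/$name" ] || echo "$name"
    i=$((i + 1))
  done
)

dir="${1:-.}"
missing=$(missing_shards "$dir")
if [ -n "$missing" ]; then
  printf 'missing shards in %s:\n%s\n' "$dir" "$missing"
else
  echo "all 9 shards present; point -m at shard 00001-of-00009"
fi
```

This checks presence only; it does not verify file sizes or checksums.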

Provenance

  • Base model: zai-org/GLM-5.2 (MIT license).
  • Source GGUF quantization: Unsloth (general.quantized_by = Unsloth, general.repo_url = https://huggingface.co/unsloth).
  • This mixed-precision re-quant + imatrix: Deviad (2026-06-20), on Apple M3 Ultra (Metal 4 build of llama.cpp, build 1 / 4b48a53).

Disclaimer

This is an aggressive low-bit quantization intended to fit a very large MoE into constrained memory. Expect measurable quality degradation versus the source, especially because the expert tensors are at IQ2_S. Validate on your own tasks.
