GLM-5.2-JANGTQ_K

JANGTQ (JANG TurboQuant) quantization of zai-org/GLM-5.2 (744B-parameter glm_moe_dsa MoE) for MLX on Apple silicon. TurboQuant applies a random-sign Hadamard rotation, a per-row FP16 norm, and a per-(dim,bits) Lloyd-Max codebook to the routed experts, keeping the backbone at higher precision.
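
The three ingredients named above can be illustrated with a small NumPy sketch. This is not the JANGTQ implementation: the function names are hypothetical, the codebook is fitted on synthetic samples under the assumption that the coordinates of a rotated unit-norm row are roughly N(0, 1/dim), and the row length must be a power of two so that a Sylvester-Hadamard matrix exists.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(dim, bits, rng, n_samples=50_000, iters=25):
    """Fit 2**bits scalar levels that depend only on (dim, bits)."""
    # Assumption: rotated unit-norm coordinates are approximately N(0, 1/dim).
    samples = rng.standard_normal(n_samples) / np.sqrt(dim)
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        # Lloyd-Max: assign each sample to its nearest level, then re-center.
        nearest = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            members = samples[nearest == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def quantize_rows(W, bits, seed=0):
    """Return (codes, per-row FP16 norms, codebook, signs) for a 2-D matrix."""
    dim = W.shape[1]
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=dim)
    rotated = (W * signs) @ hadamard(dim)      # random-sign Hadamard rotation
    norms = np.linalg.norm(rotated, axis=1, keepdims=True).astype(np.float16)
    unit = rotated / norms.astype(np.float64)  # rows now have ~unit norm
    codebook = lloyd_max_codebook(dim, bits, rng)
    codes = np.abs(unit[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, norms, codebook, signs

def dequantize_rows(codes, norms, codebook, signs):
    """Invert the steps above: look up levels, rescale, undo the rotation."""
    rotated = codebook[codes] * norms.astype(np.float64)
    return (rotated @ hadamard(codes.shape[1]).T) * signs
```

Calling `dequantize_rows(*quantize_rows(W, bits=4))` on a random matrix gives an approximate reconstruction of `W`; lowering `bits` to 2 increases the error, which is the trade-off behind the mixed 2-bit/4-bit profile below.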

Profile: JANGTQ_K (max-quality mixed precision) — ~260 GB on disk.

Component	Precision
Routed experts — gate_proj / up_proj	2-bit MXTQ (codebook + Hadamard)
Routed experts — down_proj	4-bit MXTQ (codebook + Hadamard)
Attention (MLA + DSA indexer)	FP16
Shared experts	FP16
Router / norms	FP16
Embeddings / LM head	FP16
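
As a rough worked example of what the table implies for the routed experts: in a gated MLP the gate, up, and down projections each hold the same number of weights, so the mean code width is a plain average of the three bit widths. The sketch below uses hypothetical helper names, and the row length passed to the overhead function is an illustrative value, not a GLM-5.2 dimension. Codebook storage is ignored.

```python
def routed_expert_code_bits(gate_bits=2, up_bits=2, down_bits=4):
    """Mean code bits per routed-expert weight.

    gate_proj, up_proj and down_proj each hold
    hidden_size * intermediate_size weights, so a plain mean applies.
    """
    return (gate_bits + up_bits + down_bits) / 3

def norm_overhead_bits(row_len):
    """Extra bits per weight from one FP16 norm stored per row."""
    return 16 / row_len

# About 2.67 code bits per routed-expert weight for this profile.
print(round(routed_expert_code_bits(), 2))
# Illustrative row length only (assumption, not taken from the model config).
print(norm_overhead_bits(2048))
```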

MTP (multi-token-prediction) head is dropped — it serves speculative decoding only and is unused by the MLX single-token decode path.

Requirements

~260 GB of unified memory. Whole-machine model; will not load alongside other large jobs. A 512 GB Mac (e.g. M3 Ultra) loads it comfortably.
Load with the jang-tools package. Not supported by stock MLX, LM Studio, or Ollama.
An IndexShare-patched runtime is mandatory. GLM-5.2 introduces IndexShare: most sparse-attention layers are shared and carry no DSA indexer weights; they reuse the top-k token selections computed by the periodic full layers. Stock mlx_lm's glm_moe_dsa model runs an indexer on every layer, so it cannot load this bundle as-is.
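
A quick pre-flight check of installed memory can save a long failed load. The following is a minimal sketch using only the Python standard library; the helper names and the 32 GB headroom figure are assumptions for illustration (the card only states the ~260 GB footprint), and `os.sysconf` reports physical memory, not what is currently free.

```python
import os

REQUIRED_GB = 260  # approximate bundle size stated on this card

def total_memory_gb():
    """Physical memory in GB, read via POSIX sysconf (macOS and Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

def can_hold_bundle(total_gb, required_gb=REQUIRED_GB, headroom_gb=32):
    """True if the machine has room for the weights plus some headroom.

    headroom_gb is a hypothetical allowance for the OS, KV cache and
    activations; tune it for your own workload.
    """
    return total_gb >= required_gb + headroom_gb

if __name__ == "__main__":
    mem = total_memory_gb()
    verdict = "ok" if can_hold_bundle(mem) else "too small for this bundle"
    print(f"{mem:.0f} GB installed: {verdict}")
```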

IndexShare patch

Stock mlx_lm (through 0.31.3) does not implement IndexShare. The patched module ships in the jang-tools runtime: it builds a DSA indexer only on full layers and reuses the most-recent full layer's indices on shared layers (with a matching make_cache). On stock mlx_lm you must apply an equivalent override to mlx_lm/models/glm_moe_dsa.py before loading. Note: pip install -U mlx-lm overwrites the patch — re-apply after any upgrade.
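
The index-reuse rule described above can be sketched independently of MLX. This is an illustration of the control flow only, not the patched glm_moe_dsa module: the function names are hypothetical, a plain top-k over given scores stands in for the DSA indexer, and the full/shared layer pattern in the usage note is made up.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens (stand-in for the indexer)."""
    return np.sort(np.argsort(scores)[-k:])

def select_tokens_per_layer(layer_kinds, indexer_scores, k):
    """Return the token selection used by every layer.

    layer_kinds: list of "full" or "shared", one entry per layer.
    indexer_scores: dict mapping full-layer index -> 1-D score array.
    Full layers compute a fresh top-k; shared layers reuse the selection
    of the most recent full layer and need no indexer weights.
    """
    selections = []
    last = None
    for i, kind in enumerate(layer_kinds):
        if kind == "full":
            last = top_k_indices(np.asarray(indexer_scores[i]), k)
        elif last is None:
            raise ValueError("shared layer appears before any full layer")
        selections.append(last)
    return selections
```

For example, with `layer_kinds = ["full", "shared", "shared", "full", "shared"]`, only layers 0 and 3 need entries in `indexer_scores`; layers 1 and 2 reuse layer 0's selection and layer 4 reuses layer 3's. A runtime that expects indexer weights on every layer has nothing to load for the shared ones, which is why the stock model class fails on this bundle.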

Usage

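from jang_tools.load_jangtq import load_jangtq_model as load
from mlx_lm import generate

# Load the quantized bundle and its tokenizer from the Hugging Face repo.
model, tokenizer = load("bearzi/GLM-5.2-JANGTQ_K")

# Render a single-turn chat as prompt text with the model's chat template.
msgs = [{"role": "user", "content": "Write a Python function that reverses a string."}]
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True, tokenize=False)

# verbose=True already streams the text and timing stats to stdout,
# so keep the return value instead of printing it a second time.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)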

License

MIT, inherited from zai-org/GLM-5.2; quantization does not change the upstream terms. MIT requires retaining the copyright and license notice in redistributions.
