GLM-5.2-JANGTQ_K

JANGTQ (JANG TurboQuant) quantization of zai-org/GLM-5.2 (744B-parameter glm_moe_dsa MoE) for MLX on Apple silicon. TurboQuant applies a random-sign Hadamard rotation, a per-row FP16 norm, and a per-(dim,bits) Lloyd-Max codebook to the routed experts, keeping the backbone at higher precision.
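
The three steps named above (sign-flipped Hadamard rotation, per-row norm, Lloyd-Max codebook) can be illustrated with a toy round trip in plain Python. This is a minimal sketch and not the jang-tools implementation: the helper names, the training distribution used to fit the codebook (Gaussian with variance 1/dim, approximating coordinates of a unit-norm row), and the row size are all assumptions made for illustration.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))


def lloyd_max(samples, bits, iters=20):
    """Fit a 1-D codebook with 2**bits levels by Lloyd's algorithm."""
    levels = 2 ** bits
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the training samples.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            buckets[nearest(codebook, x)].append(x)
        # Move each level to the mean of its bucket; keep empty levels in place.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def quantize_row(row, signs, codebook):
    """Rotate, normalise, and map each coordinate to a codebook index."""
    rotated = fwht([s * w for s, w in zip(signs, row)])
    # Per-row norm; the format described above stores it as FP16.
    norm = math.sqrt(sum(x * x for x in rotated)) or 1.0
    return [nearest(codebook, x / norm) for x in rotated], norm


def dequantize_row(codes, norm, signs, codebook):
    """Invert quantize_row: look up levels, rescale, undo the rotation."""
    rotated = [codebook[c] * norm for c in codes]
    # The orthonormal Hadamard transform is its own inverse.
    return [s * x for s, x in zip(signs, fwht(rotated))]


def rel_error(a, b):
    """Relative L2 error between the original and restored rows."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return num / math.sqrt(sum(x * x for x in a))


rng = random.Random(0)
dim = 64  # toy row length, chosen only to keep the example fast
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
row = [rng.gauss(0.0, 0.02) for _ in range(dim)]
# Assumed training data for the per-(dim, bits) codebook.
train = [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(2000)]

errors = {}
for bits in (2, 4):
    codebook = lloyd_max(train, bits)
    codes, norm = quantize_row(row, signs, codebook)
    errors[bits] = rel_error(row, dequantize_row(codes, norm, signs, codebook))
    print(f"{bits}-bit relative error: {errors[bits]:.3f}")
```

Because the codebook depends only on the row dimension and bit width, one table can serve every row of a matrix, which is why only the indices and the per-row norm need to be stored per row.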

Profile: JANGTQ_K (max-quality mixed precision) — ~260 GB on disk.

Component                               Precision
--------------------------------------  --------------------------------
Routed experts: gate_proj / up_proj     2-bit MXTQ (codebook + Hadamard)
Routed experts: down_proj               4-bit MXTQ (codebook + Hadamard)
Attention (MLA + DSA indexer)           FP16
Shared experts                          FP16
Router / norms                          FP16
Embeddings / LM head                    FP16
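
As a rough worked example of what the routed-expert rows imply, the snippet below computes the average code width per expert weight. It assumes the three projections of an expert hold the same number of weights (typical for a gated MLP, but not confirmed here) and ignores the overhead of per-row norms and codebooks, so treat the result as a lower bound on the real footprint.

```python
# Code width per routed-expert projection, taken from the precision table.
bits_per_weight = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}

# Assumption: all three matrices contain the same number of weights,
# so a plain mean gives the average bits per expert weight.
avg_bits = sum(bits_per_weight.values()) / len(bits_per_weight)

# Size reduction relative to storing the same weights in FP16 (16 bits).
ratio_vs_fp16 = 16 / avg_bits

print(f"average: {avg_bits:.2f} bits/weight, {ratio_vs_fp16:.1f}x smaller than FP16")
```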

The MTP (multi-token-prediction) head is dropped: it serves speculative decoding only and is unused by the MLX single-token decode path.

Requirements

  • ~260 GB of unified memory. This is a whole-machine model and will not load alongside other large jobs. A 512 GB Mac (e.g. M3 Ultra) loads it comfortably.
  • Load with the jang-tools package. Not supported by stock MLX, LM Studio, or Ollama.
  • An IndexShare-patched runtime is mandatory. GLM-5.2 introduces IndexShare: most sparse-attention layers are shared and carry no DSA indexer weights; they reuse the top-k token selections computed by the periodic full layers. Stock mlx_lm's glm_moe_dsa model runs an indexer on every layer, so it cannot load this bundle as-is.
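
A quick preflight check along the lines of the requirements above might look like the following. It is a sketch under assumptions: the 260 figure is read as decimal gigabytes, `SC_PHYS_PAGES` is assumed to be available through `os.sysconf` on the host, and the distribution names passed to `importlib.metadata` (`mlx-lm`, `jang-tools`) are assumed to match what pip installed.

```python
import os
from importlib import metadata

REQUIRED_GB = 260  # approximate figure from the requirements list


def total_memory_bytes():
    """Physical RAM in bytes via POSIX sysconf, or None if it cannot be read."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def fits_in_memory(total_bytes, required_gb=REQUIRED_GB):
    """True when the machine has at least required_gb (decimal GB) of RAM."""
    return total_bytes is not None and total_bytes >= required_gb * 10**9


def installed_version(package):
    """Installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("enough memory:", fits_in_memory(total_memory_bytes()))
    # Distribution names below are assumptions; adjust to your environment.
    for name in ("mlx-lm", "jang-tools"):
        print(name, "->", installed_version(name))
```

Passing the memory check only shows that the machine is large enough in total; it says nothing about how much is free when other jobs are running.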

IndexShare patch

Stock mlx_lm (through 0.31.3) does not implement IndexShare. The patched module ships in the jang-tools runtime: it builds a DSA indexer only on full layers and reuses the most-recent full layer's indices on shared layers (with a matching make_cache). On stock mlx_lm you must apply an equivalent override to mlx_lm/models/glm_moe_dsa.py before loading. Note: pip install -U mlx-lm overwrites the patch — re-apply after any upgrade.
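
The index-reuse behaviour described above can be pictured with a toy model. This is a simplified sketch, not the patched glm_moe_dsa module: the layer pattern, the score values, and the function names are hypothetical, and real layers would select indices per query rather than once per layer.

```python
def top_k_indices(scores, k):
    """Positions of the k highest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens(layer_kinds, indexer_scores, k):
    """Return, for every layer, the token positions it attends to.

    layer_kinds    -- "full" or "shared" for each layer, in order
    indexer_scores -- {layer index: scores}, present for full layers only,
                      mirroring the fact that shared layers carry no indexer
    """
    selected = []
    last_full = None
    for idx, kind in enumerate(layer_kinds):
        if kind == "full":
            # Full layers run their own indexer and refresh the selection.
            last_full = top_k_indices(indexer_scores[idx], k)
        elif last_full is None:
            raise ValueError("a shared layer needs a preceding full layer")
        # Shared layers fall through and reuse the most recent selection.
        selected.append(last_full)
    return selected


# Hypothetical five-layer pattern with indexers on layers 0 and 3 only.
kinds = ["full", "shared", "shared", "full", "shared"]
scores = {0: [0.1, 0.9, 0.3, 0.7], 3: [0.8, 0.2, 0.6, 0.1]}
per_layer = select_tokens(kinds, scores, k=2)
print(per_layer)
```

A model that expects an indexer on every layer would look for score inputs (and weights) on the shared layers too, which is the mismatch that prevents stock mlx_lm from loading this bundle.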

Usage

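from jang_tools.load_jangtq import load_jangtq_model as load
from mlx_lm import generate

# Load the quantized model and its tokenizer.
model, tokenizer = load("bearzi/GLM-5.2-JANGTQ_K")

# Build a chat-formatted prompt string from a single user message.
msgs = [{"role": "user", "content": "Write a Python function that reverses a string."}]
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True, tokenize=False)

# verbose=True already prints the text as it is generated, so the return
# value is kept in a variable instead of being printed a second time.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)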

License

MIT, inherited from zai-org/GLM-5.2; quantization does not change the upstream terms. MIT requires retaining the copyright and license notice in redistributions.
