Instructions to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest", filename="GLM-5.2-mixed-IQ2_S-IQ4_NL-00001-of-00009.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: llama cli -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: llama cli -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: ./llama-cli -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: ./build/bin/llama-cli -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Use Docker
docker model run hf.co/Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
- LM Studio
- Jan
- Ollama
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Ollama:
ollama run hf.co/Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
- Unsloth Studio
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest to start chatting
- Pi
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Docker Model Runner:
docker model run hf.co/Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
- Lemonade
How to use Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest:IQ4_NL
Run and chat with the model
lemonade run user.GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest-IQ4_NL
List all available models
lemonade list
GLM-5.2 โ Mixed-Precision GGUF (IQ2_S experts ยท IQ4_NL rest)
Mixed-bit GGUF re-quantization of zai-org/GLM-5.2 (256ร22B Mixture-of-Experts,
architecture glm-dsa), produced with llama-quantize's per-tensor mixed-precision
workflow and an importance matrix.
The target of this quant is memory footprint: MoE expert fragments are dropped
to IQ2_S (โ2.64 BPW overall) while the dense/sensitive tensors (attention, norms,
shared components, embeddings) stay at IQ4_NL. The last MoE block (blk.78.*) is
kept at IQ4_NL because llama-quantize does not find separate weights for it (it is
covered by the shared/head tensors) โ see the mapping table below.
Model particulars (from GGUF KV metadata)
| Key | Value |
|---|---|
| Architecture | glm-dsa |
| Name / version | GLM-5.2 / 5.2 |
| Size label | 256x22B (256 experts, 8 active, 1 shared) |
| Block count | 79 (3 leading dense + 76 MoE) |
| Context length | 1,048,576 (1M tokens) |
| Embedding length | 6144 |
| Feed-forward length (dense) | 12288 |
| Expert FF length | 2048 |
| Attention heads / KV heads | 64 / 1 (MLA, q_lora_rank=2048, kv_lora_rank=512, key_length_mla=256, value_length_mla=256) |
| RoPE base / dim | 8,000,000 / 64 |
| Vocabulary | 154,880 (tokenizer glm4 / gpt2) |
| Expert gating | func=2, weights_scale=2.5, weights_norm=true |
| NextN predict layers | 1 |
| License | MIT |
Quantization mapping
Per-tensor type assignment passed to llama_quantize (--tensor-type overrides):
| Tensor pattern | Quant |
|---|---|
blk.78.ffn_down_exps |
IQ4_NL |
blk.78.ffn_gate_exps |
IQ4_NL |
blk.78.ffn_up_exps |
IQ4_NL |
ffn_gate_exps (all other blocks) |
IQ2_S |
ffn_up_exps (all other blocks) |
IQ2_S |
ffn_down_exps (all other blocks) |
IQ2_S |
| everything else (attention, norms, embeddings, shared head, indexer) | IQ4_NL |
- Source GGUF:
unsloth/GLM-5.2-GGUFโ this repo'sUD-IQ4_NLvariant (9 shards, IQ4_NL), re-quantized here withallow-requantize+keep-split. - Importance matrix:
imatrix_unsloth.gguf(sourced from Unsloth). - Final size: โ232 GiB across 9 shards, โ2.64 BPW (input โ3.95 BPW at q8_0).
Files
Filenames include the IQ2_S/IQ4_NL quant tokens so Hugging Face's
quantization-variant scanner recognizes the shards (single-quant names are not
possible for a mixed-precision quant; both constituent quants are listed).
| File | Size |
|---|---|
GLM-5.2-mixed-IQ2_S-IQ4_NL-00001-of-00009.gguf |
9.0 MiB (headers/tokenizer) |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00002-of-00009.gguf |
29.9 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00003-of-00009.gguf |
31.0 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00004-of-00009.gguf |
31.1 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00005-of-00009.gguf |
31.0 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00006-of-00009.gguf |
31.0 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00007-of-00009.gguf |
31.1 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00008-of-00009.gguf |
31.0 GiB |
GLM-5.2-mixed-IQ2_S-IQ4_NL-00009-of-00009.gguf |
16.2 GiB |
Usage
Load with any recent llama.cpp build (and compatible runners โ LM Studio, Ollama,
koboldcpp, etc.) that supports the glm-dsa architecture, MLA attention and IQ2_S /
IQ4_NL dequantization (GPU offload strongly recommended).
llama-server \
-m GLM-5.2-mixed-IQ2_S-IQ4_NL-00001-of-00009.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 -c 8192
The first shard is treated as the entry point;
llama.cppfollows the split-file links to load all 9 shards automatically. Point-mat00001-of-00009.
Provenance
- Base model: zai-org/GLM-5.2 โ MIT.
- Source GGUF quantization: Unsloth (
general.quantized_by = Unsloth,general.repo_url = https://huggingface.co/unsloth). - This mixed-precision re-quant + imatrix: Deviad (2026-06-20),
on Apple M3 Ultra (Metal 4 build of
llama.cpp, build 1 /4b48a53).
Disclaimer
This is an aggressive low-bit quantization intended to fit a very large MoE into constrained memory. Expect measurable quality degradation versus the source, especially because the expert tensors are at IQ2_S. Validate on your own tasks.
- Downloads last month
- 289
4-bit
Model tree for Deviad/GLM-5.2-mixed-IQ2S-experts-IQ4NL-rest
Base model
zai-org/GLM-5.2