Instructions to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced",
    filename="Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf",
)

# Note: this loads the text weights only. Image input likely also needs the
# mmproj file attached through a multimodal chat handler.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Use Docker
docker model run hf.co/HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
- Ollama
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Ollama:
ollama run hf.co/HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
- Unsloth Studio
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced to start chatting
- Pi
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Run Hermes
hermes
- Atomic Chat
- Docker Model Runner
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Docker Model Runner:
docker model run hf.co/HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
- Lemonade
How to use HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced:Q4_K_M
Run and chat with the model
lemonade run user.Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-Q4_K_M
List all available models
lemonade list
Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced
Join the Discord for updates, roadmaps, projects, or just to chat.
Gemma4-12B (QAT) uncensored by HauhauCS. 0/465 Refusals*
About
No changes to datasets or capabilities — fully functional, 100% of what the original authors intended, just without the refusals. Built from the official QAT weights, so the 4-bit quant stays close to full-precision quality.
Balanced
The Balanced variant (recommended; 99%+ of users will be happy here) uses optimized full uncensoring tuned especially for agentic coding, reasoning, creative writing and reliability-critical tasks. It reasons before answering and stays dependable and on-instruction. Based on current testing, an Aggressive variant (for cases where Balanced still deflects too much) is not needed.
~60% faster with MTP
Ships with an MTP (multi-token prediction) draft head for speculative decoding: roughly 60% faster generation with no change in output quality. The main model verifies every drafted token, so the draft head only affects speed. This release is tuned to pair well with the included MTP head.
llama.cpp:
llama-server \
-m Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf \
-md mtp-gemma-4-12B-it.gguf --spec-type draft-mtp \
-ngl 99 -fa on
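The following minimal Python sketch of greedy draft-then-verify decoding shows why the drafter changes speed but not output. The draft proposes `k` tokens. The main model keeps the longest prefix it agrees with, plus its own token at the first mismatch, so every kept token is one the main model would have chosen anyway. The toy functions `toy_target` and `toy_draft` and the draft length `k` are hypothetical stand-ins. This is not llama.cpp's implementation, and a real MTP head works from the main model's internal state rather than being a separate function.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one main-model step per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. Draft k tokens cheaply, feeding each proposal back into the draft.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify. A real runtime scores all k positions in one batched pass;
        #    here we check them one by one for clarity.
        for tok in proposal:
            if len(tokens) >= goal:
                break
            expected = target(tokens)
            tokens.append(expected)  # the target's choice is always what is kept
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return tokens


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(ts: List[int]) -> int:
    return (sum(ts) * 7 + 3) % 50


def toy_draft(ts: List[int]) -> int:
    guess = toy_target(ts)
    return guess if len(ts) % 5 else (guess + 1) % 50  # wrong at every 5th position


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
fast = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(fast == baseline)  # True
```

In a real runtime, the gain comes from checking all `k` drafted positions in a single batched forward pass of the main model instead of `k` sequential ones. When sampling rather than greedy decoding is used, standard speculative sampling replaces the equality check with a probabilistic accept/reject rule that is designed to preserve the main model's output distribution.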
Note: so far I have only tested the MTP speedup with llama.cpp (llama-server / llama-cli).
Downloads
| File | Type | Size |
|---|---|---|
| Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf | Q4_K_M (text) | 6.9 GB |
| mmproj-Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-BF16.gguf | mmproj (vision) | 168 MB |
| mtp-gemma-4-12B-it.gguf | MTP speculative drafter | 242 MB |
Why only Q4_K_M? Gemma 4 is quantization-aware-trained for ~4-bit, so Q4_K_M is the sweet spot: higher-precision quants add size with no real quality gain. It has been carefully quantized for the best quality at 4-bit.
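A GGUF file takes roughly `parameters × bits-per-weight / 8` bytes, which gives a quick sanity check on the download sizes. The sketch below assumes exactly 12e9 parameters and 1 GB = 1e9 bytes, and ignores metadata overhead; all three are simplifications. It backs out the average bits per weight implied by the 6.9 GB Q4_K_M file, then compares that with Q8_0 (8.5 bits per weight) and BF16 (16 bits per weight).

```python
# Rule of thumb: GGUF size in bytes ~ n_params * bits_per_weight / 8.
# Assumptions (simplifications, not model-card facts): exactly 12e9 parameters,
# 1 GB = 1e9 bytes, and metadata/tokenizer overhead ignored.
N_PARAMS = 12e9


def size_gb(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """Approximate file size in GB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bpw(file_gb: float, n_params: float = N_PARAMS) -> float:
    """Average bits per weight implied by an observed file size in GB."""
    return file_gb * 1e9 * 8 / n_params


q4_bpw = implied_bpw(6.9)  # Q4_K_M size from the Downloads table

for name, bpw in [("Q4_K_M (implied)", q4_bpw), ("Q8_0", 8.5), ("BF16", 16.0)]:
    print(f"{name:<17} {bpw:5.2f} bits/weight -> ~{size_gb(bpw):5.1f} GB")
```

Under these assumptions, Q4_K_M works out to about 4.6 bits per weight. Q8_0 would nearly double the download to roughly 12.75 GB, and BF16 would reach roughly 24 GB.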
Vision
Load the mmproj alongside the model for image input:
llama-server -m Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf \
--mmproj mmproj-Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced-BF16.gguf -ngl 99 -fa on
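Once the server is running with the mmproj loaded, you can send images through its OpenAI-compatible chat endpoint. The Python sketch below uses only the standard library. It assumes llama-server's default address `http://localhost:8080`, and that the endpoint accepts `image_url` content parts carrying a base64 `data:` URL. The helper names `build_vision_request` and `describe_image` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumptions: llama-server runs locally on its default port with the mmproj
# loaded, and accepts OpenAI-style image_url parts holding a base64 data URL.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat request with one text part and one inline image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }


def describe_image(path: str, prompt: str = "Describe this image in one sentence.") -> str:
    """Send a local image file to the server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_vision_request(f.read(), prompt)
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Usage: `print(describe_image("photo.jpg"))`, where `photo.jpg` is any local image. Keeping payload construction separate from the HTTP call makes it easy to inspect the request before sending it.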
Recommended sampling
These are dialed in specifically for this HauhauCS build — use them for the intended behaviour and quality:
- temperature 0.6
- top_k 64
- top_p 0.9
- min_p 0.05
- repeat_penalty 1.1
This release is tuned end-to-end as its own thing; the settings above are part of that and aren't the stock Gemma defaults.
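With llama.cpp, the recommended values map onto command-line flags. The Python sketch below builds the argument list for `llama-server` or `llama-cli`. The flag names are llama.cpp's standard sampling options, while the helper `llama_cpp_args` is hypothetical. The same key names (`temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`) can also be merged into a llama-server JSON request body, though other runtimes may spell them differently.

```python
from typing import Dict, List

# Recommended sampling for this build, taken from the list above.
RECOMMENDED: Dict[str, float] = {
    "temperature": 0.6,
    "top_k": 64,
    "top_p": 0.9,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
}

# llama.cpp command-line flag for each setting.
LLAMA_CPP_FLAGS: Dict[str, str] = {
    "temperature": "--temp",
    "top_k": "--top-k",
    "top_p": "--top-p",
    "min_p": "--min-p",
    "repeat_penalty": "--repeat-penalty",
}


def llama_cpp_args(settings: Dict[str, float] = RECOMMENDED) -> List[str]:
    """Turn sampling settings into llama-server / llama-cli arguments."""
    args: List[str] = []
    for key, value in settings.items():
        args += [LLAMA_CPP_FLAGS[key], str(value)]
    return args


print(" ".join(llama_cpp_args()))
# --temp 0.6 --top-k 64 --top-p 0.9 --min-p 0.05 --repeat-penalty 1.1
```

Append the printed flags to the llama-server commands in the MTP or Vision section to run with these settings.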
Specs
- 12B dense · 256K (262144) context
- Vision (image input) via mmproj
- Based on Gemma 4 12B by Google DeepMind
Compatibility
- Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF runtimes.
- Multi-GPU + LM Studio: I've personally noticed that Gemma 4 can crash under LM Studio's tensor-split mode. For this model, avoid tensor split and use a single GPU or a layer-split / priority-order setup instead.
Acknowledgements
- Google DeepMind — Gemma 4.
- The included mtp-gemma-4-12B-it.gguf speculative draft head comes from Unsloth's Gemma 4 release; many thanks to the Unsloth team for it.
* Tested with both automated and manual refusal benchmarks; no refusals have been found in standard use. A small number of edge-case prompts deflect on the first ask but comply on a re-ask or with strategic framing. If you hit one that actually obstructs your use case, join the Discord and flag it so I can work on it in a future revision.