What about oMLX?

#10
by OgulcanGungor - opened

Hey! Thanks for your great work. It is pleasure to use but what about omlx quants? Do you have any plan for it?

Thank you for your excellent work, having the same needs that。

Hey, thanks so much, glad you're enjoying it!

Quick update on MLX: I do have a Mac, but it's 32GB, so producing the MLX quant myself will be a little tight. Still
doable though, so no worries there.

First step, I'll push the full-precision safetensors master today or tomorrow. That's what MLX converts from, so once
it's up anyone on a Mac can make an MLX build, and honestly if someone in the community gets to it that'll be the
fastest route. If not, I'll get an MLX quant out myself by Monday.

Either way you'll have one soon. Thanks for the patience!

Any Update about that ?, And also is there is a way to set the jinja chat template so we avoid the native tool calling issue you typed about?.

@Ahmed-mohamed067 You're right to chase me on it, and sorry for the slip β€” I said that week and it ran over. Honest reason: I've been fully heads-down on v3 plus a new open-source collaboration with an AI lab. Here's the concrete plan: I'll push the fp16 safetensors master this weekend (that's what MLX converts from), and I'll reach out to Huihui to help fast-track an MLX build off the back of it. So you should have one soon β€” I'll link it here the moment it's up.

On the jinja template question β€” one important distinction:

  • The native tool-calling "issue" isn't a template bug; it's that some clients (LM Studio's minja, koboldcpp without --jinjatools) don't parse Gemma 4's native tool format, so the control tokens leak as raw text. A different template won't make those clients parse tools β€” for working tool-calls you need a backend that does (llama.cpp llama-server --jinja).
  • But if you just want plain chat without the tool noise, then yes β€” override the prompt template with a minja-safe, chat-only version (no tool scaffolding). It renders cleanly in LM Studio and avoids the leaked tokens; the tradeoff is you give up tool-calling. Happy to drop that chat-only template here β€” want it?

So: chat-only template = clean chat, no tools; llama-server --jinja = full tools. Pick by what you need.

Thank you so much I really appreciate your effort, I'm currently using oMLX and it's the fastest I could get from my device, GGUF work fine with llama.cpp but server is always restarting mid generation forcing the full page to regenerate so it always lose context ( I'm done searching for fix ) so I moved to oMLX , it's fast it's great and it does continue the session till the end , it even gave me a great memory room so I can fit bigger quants, but for all the gemma 4 I sometimes get tool call errors.

Sign up or log in to comment