Multimodal Support

#16
by WhiteSinner0 - opened

Hello!
Thank you for your hard work. I have really enjoyed using this model in my workflow.

I would like to ask whether there are any plans to add support for multimodal content, or at least image input capabilities, in the future.

Hi @WhiteSinner0 , thanks β€” glad it's been useful in your workflow.

Good news: you don't have to wait for this. The model is built on Gemma 4 12B, which is natively multimodal (image
input). My fine-tune only touched the text/coding side and left the vision tower untouched, so the image-understanding
capability is inherited straight from the base model β€” it's already in the weights.

The only reason image input doesn't work out of the box is that I didn't bundle the multimodal projector (mmproj) in
this repo. You can grab the official one from ggml-org/gemma-4-12B-it-GGUF:

  • mmproj-gemma-4-12B-it-bf16.gguf (175 MB), or
  • mmproj-gemma-4-12B-it-Q8_0.gguf (159 MB)

Then load it alongside whichever v2 quant you're running, via --mmproj:

llama-server -m .gguf --mmproj mmproj-gemma-4-12B-it-bf16.gguf -ngl 99

One honest caveat: I trained on text only, so I didn't specifically tune or benchmark vision β€” what you get is the
base Gemma 4 vision passing through untouched, not something I improved. I haven't run image input against the
fine-tuned weights myself, so if you try it I'd be glad to hear how it holds up.

Multimodal

https://huggingface.co/tepirale/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-assistant-safetensors-yuxinlu1
https://huggingface.co/tepirale/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-safetensors-yuxinlu1

Sign up or log in to comment