Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

384 points • by theanonymousone • yesterday at 4:18 PM • 120 comments

Comments

I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )

satvikpendem • yesterday at 5:32 PM

Unsloth has a collection as well [0], along with their results [1]. It looks like they get very close to 100% of the accuracy of the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT release posted in the article.

Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

jhatax • yesterday at 9:50 PM

It’s the Friday before WWDC, where Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but could this be Google releasing the models that Apple will showcase next week?

No knowledge, just speculation.

jbarrow • yesterday at 8:42 PM

Very impressed with how much the Gemma ecosystem has advanced just this week.

Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!

minimaxir • yesterday at 5:14 PM

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models. Q4_0 Gemma 4 12B is listed at 6.7GB, which does indeed fit Google's claim of fitting comfortably within 16GB, although it confirms that only the quantized version will do so.
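A rough sanity check on that 6.7GB figure, as a minimal sketch. It assumes exactly 12 billion parameters, all stored in the GGUF Q4_0 layout (18 bytes per block of 32 weights, i.e. 4.5 bits per weight), and it ignores the KV cache and runtime overhead. Real files may keep some tensors at other precisions, so treat the numbers as ballpark only. The function names are made up for illustration.

```shell
#!/bin/sh
# Back-of-the-envelope weight sizes, in decimal GB.
# Assumption: every weight is stored in the same format.

# q4_0_gb PARAMS_IN_BILLIONS
# GGUF Q4_0 packs 32 weights into 18 bytes:
# 16 bytes of 4-bit values plus one 2-byte fp16 scale.
q4_0_gb() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", p * 18 / 32 }'
}

# bf16_gb PARAMS_IN_BILLIONS
# BF16 stores each weight in 2 bytes.
bf16_gb() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", p * 2 }'
}

echo "12B at Q4_0: $(q4_0_gb 12) GB"
echo "12B at BF16: $(bf16_gb 12) GB"
```

Under those assumptions the Q4_0 weights come to about 6.75 GB versus 24 GB for BF16, which is consistent with only the quantized version fitting in 16GB.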

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported because of insufficient RAM, even on a 16GB machine. Given the expected VRAM usage here, the Q4_0 variant definitely should fit, and Google should fix that.

taffydavid • today at 7:32 AM

Noob q: can advancements like this targeted at local inference have bonus effects for cloud inference? Presumably if you can get great results on cheaper hardware that also equates to less resource usage on cutting edge hardware, and less power draw?

Will advancements like this ultimately reduce the carbon footprint of AI?

arjun-mavonic • today at 4:17 PM

I have yet to try this, but from what I heard from a friend, Gemma 4 12B calls the same tools repeatedly. Maybe the harness can be made to handle it.

RandyOrion • today at 5:41 AM

From the perspective of a local LLM user, I think QAT doesn't solve the major problem of the Gemma models.

The Gemma family (gen 1 to gen 4) consistently shows an extreme range of activations, e.g., 600000, essentially forcing people to use a bf16 KV cache and accept a short context window, e.g., the 31B model with iq4_xs quantization and a 100k context window on 32GB of memory. Alternatively, people use a q8 KV cache with a 200k context window and accept a large performance penalty.

In contrast, for the Qwen 3.5 family, the largest activation is below 2000, making a q8 or even lower-precision KV cache essentially free. Together with linear attention, which doesn't require a KV cache, the full 262k context window can be easily reached.
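To illustrate why the activation range matters for a low-precision cache, here is a simplified sketch of symmetric 8-bit absmax quantization, where the step size is the largest absolute value divided by 127. This is an assumption-laden toy, not how any particular runtime implements its q8 KV cache: real formats quantize in small blocks, so an outlier mainly coarsens its own block. The function names are hypothetical. Separately, 600000 is above the largest finite fp16 value (65504), while bf16 keeps the fp32 exponent range.

```shell
#!/bin/sh
# Toy symmetric 8-bit absmax quantization (levels -127..127).

# int8_step ABSMAX -> size of one quantization step
int8_step() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 127 }'
}

# int8_roundtrip VALUE ABSMAX -> VALUE after quantize and dequantize
int8_roundtrip() {
  awk -v x="$1" -v m="$2" 'BEGIN {
    s = m / 127                                 # step size
    q = int(x / s + (x >= 0 ? 0.5 : -0.5))      # round to nearest level
    printf "%.1f\n", q * s                      # dequantized value
  }'
}

echo "step with absmax 600000: $(int8_step 600000)"
echo "step with absmax 2000:   $(int8_step 2000)"
echo "100 next to a 600000 outlier becomes: $(int8_roundtrip 100 600000)"
echo "100 next to a 2000 maximum becomes:   $(int8_roundtrip 100 2000)"
```

In this toy setup the step is about 4724.4 with a 600000 outlier versus about 15.7 with a maximum of 2000, so a value of 100 collapses to 0.0 in the first case and survives as roughly 94.5 in the second.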

QAT training with a w4a16 target, while improving inference performance with low-precision weights, doesn't solve the KV cache problem at all.

In the end, a qat is a qat, and there are unseen efforts behind qat checkpoints. Thank you gemma team for releasing qat checkpoints.

Catloafdev • yesterday at 8:28 PM

Being able to run the 12B on 8GB of VRAM is huge. It's crazy to see how fast these small local models have evolved.

netdur • yesterday at 5:16 PM

had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn’t fit on my phone TPU, so it swaps to RAM. The QAT version means more accuracy, which is good!

jack_pp • today at 12:09 AM

Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast

WhiteDawn • yesterday at 6:50 PM

Once someone generates an MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5-year-old GPU.

somewhatrandom9 • yesterday at 6:09 PM

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

nicman23 • today at 7:24 AM

The new Gemma 4 12B model replaced Qwen3.6 27B for me. The task I am doing is a bit specific: validating whether a stamp has the correct name. The ones that Qwen could not make out, maybe 30 percent, were easily discerned by the new model.

superkuh • today at 2:11 AM

I wish they would release the base (non instruction tuned) models for use with pattern completion.

cr3cr3 • yesterday at 6:05 PM

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

nazgul17 • today at 12:48 AM

I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?

zkmon • yesterday at 7:24 PM

How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

Kylejeong21 • today at 12:19 AM

google pixel intelligence may beat apple intelligence

redox99 • yesterday at 6:53 PM

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.

Besides, there's no good agent on Android. A model that can't run web searches and browse websites is of limited use, particularly a small model, which really needs to be grounded in search results to be factual because it can't memorize enough.

Edit: I'd like to know what kind of usage the people who seem to disagree and downvoted this are getting out of them.

refulgentis • yesterday at 5:22 PM

@google.com'ers, there are no GGUFs (the blog says there are)

steno132 • yesterday at 8:35 PM

I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them.

I see absolutely no benefit to me as an end user from a local model that is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet access, and if I don't, then not having access to an AI model is the least of my concerns.
