Hacker News

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

384 points by theanonymousone yesterday at 4:18 PM | 120 comments

Comments

simonw yesterday at 6:38 PM

I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that, it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file, so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
satvikpendem yesterday at 5:32 PM

Unsloth has a collection as well [0], along with their results [1]. It looks like they get very close to 100% of the accuracy of the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT release posted in the article.

Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

jhatax yesterday at 9:50 PM

It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple?

No knowledge, just speculation.

jbarrow yesterday at 8:42 PM

Very impressed with how much the Gemma ecosystem has advanced just this week.

Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!

minimaxir yesterday at 5:14 PM

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B at 6.7GB. That does support Google's claim of fitting comfortably within 16GB, although it confirms that only the quantized version will do so.

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
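
As a rough sanity check on the 6.7GB figure, here is a back-of-the-envelope sketch in Python. It assumes the GGUF Q4_0 layout (blocks of 32 weights, each stored as one 16-bit scale plus 32 four-bit values, so 18 bytes per block) and a round 12 billion parameters all stored in that format. Both the parameter count and the uniform format are simplifying assumptions; real files usually keep some tensors at higher precision.

```python
# Q4_0 layout in GGUF: blocks of 32 weights, each stored as
# one fp16 scale (2 bytes) plus 32 four-bit values (16 bytes).
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 2 + BLOCK_WEIGHTS // 2  # 18 bytes per block


def q4_0_bits_per_weight() -> float:
    """Effective storage cost of one weight, including the shared scale."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: exactly 12e9 parameters, every tensor stored as Q4_0.
N_PARAMS = 12e9

print("bits per weight:", q4_0_bits_per_weight())
print("Q4_0 estimate (GB):", estimate_size_gb(N_PARAMS, q4_0_bits_per_weight()))
print("16-bit estimate (GB):", estimate_size_gb(N_PARAMS, 16))
```

Under these assumptions the estimate comes to about 6.75GB for Q4_0 versus 24GB for 16-bit weights, which is in line with the quoted figure. The KV cache and runtime buffers come on top of the weights.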

taffydavid today at 7:32 AM

Noob q: can advancements like this targeted at local inference have bonus effects for cloud inference? Presumably if you can get great results on cheaper hardware that also equates to less resource usage on cutting edge hardware, and less power draw?

Will advancements like this ultimately reduce the carbon footprint of AI?

arjun-mavonic today at 4:17 PM

I have yet to try this, but from what I heard from a friend, Gemma 4 12B calls the same tools repeatedly. Maybe the harness can be made to handle it.

RandyOrion today at 5:41 AM

From the perspective of a local LLM user, I think QAT doesn't solve the major problem of the Gemma models.

The Gemma family (gen 1 to gen 4) consistently shows an extreme range of activations (values around 600000), which essentially forces people to use a BF16 KV cache and accept a short context window, e.g., the 31B model with IQ4_XS quantization and a 100k context window in 32GB of memory. Alternatively, people use a Q8 KV cache with a 200k context window and accept a large performance penalty.

In contrast, for the Qwen 3.5 family, the largest activation is below 2000, making a Q8 or even lower-precision KV cache essentially free. Together with linear attention, which doesn't require a KV cache, the full 262k context window can be easily reached.

QAT with a W4A16 target, while improving performance on inference with low-precision weights, doesn't solve the KV cache problem at all.
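
One way to see why a very wide value range hurts low-precision caches is a toy quantization experiment. The sketch below is an illustration under stated assumptions, not a measurement of either model: it uses symmetric int8 quantization with one shared scale per block of 32 values, and the block contents are hypothetical (31 ordinary values between -450 and 450 plus a single outlier of either 2000 or 600000, the magnitudes mentioned above).

```python
def quantize_block_int8(values):
    """Symmetric int8 quantization with one shared scale per block."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Map the integer codes back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(values):
    """Largest round-trip error over all values in the block."""
    quantized, scale = quantize_block_int8(values)
    restored = dequantize_block(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical block: 31 ordinary values from -450 to 450 in steps of 30.
ordinary = [(i - 15) * 30.0 for i in range(31)]

small_outlier_block = ordinary + [2000.0]
large_outlier_block = ordinary + [600000.0]

for name, block in [("outlier 2000", small_outlier_block),
                    ("outlier 600000", large_outlier_block)]:
    _, scale = quantize_block_int8(block)
    print(name, "step size:", scale, "max error:", max_abs_error(block))
```

With the 2000 outlier the step size is about 15.7, so the ordinary values come back with an error under 8. With the 600000 outlier the step size is about 4724, every ordinary value rounds to zero, and the worst error is 450. Real caches differ in detail, but the shared-scale effect is the same: one huge value sets the step size for everything stored alongside it.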

In the end, a QAT is a QAT, and there is unseen effort behind QAT checkpoints. Thank you, Gemma team, for releasing QAT checkpoints.

Catloafdev yesterday at 8:28 PM

Being able to run the 12B on 8GB of VRAM is huge. It's crazy to see how fast these small local models have evolved.

netdur yesterday at 5:16 PM

Had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn’t fit on my phone TPU, so it swaps to RAM. The QAT version means more accuracy, which is good!

jack_pp today at 12:09 AM

Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on a laptop with an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), and 14 GB RAM, and it is surprisingly fast.

WhiteDawn yesterday at 6:50 PM

Once someone generates an MTP layer for 26B A4B 4 QAT, I'll be singing from the hills with my 5-year-old GPU.

somewhatrandom9 yesterday at 6:09 PM

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

nicman23 today at 7:24 AM

The new Gemma 4 12B model replaced Qwen3.6 27B for me. The task I am doing is a bit specific: validating whether a stamp has the correct name. The ones that Qwen could not read (maybe 30 percent) were easily discerned by the new model.

superkuh today at 2:11 AM

I wish they would release the base (non instruction tuned) models for use with pattern completion.

cr3cr3 yesterday at 6:05 PM

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

nazgul17 today at 12:48 AM

I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?

zkmon yesterday at 7:24 PM

How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

Kylejeong21 today at 12:19 AM

Google Pixel intelligence may beat Apple Intelligence.

redox99 yesterday at 6:53 PM

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.

Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.

Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

refulgentis yesterday at 5:22 PM

@google.com'ers, there are no GGUFs (the blog says there are).

comparedge yesterday at 6:21 PM

[flagged]

Pixel-Labs yesterday at 6:29 PM

[flagged]

spacebacon yesterday at 6:49 PM

[flagged]

steno132 yesterday at 8:35 PM

I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them.

I see absolutely no benefit to me as an end user from a local model that is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet access, and if I don't, then not having access to an AI model is the least of my concerns.
