Hacker News

zozbot234 · last Saturday at 6:42 PM

Note that I was only commenting on modern quantized LLMs, which mostly avoid formats like FP16 or INT8 and prefer lower precision wherever feasible. When in-memory model values must be padded up to FP16/INT8, that slashes your effective use of memory bandwidth, which is what determines token generation speed. So the only plausible benefit is really in the prompt pre-processing phase, and even there only in lower power use compared to a GPU, not in higher speed.
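A rough back-of-envelope sketch of why this matters: in the bandwidth-bound decode phase, tokens/sec is roughly usable memory bandwidth divided by the bytes streamed per token (about the in-memory model size), so padding 4-bit weights up to INT8 or FP16 cuts throughput by roughly 2x or 4x. The parameter count and bandwidth figures below are illustrative placeholders, not measurements, and this ignores activations and KV-cache traffic.

    # Back-of-envelope decode speed in the memory-bandwidth-bound regime.
    # All numbers are hypothetical placeholders.

    PARAMS = 27e9       # e.g. a 27B-parameter model (assumed)
    BANDWIDTH = 100e9   # 100 GB/s usable memory bandwidth (assumed)

    def tokens_per_second(bytes_per_param: float) -> float:
        # Weights are streamed once per generated token.
        bytes_per_token = PARAMS * bytes_per_param
        return BANDWIDTH / bytes_per_token

    for label, bpp in [("4-bit quant", 0.5), ("padded to INT8", 1.0), ("padded to FP16", 2.0)]:
        print(f"{label:15s} ~{tokens_per_second(bpp):5.2f} tok/s")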


Replies

kamranjon · last Saturday at 6:56 PM

That's really interesting! I didn't know about that padding behavior. I'm curious which models this would include. I know Gemma 3 raw is bf16 - are you just talking about the quantized versions of these, or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization Aware Training) version of Gemma 3 27B - but that base model was already released.
