Hacker News

zozbot234 · last Saturday at 6:42 PM

Note that I was only commenting on modern quantized LLMs, which mostly avoid formats like FP16 or INT8 and prefer lower precision wherever feasible. When in-memory model values must be padded up to FP16/INT8, that slashes your effective use of memory bandwidth, which is what determines token generation speed. So the only plausible benefit is really in the prompt pre-processing phase, and even there only in lower power use compared to a GPU, not in higher speed.
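A rough back-of-envelope sketch of why this matters: in the bandwidth-bound decode phase, tokens/sec is roughly usable memory bandwidth divided by the bytes streamed per token (about the in-memory model size), so padding 4-bit weights up to INT8 or FP16 cuts throughput by roughly 2x or 4x. The parameter count and bandwidth figures below are illustrative placeholders, not measurements, and this ignores activations and KV-cache traffic.

    # Back-of-envelope decode speed in the memory-bandwidth-bound regime.
    # All numbers are hypothetical placeholders.

    PARAMS = 27e9       # e.g. a 27B-parameter model (assumed)
    BANDWIDTH = 100e9   # 100 GB/s usable memory bandwidth (assumed)

    def tokens_per_second(bytes_per_param: float) -> float:
        # Weights are streamed once per generated token.
        bytes_per_token = PARAMS * bytes_per_param
        return BANDWIDTH / bytes_per_token

    for label, bpp in [("4-bit quant", 0.5), ("padded to INT8", 1.0), ("padded to FP16", 2.0)]:
        print(f"{label:15s} ~{tokens_per_second(bpp):5.2f} tok/s")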


Replies

kamranjon · last Saturday at 6:56 PM

That's really interesting! I didn't know about that padding behavior. I'm curious which models this would include. I know Gemma 3 raw is bf16 - are you just talking about the quantized versions of these, or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization Aware Training) version of Gemma 3 27B - but that base model was already released.
