
nnx • today at 6:10 AM • 1 reply

> My M5 Pro can generate 130 tok/s (4 streams) on Gemma 4 26B.

This seems high. At which quantization? Using LM Studio or something else?

Note: Darkbloom seems to run everything on Q8 MLX.

pants2 • today at 10:43 AM

Ah, good point. This is using Q4, and the number is total throughput across the streams, benchmarked while serving with llama.cpp.
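
For context, here is a rough sketch of how an aggregate figure like this relates to per-stream speed. It assumes the four streams share the throughput evenly, which the thread does not state, and the function name `per_stream_tok_s` is made up for illustration.

```python
def per_stream_tok_s(total_tok_s: float, streams: int) -> float:
    """Average generation rate per stream.

    Assumption (not from the thread): throughput is split evenly
    across all concurrent streams.
    """
    if streams <= 0:
        raise ValueError("streams must be positive")
    return total_tok_s / streams


# 130 tok/s total over 4 streams, the figures quoted above
print(per_stream_tok_s(130, 4))  # 32.5
```

So under that assumption a single request would see roughly 32.5 tok/s, not 130 tok/s, which is why it matters whether a benchmark reports total or per-stream throughput.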
