simonw • yesterday at 11:24 PM • 5 replies

Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s

From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...

Replies

incidentist • today at 4:31 PM

Someone is working on a fork that is optimized for M5, might be worth a look: https://github.com/Swival/ds4-m5

antirez • today at 2:14 PM

Prefill is 400 t/s on that hardware. It's just that if the prompt is very short, you can't see the real speed, and it will default to single-token context processing.

embedding-shape • today at 12:09 AM

Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

xienze • yesterday at 11:34 PM

I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
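
For anyone who wants to check the arithmetic, here is a rough sketch of the wait before the first token at the prefill rates quoted in this thread. It assumes prefill throughput stays constant regardless of prompt length, which is a simplification (antirez notes above that very short prompts understate the real speed). The function name `prefill_seconds` and the labels are made up for illustration.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Estimated wait before the first generated token, in seconds.

    Assumes a constant prefill rate, which is a simplification.
    """
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


# Prefill rates quoted in this thread, in tokens per second.
RATES = {
    "short-prompt run (top comment)": 30.91,
    "RTX Pro 6000 (embedding-shape)": 121.76,
    "longer-prompt prefill (antirez's figure)": 400.0,
}

if __name__ == "__main__":
    for label, tps in RATES.items():
        secs = prefill_seconds(10_000, tps)
        print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

At 30.91 t/s a 10k-token prompt works out to roughly 5.4 minutes, which matches the "5+ minutes" estimate; at 400 t/s the same prompt would take about 25 seconds.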

rtpg • today at 1:51 AM

what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
