boutell • today at 10:47 AM • 3 replies

I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.

Replies

throwawayffffas • today at 7:17 PM

He meant prompt eval time, but have a look at these guys: https://www.youtube.com/watch?v=ndSA9T5yvmM

Over 2500 tokens per second on a single request. With 8 MI300X.

Majromax • today at 12:45 PM

From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.

Hyperscalers can perform this evaluation very quickly because it can be heavily parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Writing each unit of work as (token, layer): (0,0) is processed first, then (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.

The maximum parallel width becomes equal to the number of layers in the model. The Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving up to a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and each pass of the parallel sweep completes one full set of layer outputs in the KV pass).
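To make the frontier concrete, here is a minimal Python sketch of that schedule. It takes the dependency model described above at face value, assumes every (token, layer) cell costs one unit of time, and ignores memory bandwidth and every other real-world limit. The function names `wavefront_schedule` and `ideal_speedup` are made up for illustration, and the 30-layer figure is the one quoted above, not something verified here.

```python
def wavefront_schedule(n_tokens, n_layers):
    """Group (token, layer) cells by the sweep step on which they run.

    Assumption: cell (t, l) runs on step t + l, which is the diagonal
    frontier described in the comment above.
    """
    steps = [[] for _ in range(n_tokens + n_layers - 1)]
    for token in range(n_tokens):
        for layer in range(n_layers):
            steps[token + layer].append((token, layer))
    return steps


def ideal_speedup(n_tokens, n_layers):
    """Serial cell count divided by the number of sweep steps.

    Assumes unit cost per cell and unlimited hardware; a real system
    will fall short of this.
    """
    serial_steps = n_tokens * n_layers
    parallel_steps = n_tokens + n_layers - 1
    return serial_steps / parallel_steps


if __name__ == "__main__":
    # First three sweep steps for a tiny 3-token, 3-layer case.
    for step, cells in enumerate(wavefront_schedule(3, 3)[:3]):
        print(step, cells)

    # Widest frontier for a long prompt on a 30-layer model.
    print(max(len(cells) for cells in wavefront_schedule(1000, 30)))

    # Idealized speedup for a short and a long prompt.
    print(ideal_speedup(7, 30))
    print(ideal_speedup(1000, 30))
```

Under these assumptions the frontier never gets wider than 30 cells for a 30-layer model, and the idealized speedup only approaches 30 when the prompt is much longer than the layer count. A 7-token prompt comes out a little under 6x, since most of the sweep is spent ramping the frontier up and down.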

In the specific output above, however, the input prompt is only seven tokens long, so there are probably considerable non-amortized spinup effects at play.

ekianjo • today at 1:38 PM

I meant prompt eval time.
