Hacker News

Surpassing vLLM with a Generated Inference Stack

10 points | by lukebechtel | today at 3:12 PM | 4 comments

Comments

ntonozzi | today at 6:55 PM

Why do they need to run benchmarks to confirm performance? Can't they run example prompts and verify they get exactly the same output token probabilities? The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM.
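The equivalence check described above can be sketched as follows. This is a hedged illustration, not vLLM's actual API: `logprobs` and `outputs_match` are hypothetical helpers, and the logit vectors stand in for per-token outputs you would pull from each engine for the same prompt.

```python
import math

def logprobs(logits):
    """Convert raw logits to log-probabilities via a stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def outputs_match(logits_a, logits_b, atol=1e-5):
    """Check that two engines produce numerically identical token distributions."""
    pa, pb = logprobs(logits_a), logprobs(logits_b)
    return all(abs(a - b) <= atol for a, b in zip(pa, pb))

# Hypothetical per-token logits from two inference stacks for the same prompt
ref = [2.0, 1.0, 0.5]
new = [2.0, 1.0, 0.5]
print(outputs_match(ref, new))  # True when the distributions agree within tolerance
```

In practice some tolerance is needed because different kernel implementations and reduction orders produce slightly different floating-point results even when the stacks are semantically equivalent.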

It is also a bit weird that they are not incorporating speculative decoding, which seems like a critical performance optimization, especially for decode-heavy workloads.
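For context, the core of (greedy) speculative decoding can be sketched with toy stand-in models. Everything here is hypothetical illustration: a cheap draft model proposes k tokens, the target model verifies them and keeps the longest prefix it agrees with, so multiple tokens can be emitted per target-model step.

```python
def speculative_step(draft, target, ctx, k=4):
    """One greedy speculative-decoding step: the draft model proposes k
    tokens; the target model verifies them and keeps the longest agreeing
    prefix, plus its own next token after the first divergence."""
    # Draft phase: cheap model proposes k tokens autoregressively
    proposed, c = [], list(ctx)
    for _ in range(k):
        t = draft(c)
        proposed.append(t)
        c.append(t)
    # Verify phase: accept draft tokens while the target model agrees
    accepted, c = [], list(ctx)
    for t in proposed:
        if target(c) == t:
            accepted.append(t)
            c.append(t)
        else:
            break
    accepted.append(target(c))  # target's own token where they diverge
    return accepted

# Toy "models": target predicts (last + 1) % 10; draft diverges after a 7
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10
print(speculative_step(draft, target, [5]))  # [6, 7, 8]
```

In the example the draft is right twice, wrong once, and the step still nets three tokens from a single round of target-model verification, which is where the decode-heavy speedup comes from.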

rfw300 | today at 6:57 PM

OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.
