Hacker News

ai_fry_ur_brain, yesterday at 9:15 PM

From what I know about batch processing and concurrency in inference, this is a pipe dream... or it's going to cost an arm and a leg. I think they're lying, or it's going to be a much smaller model and not "frontier".


Replies

kolinko, today at 7:20 AM

You have speculative decoding, which easily increases speed 2-4 times with no loss of quality, and of course MoA architectures, which speed up inference 10 times or more, although with some quality loss.

Add better hardware and other techniques on top of that, and you speed things up even further.
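
For anyone wondering why speculative decoding can be lossless, here is a rough toy sketch of the greedy variant. Everything in it is made up for illustration: `target_next` and `draft_next` are hypothetical stand-ins for a large and a small model, and the "forward pass" count is only simulated. The idea is that the cheap draft proposes `k` tokens, the target checks them in what would be one batched pass, and the first mismatch is replaced by the target's own token, so the result matches plain greedy decoding.

```python
def target_next(seq):
    # Hypothetical "large" model: deterministic greedy next token (toy arithmetic).
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    # Hypothetical "small" model: agrees with the target except at every 4th position.
    t = target_next(seq)
    return (t + 1) % 50 if len(seq) % 4 == 0 else t


def greedy_decode(prompt, n_new):
    # Baseline: one target call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0  # simulated count of batched target forward passes
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target verifies all proposals; a real system would do this
        #    in a single batched pass, which is what we count here.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # Every draft token matched: the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], 20)
    print(out == greedy_decode([1, 2, 3], 20), passes)
```

The speedup in this sketch comes entirely from how often the draft agrees with the target; with a poor draft model it degrades to one token per pass, and the sampling (non-greedy) variant needs a rejection-sampling step that is not shown here.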