Hacker News

kolinko, yesterday at 7:20 AM

You have speculative decoding, which easily increases speed 2-4 times with no loss of quality, and of course MoE (mixture-of-experts) architectures, which speed up inference 10 times or more, although with some quality loss.
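To make the "no loss of quality" part concrete, here is a rough sketch of greedy speculative decoding with toy models. Everything in it is an assumption for illustration: `target_model` and `draft_model` are made-up deterministic functions standing in for a large and a small LLM, and `k` is a hypothetical draft length. The point is only that the output matches plain greedy decoding with the target model, while the target is consulted in fewer verification passes.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large, slow model: a deterministic toy rule.
    return (sum(prefix) * 31 + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small, fast model: agrees with the target
    # except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 10


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system does this
        #    in one batched forward pass, so we count it as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)        # draft was right, keep it
            else:
                accepted.append(expected)   # first mismatch: take the target's token
                break
        # Simplification: real implementations also take one extra "bonus" token
        # from the target when all k drafts are accepted; omitted here.
        out.extend(accepted)
    return out, target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_decode(target_model, draft_model, prompt, 20)
    print(spec == greedy_decode(target_model, prompt, 20), passes)
```

The actual speedup depends on how often the draft agrees with the target and how cheap the draft is, so the toy pass count says nothing about real-world numbers. Sampling-based variants use a probabilistic accept/reject rule instead of the exact-match check shown here.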

Add better hardware and other techniques on top of that, and you speed up even further.