zozbot234, yesterday at 11:07 PM

12 tok/s can absolutely be "usable output" depending on what you're doing. I agree, though, that the 27B dense model often feels slow because memory throughput on that particular platform is weak overall. Most real-world 120B models, though, will be MoE-based with only a small fraction of their parameters active per token, and these run quite well. Also, dense models can benefit from batching, which is at least marginally viable with Qwen if you stick to shorter contexts and smaller batches.
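
To put rough numbers on the dense vs. MoE point, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bound, so each decode step has to stream every active weight from memory once. The bandwidth (250 GB/s), the 4-bit quantization (0.5 bytes per parameter), and the 5B active-parameter figure for the MoE model are illustrative assumptions, not measurements of any particular platform or model. KV-cache reads and compute limits are ignored, so treat the results as optimistic upper bounds.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param,
                             bandwidth_gb_s, batch_size=1):
    """Upper-bound decode throughput under a purely memory-bound model.

    active_params_b: parameters read per decode step, in billions
    bytes_per_param: storage per parameter (0.5 is roughly 4-bit quantization)
    bandwidth_gb_s:  memory bandwidth in GB/s (assumed, not measured)
    batch_size:      sequences decoded together; for a dense model the
                     weights are read once per step and shared by the batch
    """
    bytes_per_step = active_params_b * 1e9 * bytes_per_param
    steps_per_second = bandwidth_gb_s * 1e9 / bytes_per_step
    # Aggregate tokens/s across the batch. This ignores KV-cache traffic,
    # which grows with context length and batch size.
    return steps_per_second * batch_size


# All inputs below are hypothetical values chosen for illustration.
BANDWIDTH_GB_S = 250.0

dense_27b = decode_tokens_per_second(27, 0.5, BANDWIDTH_GB_S)
moe_120b_5b_active = decode_tokens_per_second(5, 0.5, BANDWIDTH_GB_S)
dense_27b_batch4 = decode_tokens_per_second(27, 0.5, BANDWIDTH_GB_S, batch_size=4)

print(f"dense 27B, batch 1:       {dense_27b:.1f} tok/s")
print(f"MoE 120B (5B active):     {moe_120b_5b_active:.1f} tok/s")
print(f"dense 27B, batch 4 total: {dense_27b_batch4:.1f} tok/s")
```

The batch scaling here only applies cleanly to the dense case. With MoE, different sequences in a batch can route to different experts, so the weights read per step grow with batch size and the gain is smaller. Longer contexts also add KV-cache reads per sequence, which is one reason batching favors shorter contexts and smaller batches.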