goldenarm, today at 8:59 AM

Consumer and server hardware are quite different, especially when the server hardware is Google's TPUs. The models served on them notably have much larger mixture-of-experts ratios and more complex caching systems. At that scale and with those inference budgets, Google is incentivised to optimize as much as possible.

Also, Google DeepMind has a six-month embargo on strategic papers, so I bet the juiciest quantization tech isn't public yet.