
sothatsit · yesterday at 9:01 AM

Dylan Patel did an analysis suggesting that lower batch sizes plus more speculative decoding lead to roughly 2.5x more per-user throughput at about 6x the cost for open models [0]. Seems plausible this could be what they are doing. We probably won't get to know for sure any time soon.
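For intuition, here's a back-of-envelope sketch of that trade-off. All constants and the shape of the model are made-up assumptions on my part, not numbers from [0]: decoding is treated as memory-bandwidth-bound, so each step streams the weights once plus one KV cache per sequence, and speculative decoding is folded in as a flat per-user speedup.

```python
# Toy model of the batch-size / speculative-decoding trade-off.
# All constants below are illustrative assumptions, not from the linked analysis.

NODE_COST_PER_HOUR = 20.0   # assumed $/hour for the serving node
WEIGHT_READ_MS = 10.0       # assumed time per decode step to stream the weights
KV_READ_MS_PER_SEQ = 0.5    # assumed extra per-sequence KV-cache read time per step

def per_user_stats(batch_size: int, spec_speedup: float = 1.0):
    """Per-user tokens/s and $ per million tokens for one decode configuration."""
    # Each decode step reads the weights once plus one KV cache per sequence,
    # so step time grows only slowly with batch size.
    step_ms = WEIGHT_READ_MS + KV_READ_MS_PER_SEQ * batch_size
    # Speculative decoding accepts several tokens per step; modeled here as a
    # flat multiplier, assuming the draft/verify work fits in the compute that
    # sits idle at small batch sizes.
    per_user_tps = (1000.0 / step_ms) * spec_speedup
    node_tps = per_user_tps * batch_size
    # Fixed hardware cost spread over however many tokens the node emits.
    cost_per_mtok = NODE_COST_PER_HOUR / (node_tps * 3600) * 1e6
    return per_user_tps, cost_per_mtok

for batch, speedup in [(64, 1.0), (8, 2.0)]:
    tps, cost = per_user_stats(batch, speedup)
    print(f"batch={batch:>2}, spec x{speedup}: {tps:6.1f} tok/s per user, "
          f"${cost:5.2f} per 1M tokens")
```

The direction matches the claim: shrinking the batch and leaning on speculation makes each user much faster while the per-token cost rises, because the same fixed hardware is shared by fewer streams. The toy ignores the compute wasted on rejected drafts and the draft model itself, which is part of why the real cost multiple in [0] is larger than what this sketch produces.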

Regardless, they don't need to be using new hardware to get speedups like this. It's possible you just hit A/B testing and not newer hardware. I'd be surprised if they were using their latest hardware for inference tbh.

[0] https://nitter.net/dylan522p/status/2020302299827171430