
qeternity | today at 1:17 AM | 0 replies

I would guess you haven't done this in practice. Yes, of course inference is memory-bound at low batch sizes. That is exactly why we run larger batch sizes!

Also, there is no batch size > 1 at which per-request throughput equals bs=1. Doing any batching at all slows every intra-batch request down.
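A rough way to see both effects at once is a roofline-style step-time sketch. All of the numbers below are illustrative assumptions on my part (a ~7B fp16 model, ~1k tokens of KV cache per request, A100-class bandwidth and compute), not measurements, and real serving stacks with paged KV and fused kernels will differ:

```python
# Minimal decode-step model: step latency is whichever of memory traffic or
# compute dominates. Weights are read once per step regardless of batch size,
# but KV-cache reads and FLOPs grow with the batch, so per-request throughput
# falls even as aggregate throughput climbs. Numbers are assumptions.

MEM_BW        = 2.0e12                      # bytes/s (A100-class HBM, assumed)
PEAK_FLOPS    = 312e12                      # fp16 FLOP/s (assumed)
WEIGHT_BYTES  = 7e9 * 2                     # ~7B params in fp16, read each step
KV_BYTES_REQ  = 32 * 2 * 4096 * 2 * 1024    # layers * (K,V) * hidden * fp16 * ~1k ctx
FLOPS_PER_TOK = 2 * 7e9                     # ~2 FLOPs per weight per token

def step_time(batch):
    """Decode-step latency under a simple max(memory, compute) roofline."""
    mem_s  = (WEIGHT_BYTES + batch * KV_BYTES_REQ) / MEM_BW
    comp_s = batch * FLOPS_PER_TOK / PEAK_FLOPS
    return max(mem_s, comp_s)

for b in (1, 4, 16, 64):
    t = step_time(b)
    print(f"bs={b:>3}  per-request {1/t:7.1f} tok/s   aggregate {b/t:8.1f} tok/s")
```

Even in this toy model, aggregate tokens/s keeps rising with batch size while per-request tokens/s drops monotonically, which is the throughput/latency trade batching buys you.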