Hacker News

zozbot234 · yesterday at 11:59 PM

You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelization it can be almost entirely compute-limited, at least for small contexts (apparently up to ~10GB per request, but that's for a 1M-token context!)
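A rough back-of-envelope sketch of why this holds: during decoding, each layer's weight matrix is read from memory once but reused for every request in the batch, so arithmetic intensity (FLOPs per byte of weight traffic) grows linearly with batch size. The dimensions below are hypothetical, and the sketch ignores activation and KV-cache traffic (the ~10GB-per-request figure presumably refers to KV-cache memory, which is what eventually caps how far this scales).

```python
# Back-of-envelope: why batching shifts LLM decoding from
# memory-bandwidth-bound toward compute-bound.
# Illustrative numbers only: one fp16 matmul, hypothetical dimensions,
# activation and KV-cache traffic ignored.

D_IN, D_OUT = 8192, 8192      # hypothetical layer dimensions
BYTES_PER_PARAM = 2           # fp16

weight_bytes = D_IN * D_OUT * BYTES_PER_PARAM

for batch in (1, 8, 64, 512):
    # Each decoded token does a [batch, D_IN] x [D_IN, D_OUT] matmul:
    flops = 2 * batch * D_IN * D_OUT
    # Weights are fetched from memory once and shared across the whole
    # batch, so weight traffic stays constant while FLOPs scale with batch.
    intensity = flops / weight_bytes
    print(f"batch={batch:4d}  arithmetic intensity = {intensity:.0f} FLOP/byte")
```

At batch 1, every weight byte fetched buys only ~1 FLOP, so the GPU sits waiting on memory; at batch 512, the same byte buys ~512 FLOPs, which is why aggregate throughput can approach the compute limit even though each individual stream stays slow.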


Replies

Havoc · today at 1:57 PM

For offline work that's fine, I guess, but batched or not, <1 tok/s is largely unusable for most use cases

happyPersonR · today at 1:27 PM

Yes, I think what this demonstrates, and what folks are missing, is that optimizing for specific scenarios is now quite possible.