Hacker News

zozbot234 · yesterday at 11:59 PM

You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelization it can be almost entirely compute-limited, at least for small contexts (apparently up to ~10GB per request, but that's for a 1M-token context!)
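A rough back-of-envelope sketch of why this holds: during decoding, each layer's weight matrix is read from memory once but reused for every request in the batch, so arithmetic intensity (FLOPs per byte of weight traffic) grows linearly with batch size. The dimensions below are hypothetical, and the sketch ignores activation and KV-cache traffic (the ~10GB-per-request figure presumably refers to KV-cache memory, which is what eventually caps how far this scales).

```python
# Back-of-envelope: why batching shifts LLM decoding from
# memory-bandwidth-bound toward compute-bound.
# Illustrative numbers only: one fp16 matmul, hypothetical dimensions,
# activation and KV-cache traffic ignored.

D_IN, D_OUT = 8192, 8192      # hypothetical layer dimensions
BYTES_PER_PARAM = 2           # fp16

weight_bytes = D_IN * D_OUT * BYTES_PER_PARAM

for batch in (1, 8, 64, 512):
    # Each decoded token does a [batch, D_IN] x [D_IN, D_OUT] matmul:
    flops = 2 * batch * D_IN * D_OUT
    # Weights are fetched from memory once and shared across the whole
    # batch, so weight traffic stays constant while FLOPs scale with batch.
    intensity = flops / weight_bytes
    print(f"batch={batch:4d}  arithmetic intensity = {intensity:.0f} FLOP/byte")
```

At batch 1, every weight byte fetched buys only ~1 FLOP, so the GPU sits waiting on memory; at batch 512, the same byte buys ~512 FLOPs, which is why aggregate throughput can approach the compute limit even though each individual stream stays slow.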


Replies

Havoc · today at 1:57 PM

For offline work that's fine, I guess, but batched or not, <1 tok/s is largely unusable for most use cases

happyPersonR · today at 1:27 PM

Yes, I think what this demonstrates, and what folks are missing, is that optimizing for specific scenarios is now quite possible.