Hacker News

freedomben yesterday at 5:15 PM · 4 replies

This is an excellent idea, but I worry about fairness during resource contention. I don't often run queries, but when I do they're often big and long-running. I wouldn't want to eat up the whole system when other users need it, but I'd also want the cluster available when I do need it. How do you address a case like this?


Replies

pokstad yesterday at 6:55 PM

This problem sounds like an excellent opportunity. We need a race to the bottom for hosting LLMs to democratize the tech and lower costs. I cheer on anyone who figures this out.

jrandolf yesterday at 5:36 PM

We implement rate limiting and queuing to ensure fairness, but if a massive number of people submit huge, long-running queries, there will be waits. The question is whether people will actually do that; more often than not, users are idle.
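For what it's worth, a minimal sketch of what per-user fair queuing like this could look like in Python (the FairScheduler class and its methods are illustrative assumptions, not the actual implementation):

    from collections import defaultdict, deque

    class FairScheduler:
        """Round-robin over per-user queues (a simple form of fair
        queuing). A user with a huge backlog only delays their own
        requests; every other user still gets one slot per rotation."""

        def __init__(self):
            self.queues = defaultdict(deque)  # user_id -> pending requests
            self.rotation = deque()           # users with work queued

        def submit(self, user_id, request):
            if not self.queues[user_id]:
                self.rotation.append(user_id)
            self.queues[user_id].append(request)

        def next_request(self):
            if not self.rotation:
                return None  # cluster is idle
            user_id = self.rotation.popleft()
            request = self.queues[user_id].popleft()
            if self.queues[user_id]:
                self.rotation.append(user_id)  # back of the line
            return user_id, request

    # A heavy user can't starve a light one:
    s = FairScheduler()
    for i in range(100):
        s.submit("heavy", f"big-query-{i}")
    s.submit("light", "small-query")
    assert s.next_request()[0] == "heavy"
    assert s.next_request()[0] == "light"  # served after one heavy request

Rate limiting then just caps how fast each per-user queue may drain; the rotation is what keeps one user's backlog from blocking everyone else's.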

zozbot234 yesterday at 9:21 PM

Ultimately, the most sensible way of handling this is "surge pricing": the highest-priority tokens cost extra whenever the inference platform is congested, over and above the base subscription (which could perhaps then be made a bit cheaper).
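A back-of-the-envelope version of that pricing curve (the threshold and multiplier are made-up numbers, just to show the shape):

    def surge_multiplier(utilization, threshold=0.8, max_surge=3.0):
        """Price multiplier for priority tokens as a function of
        cluster utilization (0.0-1.0): flat at 1.0 until the
        congestion threshold, then rising linearly to max_surge."""
        if utilization <= threshold:
            return 1.0
        overload = (utilization - threshold) / (1.0 - threshold)
        return 1.0 + overload * (max_surge - 1.0)

    print(surge_multiplier(0.50))  # 1.0 -> off-peak, base price only
    print(surge_multiplier(0.95))  # 2.5 -> priority tokens cost 2.5x at peak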

cyanydeez yesterday at 8:59 PM

Also, cache eviction during contention will degrade everyone's service.
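To make the concern concrete: if a shared prefix/KV cache has fewer slots than there are users competing for it, the cache thrashes and every request pays the full recompute cost. A toy LRU model (the slot counts are arbitrary):

    from collections import OrderedDict

    class LRUCache:
        """Tiny LRU cache standing in for a shared KV/prefix cache."""
        def __init__(self, slots):
            self.slots, self.data = slots, OrderedDict()
            self.hits = self.misses = 0

        def get(self, key):
            if key in self.data:
                self.data.move_to_end(key)  # refresh recency
                self.hits += 1
                return True
            self.misses += 1
            self.data[key] = True              # recompute, then cache
            if len(self.data) > self.slots:
                self.data.popitem(last=False)  # evict least recently used
            return False

    # 5 users cycling through 4 slots: once the cache is full, each
    # insertion evicts the entry the next user needed, so the hit
    # rate collapses to zero for everyone.
    cache = LRUCache(slots=4)
    for _ in range(100):
        for user in range(5):
            cache.get(f"user-{user}-prefix")
    print(cache.hits, cache.misses)  # 0 500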

I question whether they actually understand LLMs at scale.
