Hacker News

freedomben yesterday at 5:15 PM · 4 replies

This is an excellent idea, but I worry about fairness during resource contention. I don't often run queries, but when I do they're often big and long-running. I wouldn't want to eat up the whole system when other users need it, but I'd also want the cluster available when I do need it. How do you address a case like this?


Replies

pokstad yesterday at 6:55 PM

This problem sounds like an excellent opportunity. We need a race to the bottom for hosting LLMs to democratize the tech and lower costs. I cheer on anyone who figures this out.

jrandolf yesterday at 5:36 PM

We implement rate limiting and queuing to ensure fairness, but if a massive number of people submit huge, long-running queries, there will be waits. The question is whether people will actually do that; more often than not, users are idle.
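For what it's worth, a minimal sketch of what per-user fair queuing like this could look like in Python (the FairScheduler class and its methods are illustrative assumptions, not the actual implementation):

    from collections import defaultdict, deque

    class FairScheduler:
        """Round-robin over per-user queues (a simple form of fair
        queuing). A user with a huge backlog only delays their own
        requests; every other user still gets one slot per rotation."""

        def __init__(self):
            self.queues = defaultdict(deque)  # user_id -> pending requests
            self.rotation = deque()           # users with work queued

        def submit(self, user_id, request):
            if not self.queues[user_id]:
                self.rotation.append(user_id)
            self.queues[user_id].append(request)

        def next_request(self):
            if not self.rotation:
                return None  # cluster is idle
            user_id = self.rotation.popleft()
            request = self.queues[user_id].popleft()
            if self.queues[user_id]:
                self.rotation.append(user_id)  # back of the line
            return user_id, request

    # A heavy user can't starve a light one:
    s = FairScheduler()
    for i in range(100):
        s.submit("heavy", f"big-query-{i}")
    s.submit("light", "small-query")
    assert s.next_request()[0] == "heavy"
    assert s.next_request()[0] == "light"  # served after one heavy request

Rate limiting then just caps how fast each per-user queue may drain; the rotation is what keeps one user's backlog from blocking everyone else's.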

zozbot234 yesterday at 9:21 PM

Ultimately, the most sensible way of handling this is "surge pricing": the highest-priority tokens cost extra whenever the inference platform is congested, over and above the base subscription (which could perhaps then be made a bit cheaper).
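A back-of-the-envelope version of that pricing curve (the threshold and multiplier are made-up numbers, just to show the shape):

    def surge_multiplier(utilization, threshold=0.8, max_surge=3.0):
        """Price multiplier for priority tokens as a function of
        cluster utilization (0.0-1.0): flat at 1.0 until the
        congestion threshold, then rising linearly to max_surge."""
        if utilization <= threshold:
            return 1.0
        overload = (utilization - threshold) / (1.0 - threshold)
        return 1.0 + overload * (max_surge - 1.0)

    print(surge_multiplier(0.50))  # 1.0 -> off-peak, base price only
    print(surge_multiplier(0.95))  # 2.5 -> priority tokens cost 2.5x at peak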

cyanydeez yesterday at 8:59 PM

Also, cache eviction during contention will degrade everyone's service.
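To make the concern concrete: if a shared prefix/KV cache has fewer slots than there are users competing for it, the cache thrashes and every request pays the full recompute cost. A toy LRU model (the slot counts are arbitrary):

    from collections import OrderedDict

    class LRUCache:
        """Tiny LRU cache standing in for a shared KV/prefix cache."""
        def __init__(self, slots):
            self.slots, self.data = slots, OrderedDict()
            self.hits = self.misses = 0

        def get(self, key):
            if key in self.data:
                self.data.move_to_end(key)  # refresh recency
                self.hits += 1
                return True
            self.misses += 1
            self.data[key] = True              # recompute, then cache
            if len(self.data) > self.slots:
                self.data.popitem(last=False)  # evict least recently used
            return False

    # 5 users cycling through 4 slots: once the cache is full, each
    # insertion evicts the entry the next user needed, so the hit
    # rate collapses to zero for everyone.
    cache = LRUCache(slots=4)
    for _ in range(100):
        for user in range(5):
            cache.get(f"user-{user}-prefix")
    print(cache.hits, cache.misses)  # 0 500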

I question whether they actually understand LLMs at scale.
