Hacker News

mememememememo · yesterday at 8:38 AM · 2 replies

You could time out, and you could trade the two off dynamically.

I.e. you get 3 replies at 80% confidence. You decide that at 80% you are in fairly good shape, but happy to wait up to 5 seconds for completion / 500 ms for time to first token. If either budget is breached, you return the current answer.

But if you are at 5%, you wait 60 s total / 2 s for a first token, since the upside of the model that hasn't spoken yet is much greater.

Basically you're wagering time for quality in a dynamic prediction market sitting in front of the LLM.
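The policy above can be sketched as a small function mapping current confidence to a pair of time budgets. All thresholds and the interpolation tier are illustrative assumptions, not a spec from the comment:

```python
def budgets(confidence: float) -> tuple[float, float]:
    """Return (total_timeout_s, time_to_first_token_s) for a pending reply.

    Low confidence in the answers already in hand means we wait much
    longer, since the expected upside of one more reply is higher.
    """
    if confidence >= 0.8:
        return 5.0, 0.5    # fairly sure already: short budgets
    if confidence >= 0.3:
        return 20.0, 1.0   # middling: an assumed in-between tier
    return 60.0, 2.0       # near-clueless: pay a lot of time for quality
```

If either budget is breached, the caller returns the best answer it has so far.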


Replies

kenmu · yesterday at 8:16 PM

Love your idea. We have timeout mechanisms, and originally we were pretty aggressive with them, cutting off based on both time and response length to balance accuracy and speed. There's research that longer responses tend to be less accurate (when compared to other responses to the same prompt), so we came up with an algorithm that optimized for this very effectively. However, we eventually removed the mechanism to avoid losing any accuracy or comprehensiveness. We have other systems, including confidence scoring, that are pretty effective at judging long responses and weighting them accordingly.

We may reintroduce some of the above with user-configurable levers.
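One way confidence scoring can weight responses is a simple confidence-weighted vote across sampled answers. This is a hypothetical sketch, not kenmu's actual algorithm:

```python
from collections import defaultdict

def weighted_vote(responses: list[tuple[str, float]]) -> str:
    """Pick the answer whose supporting responses carry the most
    total confidence weight (hypothetical scoring scheme)."""
    totals: defaultdict[str, float] = defaultdict(float)
    for answer, confidence in responses:
        totals[answer] += confidence
    return max(totals, key=totals.__getitem__)
```

For example, `weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])` returns `"B"`, since B's combined weight (1.1) beats A's (0.9) even though A's single response scored highest.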

all2 · yesterday at 6:17 PM

If we treat LLM output like a manufacturing yield, then with three 80% probabilities that must all hold, you actually have something like 0.8 × 0.8 × 0.8 → 0.512, or about 51%.
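The yield arithmetic as a one-liner (assuming the three 80% probabilities are independent and all three must succeed):

```python
# Independent 80%-reliable stages compound multiplicatively.
yield_all_three = 0.8 * 0.8 * 0.8
print(round(yield_all_three, 3))  # 0.512, i.e. ~51%
```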
