Hacker News

Tomjosetj31 | today at 6:50 PM | 1 reply

Impressive result on HLE if the methodology holds up. One thing I'd want to understand better: how much of the gain comes from the entropy weighting specifically vs. simply having more compute via parallel inference? Would be curious to see an ablation — same models, same budget, but with naive majority voting instead. That would isolate the actual contribution of your confidence-weighting approach.
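To make the ablation concrete, here's a minimal sketch of the two aggregation rules I have in mind. The entropy weighting is my guess at the idea (weight each model's answer by one minus the normalized entropy of its answer distribution), not necessarily what the authors actually do, and the interfaces are hypothetical:

```python
import math
from collections import Counter, defaultdict

def majority_vote(answers):
    """Naive baseline: every model's answer counts equally."""
    return Counter(answers).most_common(1)[0][0]

def entropy_weighted_vote(answers, answer_dists):
    """Weight each answer by the model's confidence, taken here as
    1 - normalized entropy of its answer distribution.
    (Hypothetical stand-in for the paper's confidence weighting.)"""
    scores = defaultdict(float)
    for ans, dist in zip(answers, answer_dists):
        entropy = -sum(p * math.log(p) for p in dist if p > 0)
        max_entropy = math.log(len(dist)) if len(dist) > 1 else 1.0
        confidence = 1.0 - entropy / max_entropy
        scores[ans] += confidence
    return max(scores, key=scores.get)
```

Under this rule a single confident model can outvote an uncertain majority: given answers `["A", "B", "B"]` where the first model's distribution is sharply peaked and the other two are nearly flat, majority voting returns "B" while the entropy-weighted vote returns "A". Running both on the same transcripts at the same budget would isolate the weighting's contribution.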


Replies

scottmu | today at 8:01 PM

Great question. What I can say is we experimented a _ton_. If you take a basic approach and simply ask the same prompt of a bunch of LLMs, then ask another LLM to combine the results, you get a pretty poor answer. At best, you get a response that is the average of the ensemble, which by definition is worse than the best model in the ensemble. At worst, you regurgitate the worst model in the ensemble. And you pay the added expense and latency on top of that. Not a good solution at all. Of course, you also need a mechanism to choose the ensemble itself effectively.

We didn't experiment with different ensemble mechanisms rigorously enough for a research paper. We will, though.

Majority voting was actually how we started, and we came up with great mechanisms for stopping early, which saved token cost and time, along with other interesting things we could do with that simple mechanism. The issue was that the orchestration could already choose a single model beforehand that performed almost as well (on simpler benchmarks than HLE that we ran at the time) as the answer majority voting picked after all the responses were complete. And we tried many voting mechanisms, such as having every model in the ensemble vote on all the others' responses.
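To sketch the early-stopping idea (a simplified illustration, not our actual orchestration code; `models` as a list of prompt-to-answer callables is a hypothetical interface): query models one at a time and stop as soon as the leading answer can no longer be overtaken by the models not yet queried.

```python
from collections import Counter

def vote_with_early_stopping(models, prompt):
    """Sequential majority vote that stops once the outcome is decided,
    saving tokens and latency vs. always running the whole ensemble.
    Returns (winning_answer, number_of_models_actually_queried)."""
    counts = Counter()
    for i, model in enumerate(models):
        counts[model(prompt)] += 1
        remaining = len(models) - (i + 1)
        leader, top = counts.most_common(1)[0]
        runner_up = max((c for a, c in counts.items() if a != leader), default=0)
        # Even if every remaining model backed the runner-up, the leader
        # would still win, so we can stop querying here.
        if top > runner_up + remaining:
            return leader, i + 1
    return counts.most_common(1)[0][0], len(models)
```

For example, with a five-model ensemble whose first three responses agree, the loop stops after three calls: the two unqueried models can no longer change the outcome.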

An ablation study would be great to do now, along with many of the other ideas we've played with. We have better benchmarks than we did just a few months ago, and it would be valuable to understand the tradeoffs of the different approaches so we could offer alternative options for different use cases.