Is an extra 7 percent on the HLE benchmark really worth the cost of running an entire ensemble of models?

wavemode • today at 6:29 PM • 2 replies

Replies

I mentioned in another comment that I make sure the cost and time stay within 1.25x of the next-best single-model run. It's not perfect, but I think that aspect will only get better with time.

Of course I'm biased, but using Sup has been great for me personally. Even disregarding the HLE score, seeing many different perspectives in the answers, and most importantly the combined answer, has been very helpful as feedback on architectural decisions I make for Sup, and for many other questions I would normally ask ChatGPT/Gemini/Claude/Grok individually.

kelseyfrog • today at 6:52 PM

Depends on the use-case and requirements.

