wavemode today at 6:29 PM

Is an extra 7 percent on the HLE benchmark really worth the cost of running an entire ensemble of models?


Replies

kenmu today at 7:50 PM

I mentioned in another comment that I make sure the cost/time is within 1.25x of the next best single-model run. So it's not perfect, but I think that aspect will only get better with time.

Of course I'm biased, but using Sup has been great for me personally. Even disregarding the HLE score, having many different perspectives in the answers, and most importantly the combined answer, has been very helpful as feedback on architectural decisions I make for Sup, and on many other questions I would normally ask ChatGPT/Gemini/Claude/Grok individually.

kelseyfrog today at 6:52 PM

Depends on the use case and requirements.