I mentioned in another comment that I make sure the cost/time is within 1.25x of the next best single-model run. So it's not perfect, but I think that aspect will only get better with time.
Of course I'm biased, but using Sup has been great for me personally. Even disregarding the HLE score, having many different perspectives in the answers, and most importantly the combined answer, has been very helpful for getting feedback on architectural decisions I make for Sup, and for many other questions I would normally ask ChatGPT/Gemini/Claude/Grok individually.