deaux • yesterday at 3:29 PM • 2 replies

1. Would be good to benchmark at least one other model from a different family to see if it indeed generalizes. Minimax 2.7 seems a good candidate to keep it affordable. Until then we can't really tell if it's just overfit on Gemini 3 Flash.

2. Until then your landing page needs to mention all the numbers are just from running on Gemini 3 Flash. Currently there's no mention at all of Gemini.

3. I assume cheaper also means faster here, since the model is the same? If so, why not add time to task completion to the benchmarks to highlight another advantage? If it's the opposite and it takes longer (seems unlikely), it would be transparent to note that.

4. Would be good to note whether it supports skills, (nested) AGENTS.md, MCP and so on, for people considering migrating.

Replies

GodelNumbering • yesterday at 4:53 PM

Good points.

1. I have been trying to benchmark open-weights models but keep running into timeouts due to slow inference (terminal bench tasks have strict timeouts that you are not allowed to modify). I posted my frustration here: https://www.reddit.com/r/LocalLLaMA/comments/1stgt39/the_fru...

2. Done (updated github readme)

3. Yes, on average the times were shorter, but I did not benchmark this because the model's outputs get slower at random times, so it would not be a rigorous benchmark

4. Added info on this too
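
On point 3, one way to report timings despite occasional slow outputs is to repeat each task and summarize with statistics that tolerate outliers. Below is a minimal sketch of that idea, not what the project actually does: `time_runs` and `summarize` are hypothetical helper names, and `task` stands in for whatever callable launches a single benchmark task.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(task: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `task` `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()  # hypothetical: one full run of a benchmark task
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Return the median and interquartile range of the durations.

    Both are far less sensitive to a few unusually slow runs than the mean.
    Requires at least two durations.
    """
    q1, _, q3 = statistics.quantiles(durations, n=4)
    return {"median": statistics.median(durations), "iqr": q3 - q1}
```

With durations like 1.0, 1.1, 0.9, 1.0 and one 9.0-second outlier, the median stays at 1.0 while the mean would be pulled up to 2.6, so a single slow response does not dominate the reported number.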


Clueed • today at 6:32 AM

I tried it with Minimax 2.7 and it really didn’t like the editing tool, collapsing rather quickly to using sed to edit files.

I guess it makes sense that models don’t generalize perfectly to arbitrary tools but are biased toward those in their training data, especially for a common operation like editing files.

The Gemini family might be a good pick here since it generally underperforms in agentic tasks (due to lack of training data or other reasons) and thus might not have this inherent bias towards specific tools.
