Gets 10/10 on my potato benchmarks:

XCSme • today at 6:54 PM • 3 replies

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

Replies

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like sampling random starting points, running gradient descent from each, and stumbling upon the right result.

thevinter • today at 8:42 PM

Are you intentionally keeping the benchmarks private?


XCSme • today at 8:55 PM

Added one more test, which, surprisingly, Gemini Flash 3 reasoning passes but Gemini 3.1 Pro does not.
