Hacker News

XCSme today at 6:54 PM

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...


Replies

XCSme today at 6:58 PM

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like sampling random starting points, running gradient descent from each of them, and stumbling upon the right result from one of those runs.
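
To make the analogy concrete, here is a minimal sketch of random-restart gradient descent. Everything in it is an assumption chosen for illustration and is not taken from the benchmark: the toy objective `f`, its derivative `grad`, the helper names `descend` and `random_restart`, and the learning rate, step count, and sampling range. A single start can settle in the shallow minimum, while one of several random starts will usually land in the deeper one.

```python
import random


def f(x):
    # Toy objective (an assumption for illustration): two minima,
    # a shallow one near x = 1.13 and a deeper one near x = -1.30.
    return x**4 - 3 * x**2 + x


def grad(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from a single starting point.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def random_restart(n_starts=20, seed=0, low=-2.0, high=2.0):
    # Sample starting points at random, descend from each,
    # and keep whichever end point has the lowest objective value.
    rng = random.Random(seed)
    ends = [descend(rng.uniform(low, high)) for _ in range(n_starts)]
    return min(ends, key=f)


if __name__ == "__main__":
    print("single start at 1.5 ends at", round(descend(1.5), 3))
    print("best of 20 random starts:", round(random_restart(), 3))
```

The point of the sketch is only that more starting points give more chances for one run to fall into the right basin, which is roughly what a long reasoning trace seems to be doing when it tries many angles.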

thevinter today at 8:42 PM

Are you intentionally keeping the benchmarks private?

XCSme today at 8:55 PM

Added one more test. Surprisingly, Gemini Flash 3 reasoning passes it, but Gemini 3.1 Pro does not.