
vlovich123 last Saturday at 1:34 AM

One classic problem across all of ML is ensuring that the benchmark is representative and that the algorithm isn’t overfitting to it.

This remains an open problem for LLMs - we don’t have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it’s delivered huge gains across the board that aren’t just overfitting to benchmarks.


Replies

ipaddr last Saturday at 2:11 AM

This could be a solved problem. Come up with problems that aren't online and compare models on those. Later, use LLMs to sort through your problems and classify them from easy to difficult.
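
A rough sketch of what that comparison might look like, assuming exact-match grading and a model exposed as a plain function from prompt to answer. Every name here (`Problem`, `accuracy`, `overfit_gap`, `toy_model`) is made up for illustration and not taken from any real eval harness:

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    expected: str


def accuracy(model, problems):
    """Fraction of problems the model answers exactly right."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(1 for p in problems if model(p.prompt).strip() == p.expected)
    return correct / len(problems)


def overfit_gap(model, public, private):
    """Public score minus private score.

    A large positive gap hints that the model learned the public
    problems rather than the underlying skill.
    """
    return accuracy(model, public) - accuracy(model, private)


# Hypothetical toy "model" that has only memorized the public answers.
memorized = {"2+2": "4", "3*3": "9"}


def toy_model(prompt):
    return memorized.get(prompt, "?")


public = [Problem("2+2", "4"), Problem("3*3", "9")]      # available online
private = [Problem("5+7", "12"), Problem("6*4", "24")]   # never published

gap = overfit_gap(toy_model, public, private)
print(f"public-vs-private gap: {gap:.2f}")
```

The catch is that the private set only stays useful while it stays private, and exact-match grading is itself an assumption that won't fit open-ended tasks.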
