Tiny model overfit on benchmark published 3 years prior to its training. News at 10.
It wasn't important enough to make the 11 o'clock program.
But GPT-3.5 was benchmaxxing too.