NitpickLawyer • yesterday at 4:29 PM • 2 replies

It would be trivial to detect such gaming, though. That's the beauty of the test, and that's probably why they're not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...
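
To make the "x _verb_ on y" idea concrete, here is a rough sketch of how one might generate held-out prompt variations. The word lists, the prompt wording, and the `held_out_prompts` helper are all hypothetical, made up for illustration; they are not taken from any actual benchmark.

```python
import itertools
import random

# Hypothetical word lists, chosen only for illustration.
ANIMALS = ["a pelican", "an owl", "a crow"]
VERBS = ["riding"]
VEHICLES = ["a bicycle", "a lawnmower", "a unicycle"]


def held_out_prompts(known, seed=0):
    """Return every "x verb y" prompt that is not in `known`, shuffled."""
    combos = itertools.product(ANIMALS, VERBS, VEHICLES)
    prompts = [f"Generate an SVG of {x} {verb} {y}" for x, verb, y in combos]
    # Drop the prompts a model may have been specifically tuned on.
    novel = [p for p in prompts if p not in known]
    # Seeded shuffle so the evaluation order is reproducible.
    random.Random(seed).shuffle(novel)
    return novel


# The well-known prompt is treated as potentially "gamed".
known = {"Generate an SVG of a pelican riding a bicycle"}
novel = held_out_prompts(known)
```

If a model does much better on the known prompt than on the novel ones, that gap would hint at special-casing rather than general drawing ability.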

Replies

kennykartman • today at 2:42 PM

Sure, I agree! I did not expect to see better results because LLMs had improved significantly in their visual-spatial reasoning, but simply because I expected more people to be drawing SVGs of pelicans on bikes and more LLMs to be ingesting them. This is what I find a bit surprising.

Sharlin • yesterday at 4:38 PM

It could still be trained as a special case via RLHF, just not to perfection.
