Hacker News

btown · yesterday at 6:01 PM

Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/


Replies

l_eo · yesterday at 8:55 PM

They will start to max out this benchmark as well at some point.

gabiruh · yesterday at 7:38 PM

It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.

segmondy · today at 1:01 AM

This is actually a good benchmark. I used to roll my eyes at it. Then I decided to apply the same idea and ask the models to generate an SVG image of "something" (I'm not going to say what). There was a strong correlation between how good the models are and the images they generated. These were also non-vision models, so I don't know if you're being serious, but this is a decent benchmark.