Hacker News

simonw yesterday at 5:47 PM

Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.


Replies

btown yesterday at 6:01 PM

Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

_joel yesterday at 6:09 PM

Now this is the test that matters, cheers Simon.

solarized yesterday at 8:22 PM

This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

pwython yesterday at 6:25 PM

How many pelican-riding-a-bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...

RC_ITR yesterday at 9:23 PM

The bird has no wings, yet all of us are calling it a 'solid bird'. That is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet', which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for MMLU and AIME) to be 'solved' is really concerning to me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.
