Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...
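For anyone who wants to try this themselves, here's a minimal sketch using OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and output filename are placeholders (the comment doesn't say which model produced the gist); the prompt is the benchmark's standard one.

    # Minimal sketch: ask a model on OpenRouter for the pelican SVG.
    # Assumes the openai Python package (v1.x) and an OpenRouter API key.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # placeholder slug; any model works
        messages=[
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle"},
        ],
    )

    # Save the raw model output; many models wrap the SVG in extra prose,
    # so you may need to extract the <svg>...</svg> block before viewing.
    with open("pelican.svg", "w") as f:
        f.write(response.choices[0].message.content)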
Solid bird, not a great bicycle frame.
Now this is the test that matters, cheers Simon.
This pelican benchmark has become irrelevant; pelican-on-bicycle SVGs are already ubiquitous. We need a new, uncontaminated scenario.
How many pelican-riding-a-bicycle SVGs even existed before this test did? What if the training data is now being polluted with all these wonky results...
The bird has no wings, yet we all call it a 'solid bird'; that's one of the most telling examples of the AI expectations gap yet. The model's own reasoning even says the pelican needs 'webbed feet', which are nowhere to be found in the image.
This pattern of treating 90% accuracy (the level we've seemingly stalled out at on MMLU and AIME) as 'solved' really concerns me.
AGI has to be 100% right 100% of the time to be AGI, and we aren't being tough enough on these systems in our evaluations. We keep moving on to new, impressive tasks toward some imagined AGI goal without even trying to find out whether we can build true Artificial Niche Intelligence.
Thank you for continuing to maintain the only benchmarking system that matters!
Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/