Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...
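For anyone who wants to try this themselves, here's a minimal sketch using OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and output filename are placeholders (the comment doesn't say which model produced the gist); the prompt is the benchmark's standard one.

    # Minimal sketch: ask a model on OpenRouter for the pelican SVG.
    # Assumes the openai Python package (v1.x) and an OpenRouter API key.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # placeholder slug; any model works
        messages=[
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle"},
        ],
    )

    # Save the raw model output; many models wrap the SVG in extra prose,
    # so you may need to extract the <svg>...</svg> block before viewing.
    with open("pelican.svg", "w") as f:
        f.write(response.choices[0].message.content)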
Solid bird, not a great bicycle frame.
Now this is the test that matters, cheers Simon.
This pelican benchmark has become irrelevant; pelican-on-bicycle SVGs are already ubiquitous. We need a new, uncontaminated scenario.
How many pelican-riding-a-bicycle SVGs even existed before this test did? What if the training data is now being polluted with all these wonky results...
The bird has no wings, yet we all call it a 'solid bird'; that's one of the most telling examples of the AI expectations gap yet. The model's own reasoning even says the pelican needs 'webbed feet', which are nowhere to be found in the image.
This pattern of treating 90% accuracy (the level we've seemingly stalled out at on MMLU and AIME) as 'solved' really concerns me.
AGI has to be 100% right 100% of the time to be AGI, and we aren't being tough enough on these systems in our evaluations. We keep moving on to new, impressive tasks toward some imagined AGI goal without even trying to find out whether we can build true Artificial Niche Intelligence.
Thank you for continuing to maintain the only benchmarking system that matters!
Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/