Hacker News

simonw yesterday at 5:06 PM | 22 replies

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...


Replies

keyle today at 12:11 AM

It's pretty safe to say that AI will be used on the battlefield, making real life-and-death decisions, before it will be able to render a decent pelican on a bike in SVG.

GistNoesis yesterday at 6:12 PM

> the bicycle frame is the correct shape

No, the handlebar is wrong. It rotates the frame instead of rotating the front wheel. The handlebar should be mounted on the same line as the front wheel.

Hopefully 4.9 will read my comments :)

eminence32 yesterday at 8:30 PM

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

jonas21 yesterday at 5:20 PM

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

lysecret today at 1:56 PM

Sadly, I think the correlation between this benchmark and performance is starting to break down. Still, it's a legendary idea that will be remembered and ingrained in the models forever, haha.

simonw yesterday at 7:46 PM

Here are pelicans at all of the thinking levels: low, medium, high, xhigh, max.

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

spmartin823 yesterday at 5:32 PM

You've peed in the pool, Simon; this has to be a part of the internal evals by now! You've got to try something new. Maybe a panda in a canoe?

ceroxylon yesterday at 5:33 PM

I really like that thinking level high gave the pelican a helmet.

Xunjin yesterday at 5:23 PM

Hey simonw, I love your test. Do you think using thinking level "max" makes sense for it? I would love to see the results.

alex_duf today at 9:28 AM

It's funny that we've reached the level where LLMs draw more correct bikes than a random person.

yanis_t yesterday at 5:15 PM

Simon, does your pelican test really capture differences among models, or should you at least try it 10 times or so to average out the random effects?

silisili yesterday at 6:53 PM

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

impalallama yesterday at 9:38 PM

I actually like 4.7 the most, interestingly enough. Not that you can "objectively" weigh artistic output like this.

toastmaster11 yesterday at 6:12 PM

I find the most miraculous thing about 4.7 to be that the pelican is facing left. I wonder why right-facing everything is so ubiquitous in these images.

prmoustache yesterday at 10:41 PM

I don't see how a frame without a headtube can be "the correct shape".

timsuchanek yesterday at 6:00 PM

Thanks for always providing this so promptly. I wonder what the next, harder challenge could be. Maybe some animated SVG?

1attice yesterday at 5:18 PM

That little red hat on hard mode is sending me. 4.8 has whimsy.

nickvec yesterday at 5:11 PM

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

fragmede yesterday at 8:47 PM

For comparison, what's GPT-5.5 producing today?

highwaylights yesterday at 6:07 PM

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

whalesalad yesterday at 6:43 PM

Eventually the frontier model folks are going to pick up on your pelican-on-a-bike test and bake in flawless results for that particular request.

onlyrealcuzzo yesterday at 5:09 PM

4.7 reigns supreme IMO.