The bicycle frame is a bit wonky but the pelican itself is great:

simonw • yesterday at 5:58 PM • 22 replies

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

Replies

stkai • yesterday at 6:56 PM

Would love to find out they're overfitting for pelican drawings.

gcanyon • yesterday at 6:44 PM

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

franze • yesterday at 8:43 PM

Here's the animated version: https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...

einrealist • yesterday at 6:10 PM

They trained for it. That's the +0.1!

etwigg • yesterday at 10:00 PM

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment, Simon!

zahlman • yesterday at 9:10 PM

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

athrowaway3z • yesterday at 6:15 PM

This benchmark inspired me to have codex/claude build a DnD battlemap tool with SVGs.

They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark, you could take one agent and tell it to use a coding agent (via tmux) to build you a pelican.

hoeoek • yesterday at 6:07 PM

This really is my favorite benchmark

eaf7e281 • yesterday at 6:11 PM

There's no way they actually work on training this.

beemboy • yesterday at 8:57 PM

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

bityard • yesterday at 7:37 PM

Well, the clouds are upside-down, so I don't think I can give it a pass.

nine_k • yesterday at 7:43 PM

I suppose the pelican must now be specifically trained for, since it's a well-known benchmark.

7777777phil • yesterday at 6:13 PM

Best pelican so far, would you say? Or where does it rank in the pelican benchmark?

nubg • yesterday at 6:04 PM

What about the Pelo2 benchmark? (the gray bird that is not gray)

copilot_king_2 • yesterday at 6:17 PM

I'm firing all of my developers this afternoon.
