Hacker News

simonw yesterday at 5:58 PM (22 replies)

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...


Replies

stkai yesterday at 6:56 PM

Would love to find out they're overfitting for pelican drawings.

gcanyon yesterday at 6:44 PM

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

einrealist yesterday at 6:10 PM

They trained for it. That's the +0.1!

etwigg yesterday at 10:00 PM

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment, Simon!

zahlman yesterday at 9:10 PM

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

athrowaway3z yesterday at 6:15 PM

This benchmark inspired me to have codex/claude build a DnD battlemap tool with SVGs.

They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like "don't put walls on roads or water".

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark you could try having one agent and telling it to use a coding agent (via tmux) to build you a pelican.

hoeoek yesterday at 6:07 PM

This really is my favorite benchmark.

eaf7e281 yesterday at 6:11 PM

There's no way they actually work on training this.

beemboy yesterday at 8:57 PM

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

bityard yesterday at 7:37 PM

Well, the clouds are upside-down, so I don't think I can give it a pass.

nine_k yesterday at 7:43 PM

I suppose the pelican must now be specifically trained for, since it's a well-known benchmark.

7777777phil yesterday at 6:13 PM

Best pelican so far, would you say? Or where does it rank in the pelican benchmark?

nubg yesterday at 6:04 PM

What about the Pelo2 benchmark? (the gray bird that is not gray)

copilot_king_2 yesterday at 6:17 PM

I'm firing all of my developers this afternoon.

6thbit yesterday at 8:23 PM

Do you have a GIF? I need an evolving pelican GIF.

risyachka yesterday at 8:06 PM

Pretty sure at this point they train it on pelicans.

ares623 yesterday at 6:02 PM

Can it draw a different bird on a bike?

DetroitThrow yesterday at 6:01 PM

The ears on top are a cute touch.

iujasdkjfasf yesterday at 9:20 PM

[dead]

behnamoh yesterday at 6:35 PM

[flagged]

fullstackchris yesterday at 9:21 PM

[flagged]
