logoalt Hacker News

simonwyesterday at 5:38 PM18 repliesview on HN

I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/mai...

It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/


Replies

realityfactchexyesterday at 10:55 PM

Here's a reproduction attempt (LM Studio, same Qwen3.6-35B-A3B-GGUF model as linked in parent, M1 Max 64GB, <90 seconds):

https://files.catbox.moe/r3oru2.png

- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.

- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).

- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.

- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.

But, I know, it's more about the pelican and the bicycle.

show 1 reply
GistNoesisyesterday at 10:24 PM

Thanks for pointing to the GGUF.

I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agentic oritened task consisiting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )

I noted a nice improvement over QWen3.5 in its ability to discover new creatures in the open ended searching task, but I've not quantified it yet with numbers. It also seems faster, at around 140 token/s compared to 100 token/s , but that's maybe due to some different configuration options.

Some little difference with QWen3.5 : to avoid crashes due to lack of memory in multimodal I had to pass --no-mmproj-offload to disable the gpu offload to convert the images to tokens otherwise it would crash for high resolutions images. I also used quantized kv store by passing -ctk q8_0 -ctv q8_0 and with a ctx-size 150000 it only need 23099 MiB of device memory which means no partial RAM offloading when I use a RTX 4090.

jubilantiyesterday at 6:16 PM

I wonder when pelican riding a bicycle will be useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before, not in the benchmarks or even something a team would run internally. But now I'd bet internally this is one of the new Shirley Cards.

show 5 replies
culiyesterday at 7:06 PM

the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right

show 1 reply
kelnosyesterday at 8:23 PM

I'm not sure how you can give the flamingo win to Qwen:

* It's sitting on the tire, not the seat.

* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.

* The wheel spokes are bizarre.

* One of the flamingo's legs doesn't extend to the pedal.

* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly. Or the other eye isn't.

* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.

* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.

In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.

show 2 replies
bertiliyesterday at 5:53 PM

It's fascinating that a $999 Mac Mini (M4 32GB) with almost similar wattage as a human brain gets us this far.

show 1 reply
rdslwyesterday at 7:11 PM

interesting, I just tried this very model, unsloth, Q8, so in theory more capable than Simon's Q4, and get those three "pelicans". definitely NOT opus quality. lmstudio, via Simon's llm, but not apple/mlx. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)

show 1 reply
cyclopeanutopiayesterday at 5:59 PM

But that you also gave a win to Qwen on flamingo is pretty outrageous! :)

Tthe right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

show 1 reply
prirunyesterday at 6:41 PM

The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?

show 3 replies
jaspangliayesterday at 10:25 PM

The real question is what the next truly weird, un-optimized prompt will be. Something involving a sloth debugging a quantum computer in MS Paint?"

monksyyesterday at 8:59 PM

Hey I really enjoy your blog. On some things I end up finding a blog post of yours thats a year+ old and at other times, you and I are investigating similar things. I just pulled Qwen3.6 - 35b -A3B (Can't believe thats a A3B coming from 35b).

I'm impressed about the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.

In short, keep up the good work with an interesting blog!

bwv848yesterday at 9:00 PM

I've been trying the Q4_K_M version, and sometimes it gets stuck in a loop. Gemma 4 doesn’t have this issue.

show 2 replies
MeteorMarcyesterday at 7:11 PM

Interesting, qwen has the pelican driving on the left lane. Coincidence or has it something to do with the workers providing the RL data?

show 1 reply
jamwiseyesterday at 5:41 PM

I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius

show 1 reply
danielhanchenyesterday at 5:50 PM

Oh that is pretty good! And the SVG one!

quietsegfaultyesterday at 11:37 PM

The qwen flamingo looks like it’s smoking’ a doobie.

slekkeryesterday at 5:48 PM

How does it do with the "car wash" benchmark? :D