I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.
With Qwen3's sparse MoE, though, the path to that memory is noisier: there are two extra stochastic draws, (a) which expert(s) fire, and (b) which token gets sampled from them. Add the new gated attention and multi-token prediction heads, and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.
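To make that concrete, here's a toy Python sketch of those two draws. The router, expert heads, and weights are all made up for illustration and have nothing to do with Qwen3's actual architecture or code; the point is only that the top-k routing pick and the token sample are two separate chances for the same prompt to produce a different character at some row of the picture.

    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB = 8       # tiny toy vocabulary
    N_EXPERTS = 4   # hypothetical expert count, not Qwen3's
    TOP_K = 2       # experts activated per token

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_step(hidden, w_router, expert_heads, temperature=1.0):
        # draw (a): top-k experts chosen from jittery router scores
        # (small Gumbel noise stands in for routing instability)
        router_logits = w_router @ hidden + 0.1 * rng.gumbel(size=N_EXPERTS)
        experts = np.argsort(router_logits)[-TOP_K:]
        gates = softmax(router_logits[experts])

        # mix the chosen experts' token logits, weighted by the gate values
        token_logits = sum(g * (expert_heads[e] @ hidden) for g, e in zip(gates, experts))

        # draw (b): sample the next token from the mixed distribution
        probs = softmax(token_logits / temperature)
        return rng.choice(VOCAB, p=probs), sorted(experts.tolist())

    # made-up weights; a "dud" expert here is just one whose head ranks the wrong token highest
    hidden = rng.normal(size=16)
    w_router = rng.normal(size=(N_EXPERTS, 16))
    expert_heads = rng.normal(size=(N_EXPERTS, VOCAB, 16))

    for trial in range(3):
        token, experts = moe_step(hidden, w_router, expert_heads)
        print(f"trial {trial}: experts {experts} -> token {token}")

Run it a few times and either draw alone can flip the emitted token; over a few hundred characters of ASCII art those flips compound.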
Anyway, I think Qwen3-Coder was specifically trained on this one, so it's not a fair comparison. Here are some other Qwen3 models:
Model: chutes/Qwen/Qwen3-235B-A22B
/~\
( * * )
( o o o )
\ - /
\ /\ /
\ /
\/
/|||\
/|||||\
/||||||||\
( o o o )
\ W /
\___/
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507
/\_/\
( o.o )
> ^ <
/ \
| |
| |
\ /
'-'-'
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507
.-----------.
| [] [] |
| |
| __ __ |
| | || | |
| |__||__| |
| |
'-----------'
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct
__
/ \
/ \
/ \
/ \
/__________\
| o o |
| __ |
| \____/ |
| |
| ________ |
| \ / |
| \____/ |
| |
|____________|
___________
/ \
/ \
| _______ |
| | | |
| | ___ | |
| | | | | |
| | |___| | |
| |_______| |
| |
|_______________|
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking
.-.
/ \
| o o|
| > |
| ---|
\___/
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507
_________________________
/ \
| _ _ _ _ |
| / \ / \ / \ / \ |
| | | | | | | | | |
| \_/ \_/ \_/ \_/ |
| |
| _ _ _ _ |
| / \ / \ / \ / \ |
| | | | | | | | | |
| \_/ \_/ \_/ \_/ |
| |
| SpongeBob SquarePants |
|_________________________|