Hacker News

andybak | yesterday at 1:26 PM | 3 replies

> draw a pretty good pelican on a bike.

You mean the famously hard task? The one picked because it stretches frontier models to their limits?


Replies

munk-a | yesterday at 5:02 PM

It was a famously hard task. It was an ingenious choice of an unexpected task that falls outside the bounds of predictable, normal input but is still readily comprehended by the public.

Unfortunately, as soon as it's a famously hard task, trainers know they need to succeed at it, and it loses much of its power to reveal genuine capability.

quantummagic | yesterday at 2:11 PM

In fairness, that isn't due to a lack of compute.

daveguy | yesterday at 2:09 PM

https://simonwillison.net/2026/Apr/22/qwen36-27b/

Maybe this is an example of training overfit. But it won't be too long before local models chew through the "famously hard tasks". The possible exception is ARC-AGI, the one benchmark that keeps evolving alongside model capabilities. Every time a new ARC-AGI benchmark is released, it makes the SOTA LLMs look pathetic, because there is very little understanding or transferability in LLMs. But in terms of benchmark-able micro tasks, the local LLMs are improving.
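
For concreteness, here is a minimal sketch of running that kind of micro task against a local model. It assumes an Ollama server on its default port; the model name is just a placeholder, and the prompt is the pelican-on-a-bicycle one.

    import requests

    # Minimal sketch: send the pelican-on-a-bicycle prompt to a locally hosted
    # model via Ollama's /api/generate endpoint and save the returned SVG.
    # The model name below is a placeholder, not a recommendation.
    PROMPT = "Generate an SVG of a pelican riding a bicycle"

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5-coder:7b", "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()

    # Models usually wrap the SVG in prose or a code fence, so extract just
    # the <svg>...</svg> markup before writing it to disk for inspection.
    text = resp.json()["response"]
    start, end = text.find("<svg"), text.rfind("</svg>")
    if start == -1 or end == -1:
        raise ValueError("no SVG found in the model output")

    with open("pelican.svg", "w") as f:
        f.write(text[start:end + len("</svg>")])

Open pelican.svg in a browser to judge the result; swap in whatever local model you actually have pulled.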