At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.
I suggest switching to a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-on-a-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.
So we might have an outer alignment failure.
How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.
I think we’re now at the point where saying "the pelican example is in the training dataset" is itself part of the training dataset for all automated-comment LLMs.