You’re right, it’s a valid data point. But the U.K. government report is also a data point, and the Firefox report is a data point, and they suggest that it is, indeed, significantly better than current generation models. Maybe curl is significantly better hardened than most projects?
In any event, it barely matters. As Anthropic acknowledges, next level models are comings, theirs is only one of them. Current generation models are already good at things like tracing data flow through complex systems and there’s no reason to think that capability has topped out. So within a year it seems very likely we’ll have more than one commercially available model able to find vulnerabilities cheaply.
On the other hand, it seems that they’ve made much less progress on getting it to design solutions to these issues.
The same UK security research body ran the same CTF against GPT5.5. GPT5.5 got the same result as Mythos.
Anthropic promised us that Mythos was such an existential threat that it would compromise "every OS and browser on devices across the planet". They've held conferences and meetings with banks and govts across the world, shouting how critical this issue is.
GPT5.5 has been out for a month. Every device on earth has not been breached yet. It's very fair to criticize Anthropic's maximalist posturing when it's becoming exceedingly clear their models are fairly behind OpenAI's in capability.
In my opinion, the original commenter's statement stands, and the UK govt data point only helps support that due to the equal result between Mythos and GPT.
I'd advise reading into the specifics of what happened with Firefox; the TL;DR is a reduced safety version of its code was scanned by Opus 4.6 (yes Opus) and found a multitude of bugs and 4 high severity vulns that did not escape sandbox. The Mythos system card test describes running Mythos against the same issues Opus found to see if it could reliably replicate and chain together an attack.
I think for every point, we need to know how many tokens and cost were burned to achieve a desired outcome. And how buggy each software was to start.
> Maybe curl is significantly better hardened than most projects?
Meanwhile from [1]:
"Not even half-way through this #curl release cycle we are already at 11 confirmed vulnerabilities - and there are three left in the queue to assess and new reports keep arriving at a pace of more than one/day."
"The simple reason is: the (AI powered) tools are this good now. And people use these tools against curl source code.They find lots of new problems no one detected before. And none of these new ones used Mythos. Focusing on Mythos is a distraction - there are plenty of good models, and people who can figure out how to get those models and tools to find things."
Yeah, it looks like there are at least 11 security bugs missed by Mythos.
[1] https://www.linkedin.com/feed/update/urn:li:activity:7463481...