Hacker News

traceroute66 · yesterday at 6:41 PM

> Uh, so those charts don’t look… particularly impressive at all to anyone else?

I suspect Anthropic gave them early access hoping for a marketing win and ended up with their arse being served to them on a plate.

All rather predictable, really. As you say, "more compute needed" as the default answer from the AI companies is completely unsustainable.

As for the value of Anthropic blog posts, well...


Replies

refulgentis · yesterday at 6:51 PM

The CTF charts are the less interesting result. (From the article: "Even expert-level CTFs only test specific skills in isolation.") Models converging at non-expert level isn't a knock on Mythos; it's the benchmark saturating. Of course GPT-5 matches it there.

The actual result is TLO, and "only 6 more steps" in the OP misreads how sequential attack chains work. These aren't independent puzzles: each step gates the next. Averaging 22 steps versus 16 means Mythos is consistently punching through bottlenecks that completely stop Opus 4.6.

More importantly: Mythos completed the full chain 3/10 times. Opus 4.6 completed it 0/10 times. That's not a narrow margin. In any security-relevant framing, "achieves full network takeover" vs "does not achieve full network takeover" is a binary threshold, and exactly one model crossed it.

A year ago the best models struggled with beginner CTFs. Now one autonomously replicates what AISI estimates takes human professionals 20 hours. Calling that unimpressive because the margin over second place is single digits is measuring the wrong gap.
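To see why gating makes averages misleading, here's a toy model (my own illustration, not the benchmark's actual scoring; the per-step probabilities and chain length are made up): if every step must succeed before the next can be attempted, a small per-step edge compounds multiplicatively over the chain.

```python
# Toy model of a gated attack chain: step k can only be attempted
# if steps 1..k-1 all succeeded. Numbers below are hypothetical,
# chosen only to show the compounding effect.

def full_chain_prob(p_step: float, n_steps: int) -> float:
    """Probability of completing all n gated steps."""
    return p_step ** n_steps

def expected_steps(p_step: float, n_steps: int) -> float:
    """Expected number of steps completed before the first failure:
    sum over k of P(at least k steps succeed) = sum of p^k."""
    return sum(p_step ** k for k in range(1, n_steps + 1))

n = 25  # hypothetical chain length
for p in (0.90, 0.95):
    print(f"p={p}: avg steps ~ {expected_steps(p, n):.1f}, "
          f"full-chain rate ~ {full_chain_prob(p, n):.2f}")
```

Under these made-up numbers, nudging per-step reliability from 0.90 to 0.95 only adds a handful of steps to the average, but roughly quadruples the chance of finishing the whole chain. That's the shape of the 22-vs-16 / 3-of-10-vs-0-of-10 gap: the average understates what the completion rate shows.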

re: compute, "requires lots of compute" and "scaling is a dead end" are near-opposite claims. If performance is still climbing at 100M tokens with no visible plateau, that's evidence scaling works. Whether it's cheap today is a different question, and not one that ages well. Compute costs fall reliably, so what matters is the capability at a given price point in 18 months, not today.
