If you set aside political menace, this is a huge problem with Anthropic's strategy. You _can...

martinald • today at 11:07 AM • 9 replies • view on HN

If you set aside political menace, this is a huge problem with Anthropic's strategy.

You _cannot_ say that Mythos is super dangerous and can only be rolled out to certain people, but then release Fable with anything other than bulletproof cyber denials.

Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.

So you've ended up in a situation where Anthropic are simultaneously claiming it's a incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".

As technical people we understand that nothing can be perfect, esp in LLM world. But all my non technical friends were really confused how they had managed to make the model "safe" so quickly when it was released and the general sentiment was it shouldn't have been released - and now to an outsider I think it looks like it was never safe at all to release, so I can totally see how the current US administration have got themselves very upset with it.

_Even if_ there was no political bad will, it's a bit of a silly scenario to end up in, and really quite easily foreseen.

Replies

pjc50 • today at 12:03 PM

> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work

Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.

But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.

Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's a three way problem already between the devs, the black hats, and the white hats.

Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.

➕ show 7 replies

amalcon • today at 1:23 PM

I do find it hilarious that Asimov wrote many stories about how simple bright-line rule-based systems are ineffective for restricting agency. Those stories were first published in the 1940s.

80 years later, we have something approximating AI, and we're trying to restrict it with simple bright-line rules. Not because we never learned that lesson, but because we simply haven't come up with a better way to do it. Probably because a better way to do it just doesn't exist.

The hilarious part, though, is that it's not the AI that's working around the rules. That's the scenario that's been in science fiction, but it's not what's happening. It's the human users making use of our agency to get the AI agents to work around the rules. Despite calling them "agents", current AI agents don't seem to be able to that particular something. Yet, at least.

➕ show 2 replies

cge • today at 12:11 PM

> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.

As a scientist who repeatedly ran into the classifier-based denials: it appears Anthropic’s strategy to make denials more robust, at the cost of many false positives, was to have a separate classifier processing both input and output tokens, at an extremely simple, almost keyword-search level. One weakness of this approach is that it only catches things that use the right keywords: it is in some sense weak exactly where an LLM-based classifier would be stronger.

Work on abstract, closer-to-CS algorithms that used chemistry terminology were blocked immediately, while work directly relevant to chemistry/biology experiments, writing code to process images from a very specific microscopy setup relevant primarily to biological samples, was never blocked at all, because it happened to never use relevant keywords.

That’s consistent with this situation: finding and fixing bugs in the context of looking for bugs perhaps happened to never use words like ‘exploit’ or ‘cybersecurity’.

➕ show 2 replies

ceejayoz • today at 11:20 AM

> it shouldn't have been released

The genie is out of the bottle either way.

Unless we believe Anthropic has a wizard or superhero secreted away that no one else can replicate.

➕ show 1 reply

wrsh07 • today at 1:18 PM

While I agree that anthropic has several communication and PR problems, it doesn't seem like Fable has been shown to offer any advantage here (for cyber offensive capabilities) over the previous state of the art.

I'm not saying all of Anthropic's statements are true, but mythos did seem to find many legitimate security exploits. You should be able to talk about a helpful-only model being released to limited partners while still releasing a very locked down model that doesn't advance the state of the art on these things, and that seems to be what they did.

There's no inherent contradiction to that.

embedding-shape • today at 2:57 PM

> So you've ended up in a situation where Anthropic are simultaneously claiming it's a incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".

They probably say it worked for OpenAI with earlier versions of ChatGPT and GPT, and figured can't hurt to try an similar approach and see what happens.

giancarlostoro • today at 2:34 PM

Yeah, if Anthropic didn't spend the last what? Month? Month plus telling us how dangerous it was, I would be more upset, but they told us how dangerous it was, and they also said they would scour all your prompting / data (??) if you used it, I noped out of that one. Opus does everything I need it to, even if it takes me "longer" or I have to compact and feed it more context, that's fine by me. Still saves me weeks of effort.

piokoch • today at 1:06 PM

If it weren't for the IPO, Anthropic would just ship another model, called Opus 4.898, people would run another "duck on the bicycle" test that would be slightly better than the one from previous version 4.897 and move on.

But we have IPO coming, hence we face that big drama about model that would enable Iran to produce nukes, ok, that card was played, so maybe Taliban producing some magic poison to kill all Americans or some really bad people (Venezuelans?, Cubans? Somalian football referees?) to break into Github and make Github Actions working even worst (if this is even possible).

0xbadcafebee • today at 2:24 PM

It's not Anthropic's strategy, it's OpenAI's strategy. The first time OpenAI said its model was "too dangerous to release" was February 2019.

"Our model, called GPT‑2 (a successor to GPT ), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model." - https://openai.com/index/better-language-models/

They continue to say the same thing every year. Last time was 2 months ago (https://www.techbrew.com/stories/2026/04/15/calculated-risks...).

alt Hacker News

Replies