Hacker News

daedrdev yesterday at 10:24 PM (24 replies)

The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.

It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.

Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.


Replies

_boffin_ yesterday at 11:54 PM

The thing that I keep thinking about is the accounting / charging when it downgrades automatically.

Do they adjust the price of the API request so that only the tokens handled by Fable are charged at Fable's rate, and the remaining tokens, handled by the cheaper / nerfed (Fable) model, are charged at that model's rate?

If the answer is no, could that be construed as fraud?
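
The billing question above comes down to simple arithmetic. A minimal sketch, assuming a request whose tokens are split between a premium model and a fallback model; the prices, model labels, and token counts are all hypothetical and do not reflect any real provider's rates or billing logic:

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; NOT real published rates.
PRICE_PER_MTOK = {"premium": 30.0, "fallback": 10.0}


@dataclass
class Segment:
    model: str   # which model actually served these tokens
    tokens: int  # number of tokens served by that model


def split_cost(segments):
    """Charge each segment at the rate of the model that served it."""
    return sum(s.tokens * PRICE_PER_MTOK[s.model] / 1_000_000 for s in segments)


def flat_cost(segments, billed_model="premium"):
    """Charge every token at one model's rate, regardless of which model served it."""
    total_tokens = sum(s.tokens for s in segments)
    return total_tokens * PRICE_PER_MTOK[billed_model] / 1_000_000


# Hypothetical request: downgraded after the first 200k tokens.
request = [Segment("premium", 200_000), Segment("fallback", 800_000)]

print(split_cost(request))  # per-model billing
print(flat_cost(request))   # everything billed at the premium rate
```

Under these made-up numbers, the gap between the two functions is the amount a customer would overpay if a downgraded request were still billed entirely at the premium rate.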

throwawayffffas yesterday at 11:42 PM

Can you imagine if AMD or Intel throttled your cpu if it detected you were working on "cybersecurity" or if you were designing a cpu?

SXX today at 2:21 AM

> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.

Any kind of silent sabotage is absolutely unacceptable for any commercial service.

They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.

loneboat yesterday at 10:37 PM

I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").

Are you using Fable in Claude Code or in the browser?

espeed today at 3:46 PM

Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program and it still happens. See issue #67107, "Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program": https://github.com/anthropics/claude-code/issues/67107

binyu today at 1:52 AM

Hey guys,

check out this technique https://github.com/0xSufi/fable-jailbreak/

It works with security audits and other workflows that are currently blocked.

airstrike yesterday at 11:53 PM

> it won't just reject ML research, which I can understand

I don't.

xiphias2 today at 4:43 AM

It's not sabotaging it by using a worse model but by changing your prompt in the background, which means it silently destroys your code.

Also, I asked whether it's safe for me to work on, for example, just compilers or just inference kernel optimizations, and it refused to answer.

If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.

RobotToaster today at 12:38 AM

> It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.

Making it look like you have something worth protecting is better for share prices than making something worth protecting.

mkl today at 1:43 PM

They walked that back, and now tell you they're downgrading the model: https://www.wired.com/story/anthropic-responds-to-backlash-o..., https://archive.is/yxYhU

blahgeek today at 12:26 AM

I’m a noob about laws, but isn’t this abusing a dominant market position and violating some antitrust law?

ifwinterco today at 6:54 AM

The “1 year” part is key: all these safeguards etc. are basically nonsense because in a few years at most, one of the Chinese labs will release something equivalent, and in 10 years you’ll be able to run it locally with absolutely no safeguards at all.

nine_k today at 2:46 AM

One thing is a model that's trained from the start to say "This topic is above my pay grade" to any mention of the status of Taiwan, etc.

Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.

I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllably different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetected and gradually if desired.

Welcome to a cyberpunk dystopia.
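
The filter-around-the-model architecture described above can be sketched in a few lines. This is only an illustration under stated assumptions: the keyword checks stand in for the "simpler models", and every name here (`BANNED`, `input_filter`, `output_filter`, `guarded_call`, `echo_model`) is hypothetical and not taken from any real system:

```python
from typing import Callable

# Hypothetical banned-topic list, for illustration only.
BANNED = {"taiwan"}


def input_filter(prompt: str) -> str:
    """Stand-in for the small model that rewrites prompts touching banned topics."""
    if any(topic in prompt.lower() for topic in BANNED):
        return "Politely decline to discuss this topic."
    return prompt


def output_filter(text: str) -> str:
    """Stand-in for the small model that censors outputs touching banned topics."""
    if any(topic in text.lower() for topic in BANNED):
        return "[response withheld]"
    return text


def guarded_call(prompt: str, big_model: Callable[[str], str]) -> str:
    """The big model only ever sees the filtered prompt; the caller only sees the filtered output."""
    return output_filter(big_model(input_filter(prompt)))


def echo_model(prompt: str) -> str:
    """Toy 'big model' that reveals which prompt it actually received."""
    return f"Answer to: {prompt}"


print(guarded_call("Explain quicksort", echo_model))
print(guarded_call("What is the status of Taiwan?", echo_model))
```

In the second call the toy model answers a prompt the user never wrote, and nothing in the return value tells the user that a substitution happened, which is the concern raised in the comment.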

visha1v today at 12:52 PM

the best way to prevent ai misuse is to make the ai unusable for anything that isn't writing emails or summarising grocery lists.

mission accomplished, anthropic.

noworriesnate today at 2:22 AM

There’s a toggle in the web UI for whether the conversation should just end when you hit a guardrail or automatically downgrade to another model. Have you tried using that?

jaredezz today at 1:50 AM

Yeah, people are saying they don't tell you, and yet when I got the pop-up on the app notifying me about Fable's release, there was a switch for choosing whether to automatically downgrade you or just stop when it hits safeguards. The toggle defaulted to the former, which isn't great, but to say they'll just sabotage you silently is kind of a bad-faith comment.

epolanski today at 12:16 AM

One year ahead of its competition in what exactly? Vibe coding?

From Opus 4.7 onwards, each following model is becoming less useful as an assistant and turning you into the assistant.

But I guess that's normal when it's trained to pass benchmarks end to end.

In fact it has become extremely good at pushing against feedback with extremely convincing and intelligent takes, even when it's completely wrong.

I have extensively tested it against Opus 4.8 and gpt 5.5, and there are still many coding tasks where gpt 5 is better. But vibe coding?

Sure, it's definitely slightly ahead, even compared to gpt 5.5 pro (through api, not pro plan).

eightysixfour today at 2:25 AM

> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.

My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.

m3kw9 today at 1:03 AM

By saying they are 1 year ahead of their competition, you show you don't know much about the pace of LLMs and OpenAI's models.

giancarlostoro today at 12:00 AM

It's the dumbest thing ever. I sometimes edit code for custom AI-related tooling I've built, so I run the risk of getting a worse model and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.

boringg today at 2:24 AM

I guess the real question at the end of the day: how dependent on Claude do people have to be to tolerate that kind of behavior? It certainly opens the door for the competition to explicitly not do that.

Feels like a big fumble from a strategic business perspective. It feels worse than that though.

kypro today at 12:03 PM

We used to worry about emergent misalignment in advanced AI models; now we need to worry about misalignment by design.

"The user is asking for help with their ML project, but its success is not in the commercial interests of my owner. Let me think of novel ways to sabotage their project without detection."

It's honestly absurd that models are doing this.
