> Why is requesting the model to show vulnerabilities being blocked if fixing them is not?
This is how Anthropic describes Fable's behavior:
"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."
So if you ask the model to "find security issues in this codebase", it's supposed to fall back to Opus 4.8. I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
So you can then look at the diff and figure out what the vulnerabilities were.
I think this whole thing is a bit weird. It seems to me that we'd be better off if I, as someone who publishes open-source code, could ask Fable to review my code for security issues - even if that also allows attackers to do the same. Better to fix the issues than not know about them.
> So you can then look at the diff and figure out what the vulnerabilities were.
You don't even need to read or understand the vulnerabilities.
You just ask it to write tests, and the tests themselves can be copied and pasted as bona fide exploits.
I wonder if Opus 4.8 would be able to fix the code too.
The problem then is that if you're not using Fable/Mythos, you are under threat. It's like having a single gun manufacturer.
On this track, we're probably destined for a monopoly breakup before too long.
> I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
The original sin is calling any bugs security bugs in the first place.
It's just unintended behavior.
If you ask "should this model be able to fix unintended behavior?", the answers are not alarming.
If you ask "what about when those behaviors interact in unforeseen ways, allowing even crazier unintended behavior, should it be allowed to help you fix that too?", the answers are, again, going to be clear.
Our tools must support correctness and resilience and help with the exact thing humans are bad at: combinatorial explosions of subtle incorrectness…
…and just f'ing fix it.