> “Visible safeguards can be probed, so they have to be robust, which takes time to get right,” Anthropic wrote.
Even on Fable, I'm finding that safeguards can quite easily be surmounted just by incrementally escalating the requests. It's harder than ever to one-shot jailbreaks, but incrementalism still feels like a glaring enough issue to make guardrails just a fig leaf of plausible deniability to the media that they care about "safety."