Anthropic injects text into the conversation triggered by certain conversation topics. This happened...

andy99 • today at 5:10 PM • 1 reply • view on HN

Anthropic injects text into the conversation triggered by certain conversation topics. This happened to me in relation to some red-teaming-related discussion that was adjacent to something “sensitive”, I think sex, and Claude got confused about why I had said some kind of warning and mentioned it in its response. After some back and forth it was clear that an extra warning, to answer but avoid anything inappropriate, had been inserted into the conversation.

Replies

wongarsu • today at 5:50 PM

Claude also sometimes mentions getting messages from classifiers, probably related to auto mode. Amusingly enough, when this happens to a subagent/fork, the orchestrator will call these "hallucinations by the subagent".
