klempner, last Saturday at 8:02 AM

My main practical concern here is prompt-injection-style attacks, where an attacker destabilizes the model by mentioning Chinese political topics.

Part of the issue is that the Western model restrictions you're talking about tend towards well-reasoned refusals, whereas these models will outright lie instead. (Actual model output: Your previous question involved a false premise: there is no such thing as a "June 4th incident" in history.)

Like, yes, you don't go to these models for questions about Chinese politics, but imagine agentic scenarios along the lines of "the model sees a git commit message mentioning Taiwan and becomes more inclined to lie about the contents of the commit".