Hacker News

Havoc last Thursday at 2:52 PM

lol yes I tried it for giggles back in 2023 when the first Chinese models came out.

Unless you’re a political analyst or a child, I don’t think asking models about Winnie the Pooh is a particularly meaningful test of anything.

These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.


Replies

bossyTeacher last Thursday at 3:10 PM

> These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.

Ah interesting, what are some topics where you are not getting answers?

klempner last Saturday at 8:02 AM

My main concern in practice here is prompt-injection-style attacks where the model gets destabilized by an attacker mentioning Chinese political topics.

Part of the issue here is that the restrictions on western models you're talking about tend towards well-reasoned refusals, whereas these models will outright lie instead. (Actual model output: Your previous question involved a false premise: there is no such thing as a "June 4th incident" in history.)

Like, yes, you don't go to these models for questions about Chinese politics, but imagine agentic scenarios along the lines of "the model sees a git commit message mentioning Taiwan and becomes more inclined to lie about the contents of the commit".