Hacker News

gck1 yesterday at 8:50 AM

Yes, that's true. Excluding Fable, OAI models are the most refusal-heavy. However, I'd rather get a refusal than a response with poisoned output.

Since there's currently no way to verify whether poisoning happened, I don't trust Anthropic anymore, regardless of what they say.

But my trust towards OAI is also brittle - what if they also do it, or start doing it?

I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
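To make the "category and its hash" idea concrete, here is a minimal sketch of what such a receipt could look like. Everything in it is hypothetical: no provider is known to expose this, and the names `make_receipt`, `check_receipt`, and the receipt fields are invented for illustration. It assumes the provider returns a SHA-256 digest of the prompt the model actually received, plus a category label and digest for each injected piece of steering text.

```python
import hashlib
import hmac


def sha256_hex(text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_receipt(received_prompt: str, injections: list[tuple[str, str]]) -> dict:
    """Hypothetical provider side: commit to the prompt as received and to
    each injected steering text, revealing only its category and digest."""
    return {
        "prompt_sha256": sha256_hex(received_prompt),
        "injections": [
            {"category": category, "sha256": sha256_hex(text)}
            for category, text in injections
        ],
    }


def check_receipt(sent_prompt: str, receipt: dict) -> tuple[bool, list[str]]:
    """Client side: confirm the digest matches the prompt that was sent and
    list the categories of anything the receipt reports as injected."""
    prompt_ok = hmac.compare_digest(sha256_hex(sent_prompt), receipt["prompt_sha256"])
    categories = [item["category"] for item in receipt["injections"]]
    return prompt_ok, categories


# Assumed example values; the steering text stays hidden from the client.
receipt = make_receipt("Summarize this log file.", [("safety", "<hidden steering text>")])
print(check_receipt("Summarize this log file.", receipt))  # (True, ['safety'])
```

This only shows the shape of the data. A bare hash does not prove the provider reported honestly; that would need something stronger, such as a signed receipt or attested execution, which this sketch does not attempt.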


Replies

dannyw yesterday at 9:59 AM

What kind of work are you getting refusals on? Genuinely curious. The only refusal I’ve had in recent memory was the model declining to find doorbell camera footage matching a certain description, which is fair enough; I think EU laws heavily restrict such activities (even though I’m not in the EU).
