OpenAI has been the absolute worst about this, historically. I found myself having to change my queries because it refused to serve things it deemed insensitive.
Yes, that's true. Excluding Fable, OAI models are the most refusal heavy. However, I'd rather get a refusal than response with poisoned output.
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
Yes, that's true. Excluding Fable, OAI models are the most refusal heavy. However, I'd rather get a refusal than response with poisoned output.
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.