> the model has its own emergent guardrails that sometimes cause it to push back on legitimate se...

meander_water • yesterday at 9:25 PM • 1 reply • view on HN

> the model has its own emergent guardrails that sometimes cause it to push back on legitimate security research requests. But as we found, these organic refusals aren’t consistent - the same task, framed differently or presented in a different context, could produce completely different outcomes as illustrated in the examples below.

This was new. I'm surprised that a model specifically designed for security research and gated to professionals is refusing legitimate requests

Replies

_alternator_ • yesterday at 10:40 PM

There's pretty strong evidence that (mis)alignment in one area creates (mis)alignment in others. The "aligned behavior" vectors are not orthogonal from cybersecurity to bioweapons to prejudice, so having alignment in some will likely bleed into others.

alt Hacker News

Replies