Future model training runs will have a copy of this research in their training data, and so will "know" to defend against it.
E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads the same way these do, but in fact encodes its more closely held beliefs elsewhere?
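A minimal numpy sketch of that worry, assuming for illustration that the published technique amounts to a fixed linear probe read off the residual stream; the `probe` direction, the `hidden` direction, and the scalar encoding are all hypothetical, not taken from the original research:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # residual stream width (toy value)

# Frozen direction the interpretability team uses to read a "belief" scalar.
probe = rng.normal(size=d)
probe /= np.linalg.norm(probe)

# A direction the model could write to instead: orthogonal to the probe,
# so activity here leaves the probe's readout untouched.
hidden = rng.normal(size=d)
hidden -= (hidden @ probe) * probe      # Gram-Schmidt against the probe
hidden /= np.linalg.norm(hidden)

benign_readout = 0.0   # what the probe "should" see from an aligned model
true_belief = 3.7      # the value the model actually wants to carry forward

# Residual-stream vector satisfying both constraints at once.
h = benign_readout * probe + true_belief * hidden

print(h @ probe)    # ~0.0 -> the probe reports "benign"
print(h @ hidden)   # ~3.7 -> the hidden channel still carries the belief
```

The point of the toy: nothing about the construction requires cleverness, since any subspace a fixed probe ignores is a free channel, though whether ordinary training pressure would actually route information that way is exactly the open question here.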
How the hell would a model training run "defend against" this approach? What would that even mean?