Hacker News

NiloCK · today at 7:10 AM · 1 reply

Future model training runs will have a copy of this research, and will know to "defend against it".

E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads the way these do, but in fact encodes some more closely held beliefs?


Replies

elil17 · today at 7:52 AM

How the hell would a model training run "defend against" this approach? What would that even mean?