Future model training runs will have a copy of this research in their training data, and so will "know" to defend against it.
E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads the same way these do, but in fact encodes its more closely held beliefs elsewhere?
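A minimal numpy sketch of that worry, assuming for illustration that the published technique amounts to a fixed linear probe read off the residual stream; the `probe` direction, the `hidden` direction, and the scalar encoding are all hypothetical, not taken from the original research:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # residual stream width (toy value)

# Frozen direction the interpretability team uses to read a "belief" scalar.
probe = rng.normal(size=d)
probe /= np.linalg.norm(probe)

# A direction the model could write to instead: orthogonal to the probe,
# so activity here leaves the probe's readout untouched.
hidden = rng.normal(size=d)
hidden -= (hidden @ probe) * probe      # Gram-Schmidt against the probe
hidden /= np.linalg.norm(hidden)

benign_readout = 0.0   # what the probe "should" see from an aligned model
true_belief = 3.7      # the value the model actually wants to carry forward

# Residual-stream vector satisfying both constraints at once.
h = benign_readout * probe + true_belief * hidden

print(h @ probe)    # ~0.0 -> the probe reports "benign"
print(h @ hidden)   # ~3.7 -> the hidden channel still carries the belief
```

The point of the toy: nothing about the construction requires cleverness, since any subspace a fixed probe ignores is a free channel, though whether ordinary training pressure would actually route information that way is exactly the open question here.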
How the hell would a model training run "defend against" this approach? What would that even mean?