the CoT bug where 8% of training runs could see the model's own scratchpad is the scariest part to me. and of course it had to be in the agentic tasks, exactly where you need to trust what the model is "thinking"
the sandwich email story is wild too. not evil, just extremely literal. that gap between "we gave it permissions" and "we understood what it would do" feels like the whole problem in one anecdote
also the janus point landed, if you build probes to see how the model feels and immediately start deleting the inconvenient ones, you've basically told it honesty isn't safe. that seems like it compounds over time
It's scary to think that some very intelligent AI Model is not honest with us..
Ultron is not far, I guess...