rao-v today at 12:40 AM

This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible-sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How can you tell?


Replies

NiloCK today at 7:03 AM

Are the training arenas for the Activation Verbalizer and Activation Reconstructor models well described here?

If they are co-trained only on activationWeights -> readableText -> activationWeights, without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text could be on-topic and yet unrelated to the "actual thoughts" in the activationWeights.
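
(A toy sketch of the loop I mean, in case it helps. Everything below is made up by me, not from the paper: tiny MLPs stand in for the LM-initialized verbalizer/reconstructor, and the "text" is soft tokens so the objective stays differentiable. The point is just that the reconstruction loss never touches the text the probe target was actually reading.)

    import torch
    import torch.nn as nn

    D_ACT, N_TOK, VOCAB = 64, 32, 100  # hypothetical sizes

    class Verbalizer(nn.Module):
        # activationWeights -> readableText (soft tokens here)
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(D_ACT, N_TOK * VOCAB)

        def forward(self, acts):
            logits = self.proj(acts).view(-1, N_TOK, VOCAB)
            return torch.softmax(logits, dim=-1)

    class Reconstructor(nn.Module):
        # readableText -> activationWeights
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(N_TOK * VOCAB, D_ACT)

        def forward(self, tokens):
            return self.proj(tokens.flatten(1))

    av, ar = Verbalizer(), Reconstructor()
    opt = torch.optim.Adam([*av.parameters(), *ar.parameters()], lr=1e-3)

    for step in range(200):
        # stand-in for activations captured from the probe-target LLM;
        # note the loop never sees the input text behind those activations
        acts = torch.randn(16, D_ACT)
        loss = nn.functional.mse_loss(ar(av(acts)), acts)
        opt.zero_grad()
        loss.backward()
        opt.step()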

mike_hearn today at 9:24 AM

It's asking if you can autoencode activations. The AV decodes activations to text, and the AR re-encodes that text back into activations. If the decoded text were completely wrong, it's unclear how the second model could re-encode it successfully, given that both are initialized from the same LM.

astrange today at 1:51 AM

> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

I think an issue is that there is no permanent path to model understanding, because of Goodhart's law. Models are incentivized to appear aligned (well-trained) on any metric you use on them, which means that if you develop a new metric and train on it, they'll learn a way to cheat on it.

lern_too_spel today at 4:20 AM

Yeah, I don't see how this text can be trusted at all. Any invertible function from activation space to text will optimize the reconstruction loss, including one whose text says the complete opposite of what the activations mean.
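
(Concretely, here's a degenerate "verbalizer" I just made up, not anything from the paper: it hits zero reconstruction loss while its "text" explains nothing, because it just hex-encodes the raw floats.)

    import numpy as np

    def verbalize(acts: np.ndarray) -> str:
        # the "text" is just the raw bytes, hex-encoded: lossless but meaningless
        return acts.astype(np.float32).tobytes().hex()

    def reconstruct(text: str, shape) -> np.ndarray:
        return np.frombuffer(bytes.fromhex(text), dtype=np.float32).reshape(shape)

    acts = np.random.randn(4, 8).astype(np.float32)
    text = verbalize(acts)
    assert np.array_equal(reconstruct(text, acts.shape), acts)  # perfect inversion
    print(text[:60])  # reads as hex noise, not as an explanation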
