Are the training arenas for the Activation Verbalizer and Activation Reconstructor models well described here?
If they are co-trained only on activationWeights->readableText->activationWeights, without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can both be on-topic and also unrelated to the "actual thoughts" in the activationWeights.
The verbalizer and reconstructor models are both initially fine-tuned on LLM output from a summarization prompt. The resulting text is not completely unrelated, but mostly wrong: https://transformer-circuits.pub/2026/nla/png/img_18fcfc16e9... The reconstructed activations are also far from matching the verbalizer's input. It's not unusual in machine learning to have results that are shit and SOTA at the same time, simply because no other technique works better.
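To make the round trip being discussed concrete, here is a minimal toy sketch of activation -> text -> activation and of how one might measure the mismatch between the input and the reconstruction. Everything in it is hypothetical: `VOCAB`, `verbalize`, `reconstruct`, and the scoring functions are stand-ins invented for illustration, not the actual models or training objective, which this thread does not spell out. The only point is that text acts as a lossy bottleneck, so the reconstruction can be close without being exact.

```python
import math

# Hypothetical toy vocabulary: one word per activation dimension.
VOCAB = ["cat", "math", "code", "poem", "law", "food"]

def verbalize(activation, k=2):
    """Toy verbalizer: describe a vector by naming its k largest-magnitude dimensions."""
    top = sorted(range(len(activation)), key=lambda i: -abs(activation[i]))[:k]
    return " ".join(VOCAB[i] for i in sorted(top))

def reconstruct(text):
    """Toy reconstructor: rebuild a vector from the text alone (magnitudes and signs are lost)."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in VOCAB]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Assumed example activation; neither function ever sees any source text.
activation = [0.9, 0.1, 0.8, 0.0, 0.2, 0.1]
text = verbalize(activation)   # "cat code"
rebuilt = reconstruct(text)

print(text)
print("cosine:", round(cosine(activation, rebuilt), 3))
print("mse:", round(mse(activation, rebuilt), 4))
```

In this toy, the text stays tied to the activation because it is the only channel the reconstructor has, yet the reconstruction error is nonzero because the text discards the smaller components.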