
vessenes, today at 1:48 AM

I asked Gemini for a literature search and it came back with this:

References

Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv. https://doi.org/10.48550/arxiv.2507.21509 Cited by: 97

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024). Alignment faking in large language models. arXiv. https://doi.org/10.48550/arxiv.2412.14093 Cited by: 237

Templeton, A., Conerly, T., Marcus, J., Lindsay, J., Bricken, T., Chen, B., ... & Henighan, T. (2024). Mapping the Mind of a Large Language Model. Anthropic Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticit...

Gemini thinks it’s the Mapping the Mind paper, but I thought it was more recent than that. I think Mapping the Mind was the original activation circuits paper, and what I’m remembering is a follow-on paper with a toss-off comment that I noted. I didn’t keep track of it, though!