andy12_ • today at 9:53 AM • 1 reply

This seems really interesting. While Anthropic used dictionary learning over an existing model to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).

Replies

adebayoj • today at 10:00 AM

You are exactly right: it guides the model, during training, with concepts and the dictionary. This is important because post hoc dictionary learning for interpretability is not currently reliable: https://www.arxiv.org/abs/2602.14111
