Most interpretability methods fail for LLMs because they try to explain outputs without modeling the...

gormen • today at 5:09 AM • 1 reply

Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.

Replies

adebayoj • today at 7:56 AM

OP here. I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for those token(s)? In fact, we can also answer the question of what training data caused that output to be generated. We enforce this as a constraint in the architecture and in the loss function used to train the model. You can also get the high-level reasons for a model's answer on complex problems.

