7777777phil • today at 8:21 AM • 2 replies

If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.

Replies

yorwba • today at 9:14 AM

I doubt that a regulator would be satisfied by the kinds of explanations this provides and the interventions it enables.

Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, and the concept attribution shows that it output tokens related to temperature, industrial processes, and LLM tool-call syntax.

The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?

If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and, by design, highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.
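
To make the linear case concrete, here is a minimal sketch of the kind of guarantee meant above. Everything in it is a made-up illustration, not something from the linked post: the function names, the weights, and the sensor ranges are all hypothetical. For a controller of the form `u = bias + sum(w_i * x_i)` with each input confined to a known range, the extreme outputs occur at the range endpoints, so the full output range can be computed exactly and compared against a safe zone.

```python
def linear_output_bounds(weights, bias, input_ranges):
    """Exact min/max of bias + sum(w_i * x_i) when each x_i lies in [x_min, x_max].

    Because the controller is linear, each term is extremized at one of the
    endpoints of its input range, independently of the other terms.
    """
    lo = hi = bias
    for w, (x_min, x_max) in zip(weights, input_ranges):
        a, b = w * x_min, w * x_max
        lo += min(a, b)
        hi += max(a, b)
    return lo, hi


def is_provably_safe(weights, bias, input_ranges, safe_min, safe_max):
    """True if every admissible input yields an output inside [safe_min, safe_max]."""
    lo, hi = linear_output_bounds(weights, bias, input_ranges)
    return safe_min <= lo and hi <= safe_max


# Hypothetical two-sensor controller; all numbers are illustrative only.
weights = [0.5, -0.25]
bias = 10.0
input_ranges = [(0.0, 100.0), (0.0, 40.0)]

bounds = linear_output_bounds(weights, bias, input_ranges)
safe_wide = is_provably_safe(weights, bias, input_ranges, 0.0, 80.0)
safe_narrow = is_provably_safe(weights, bias, input_ranges, 0.0, 50.0)
```

Here the output range comes out as 0 to 60, so the controller is provably inside a 0 to 80 safe zone and provably able to leave a 0 to 50 one. No comparably cheap endpoint check exists for a highly nonlinear model with billions of parameters, which is the gap an attribution-based explanation would have to close.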

adebayoj • today at 8:33 AM

It does :) We constrained the model to do exactly this during training: https://www.guidelabs.ai/post/scaling-interpretable-models-8....
