Hacker News

SPascareli13 · today at 4:27 PM

If the model is trained to be an interpreter, does that mean the loss should reach 0 for it to be fully trained?

Also, if its execution is purely deterministic, you probably don't need non-linearity in the layers, right?


Replies

D-Machine · today at 7:29 PM

The model isn't trained, it isn't differentiable (read carefully to the end: they say their model might still work if they made it differentiable, but they don't know), and it isn't clear IMO that it could ever be made trainable (what loss function would score a "partially correct" program / compiler, and how would you get such training data?).

You need non-linearity in self-attention because it encodes feature / embedding similarities / correlations (e.g. self-attention is kernel smoothing) and/or multiplicative interactions; it has nothing to do with determinism or indeterminism. Also, LLMs are not really nondeterministic in any serious way: the nondeterminism all comes from tweaks and optimizations that are not at all core to the architecture.
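To make the kernel-smoothing point concrete, here is a minimal sketch (not from the thread; `self_attention` is a made-up helper, and the identity Q/K/V projections are a simplifying assumption): the softmax over query-key dot products makes the output a kernel-weighted average of the values, and those weights depend on the input itself, so the map is non-linear regardless of whether it is deterministic.

```python
import numpy as np

def self_attention(X, scale=None):
    """Single-head self-attention with identity Q/K/V projections,
    kept deliberately bare to expose the non-linearity."""
    d = X.shape[-1]
    scale = scale or np.sqrt(d)
    scores = X @ X.T / scale                        # pairwise similarities (the kernel)
    weights = np.exp(scores)                        # exponential kernel ...
    weights /= weights.sum(axis=-1, keepdims=True)  # ... normalized = row-wise softmax
    return weights @ X                              # kernel-smoothed average of the values

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))

# Non-linearity check: a linear map f would satisfy f(2X) == 2 f(X),
# but the input-dependent softmax weights break that.
print(np.allclose(self_attention(2 * X), 2 * self_attention(X)))  # False

# Determinism check: the same input always gives the same output.
print(np.allclose(self_attention(X), self_attention(X)))  # True
```

The second check illustrates the reply's last point: the computation itself is perfectly deterministic; non-linearity and determinism are orthogonal properties.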