So, what I'm trying to understand, and I can't find any clear information about it in the article, is how they "compiled" e.g. the Sudoku solver into a Transformer's weights. Did they do it manually? Say, they took the source of a hand-coded Sudoku solver and put it through their code-to-weights compiler, and thus compiled the code into Transformer weights? Or did they go the Good, Old-Fashioned, Deep Learning way and train their Transformer to learn a ("100% correct"!) Sudoku solver from examples? And if the latter, where are the details of the training? What did they train with? What did they train on? How did they train? etc etc.
Very light on details that article is.
The article states that they trained the Transformer to act as a WASM interpreter, with programs represented as WASM bytecode.
My interpretation is that they built a simple virtual machine directly into the weights, compiled a WASM runtime to that machine's instruction set, and then compiled the Sudoku solver to WASM to run on that runtime.
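To make the "interpreter as a step function" idea concrete, here's a minimal sketch of a toy stack machine. This is purely illustrative: the opcodes and state layout are invented, not the actual WASM encoding from the article. The point is that interpretation reduces to repeatedly applying a state-transition function (code, pc, stack) → (code, pc', stack'), which is the kind of bounded, deterministic update a Transformer layer could in principle encode in its weights.

```python
# Toy stack-machine interpreter (hypothetical opcodes, NOT real WASM).
# Illustrates interpretation as repeated application of a step function.

def step(state):
    """Apply one interpreter step: (code, pc, stack) -> (code, pc', stack')."""
    code, pc, stack = state
    op = code[pc]
    if op == "push":                       # push immediate operand
        stack = stack + [code[pc + 1]]
        pc += 2
    elif op == "add":                      # pop two values, push their sum
        stack = stack[:-2] + [stack[-2] + stack[-1]]
        pc += 1
    elif op == "halt":                     # jump past end to terminate
        pc = len(code)
    return (code, pc, stack)

def run(code):
    """Run the program to completion and return the final stack."""
    state = (code, 0, [])
    while state[1] < len(code):
        state = step(state)
    return state[2]

print(run(["push", 2, "push", 3, "add", "halt"]))  # -> [5]
```

Note that `step` is a pure function on a small, discrete state, so "learning the interpreter" amounts to learning this transition table rather than learning any particular program; the program itself stays data (bytecode), which matches the layering described above.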