Did you read the post you are responding to? It says:
> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"
There are no details about training, nor about the (almost certainly novel) loss function that would be needed to handle partial or imperfect outputs, so it is very hard to believe any gradient-based training procedure was used to set the weight values here.
> There are no details about training
My understanding is that they are not training at all, which would explain the absence of those details: they are compiling an interpreter down to a VM that has the shape of a transformer.
i.e. they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
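To make the "weights by construction, no training" idea concrete, here is a minimal sketch (my own illustration, not the method from the post): a single attention head whose Q/K matrices are computed by hand so that every position deterministically reads its left neighbour. The one-hot positional encodings and the large scale factor (to push softmax toward hard attention) are assumptions of this toy example.

```python
import numpy as np

# Hand-constructed attention head that copies each token's left
# neighbour (with wrap-around). The weights are *calculated*, never
# trained -- no loss function, no gradients.
n = 5                                   # sequence length == positional dim
P = np.eye(n)                           # one-hot positional encodings
tokens = np.array([3., 1., 4., 1., 5.])

# W_Q maps position i's encoding to the one-hot query "i-1", so the
# dot product with key j (identity W_K) is 1 exactly when j == i-1.
W_Q = np.roll(np.eye(n), 1, axis=0)     # row i -> one-hot(i-1)
W_K = np.eye(n)
scale = 50.0                            # sharpen softmax toward argmax

scores = scale * (P @ W_Q) @ (P @ W_K).T
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

out = attn @ tokens                     # each position reads its left neighbour
print(np.round(out, 2))                 # [5. 3. 1. 4. 1.]
```

A compiler targeting a "transformer-shaped VM" would presumably emit many such hand-derived heads, one per primitive operation of the machine being modelled.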