If you understood the article, please correct my understanding:
They created a new training dataset that also contains the computation solved step by step (e.g. multiplying two numbers or playing Sudoku), and then trained a transformer on it. As a result, the model performs the computation (multiplying two numbers) "inside" itself instead of calling a calculator (or Python)?
++ And they also figured out how to make attention faster?
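For concreteness, here is a rough sketch of what I imagine a "step-by-step" training example for multiplication might look like. The trace format is purely my guess, not something from the article:

```python
# Hypothetical sketch of a "computation solved step by step" training
# example for two-digit multiplication. The actual serialization used
# in the paper (if any) is an assumption here.
def make_trace(a: int, b: int) -> str:
    """Serialize a*b as an explicit digit-by-digit computation trace."""
    steps = []
    partials = []
    # One partial product per digit of b, scaled by its place value.
    for i, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * (10 ** i)
        partials.append(partial)
        steps.append(f"{a} * {digit}e{i} = {partial}")
    # Final step: sum the partial products to get the answer.
    steps.append(f"sum = {' + '.join(map(str, partials))} = {a * b}")
    return " ; ".join(steps)

print(make_trace(12, 34))
# → 12 * 4e0 = 48 ; 12 * 3e1 = 360 ; sum = 48 + 360 = 408
```

A model trained on strings like this would see every intermediate result, rather than just "12 * 34 = 408".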
I can't see anything about "training a transformer". I'm trying to understand whether, e.g., the Sudoku solver was learned from examples (in which case, what examples?) or whether it was manually coded and then "compiled" into weights.