> What I mean, is the LLM is able to represent things in space . That part I don't understand.
Why do you think this is mutually exclusive with "LLM predicts the next token"?
If you tell someone from the 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason bytes can't.
If you tell someone without a math background that sums of smaller and smaller sine waves can represent pretty much any signal, they might be really confused. But there is no reason those sums can't.
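To make the sine-wave point concrete, here is a rough sketch (my own illustration, not anything from the thread) that approximates a square wave with its Fourier series. The function names and the choice of a square wave as the target are assumptions picked for the example; the only math used is the standard series (4/π) Σ sin((2k+1)t)/(2k+1).

```python
import math

def square_wave(t):
    """Target signal: +1 where sin(t) is non-negative, -1 elsewhere."""
    return 1.0 if math.sin(t) >= 0 else -1.0

def fourier_partial_sum(t, n_terms):
    """Sum of the first n_terms odd harmonics, each one smaller than the last."""
    return (4 / math.pi) * sum(
        math.sin((2 * k + 1) * t) / (2 * k + 1) for k in range(n_terms)
    )

def mean_abs_error(n_terms, samples=1000):
    """Average gap between the square wave and its approximation over one period."""
    # Offset by half a step so no sample lands exactly on a jump.
    ts = [2 * math.pi * (i + 0.5) / samples for i in range(samples)]
    return sum(
        abs(square_wave(t) - fourier_partial_sum(t, n_terms)) for t in ts
    ) / samples

# More sine waves, smaller average error.
for n in (1, 10, 100):
    print(n, round(mean_abs_error(n), 4))
```

The printed error shrinks as more terms are added, even though every building block is a smooth sine and the target has sharp corners.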
There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.