HarHarVeryFunny • today at 3:01 PM

The model is just trying to map from a sequence to the next token. You could say that it doesn't really care about the relationships between words/tokens: it is just being trained to learn the attention (and other) weights that make this mapping as accurate as possible.

The model could just as well learn to predict the next token from gibberish text, as long as the gibberish had some statistical regularities to learn. However, if you train it on real, meaningful text, then the statistical regularities it needs to learn (and will, thanks to gradient descent and a capable architecture) are those reflecting "token relationships": grammar, semantics, etc.
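
To make that concrete, here is a minimal sketch. It is an assumption-laden stand-in: a count-based bigram predictor, not a transformer trained with gradient descent, and the function names and both toy corpora are made up for illustration. The point is only that the same learner picks up whatever regularities the data has, gibberish or not.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

# Hypothetical toy corpora: one gibberish, one English.
gibberish = "zog blip zog blip zog fnord zog blip".split()
english = "the cat sat on the mat and the cat slept".split()

gib_model = train_bigram(gibberish)
eng_model = train_bigram(english)

# The learner is identical in both cases; only the data differs.
print(predict_next(gib_model, "zog"))
print(predict_next(eng_model, "the"))
```

Nothing in `train_bigram` knows whether its input means anything. Any "grammar" it ends up reflecting comes entirely from the text it was given.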

So, you can say the "token relationships" (including word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities, whatever they may be.

You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of a word comes from how it is used. To a first approximation, this can be implemented by treating a word's meaning as defined by the words that appear near it(!), which is what the Word2Vec training algorithm does. Famous examples such as "(king - man) + woman ≈ queen" suggest that this does in fact capture a good deal of word meaning.
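
A rough sketch of the "meaning from neighbouring words" idea follows. Note the assumptions: this is not Word2Vec itself (which trains dense vectors with a shallow neural network), just raw co-occurrence counts over a window of one word on each side, and the four-sentence corpus and helper names are invented for the example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=1):
    """Represent each word by counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus, chosen so some words share contexts.
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chases the ball",
    "the cat chases the ball",
]

vecs = cooccurrence_vectors(corpus)
sim_king_queen = cosine(vecs["king"], vecs["queen"])
sim_king_dog = cosine(vecs["king"], vecs["dog"])

# Words used in the same contexts end up with more similar vectors.
print(sim_king_queen > sim_king_dog)
```

Here "king" and "queen" come out as near neighbours purely because they sit between the same words, without anything in the code knowing what either word refers to.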
