I don't like how most LLM explainer articles and videos say that an LLM essentially "predicts the next word".
I'm a developer but not very good at maths and I still don't understand any of it.
An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Recently I wanted a checkbox with a gradient flowing around its edge. It figured out that it could use a radial gradient from the center of the checkbox and overlay a small inner div on top, so that only the edge stays visible and the gradient looks like it is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean is, the LLM is able to represent things in space. That part I don't understand.
I also still don't understand the relationship between the chat-based LLM and the multi-modal stuff. I think I read somewhere that when an image is generated, it is also made of tokens?
Your casual understanding is imprecise.
At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.
It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.
> What I mean, is the LLM is able to represent things in space. That part I don't understand.
Why do you think this is mutually exclusive with "LLM predicts the next token"?
If you told someone from the 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason bytes can't.
If you told someone without a math background that sums of smaller and smaller sine waves can represent pretty much any signal, they might be really confused. But there is no reason those sums can't.
There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.
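The sine-wave point can be checked numerically. Below is a minimal sketch, assuming the textbook Fourier series of a square wave (odd harmonics with amplitudes 1/n); the function name `square_wave_partial_sum` is made up for this illustration. Adding more, smaller sine terms pulls the sum toward the flat value 1 that the ideal square wave has at x = π/2.

```python
import math


def square_wave_partial_sum(x: float, n_terms: int) -> float:
    """Sum the first n_terms odd harmonics of a square wave's Fourier series."""
    total = 0.0
    for k in range(n_terms):
        n = 2 * k + 1  # odd harmonics only: 1, 3, 5, ...
        total += math.sin(n * x) / n  # each term is a smaller sine wave
    return 4.0 / math.pi * total


x = math.pi / 2  # the ideal square wave equals 1 at this point
for n_terms in (1, 5, 50, 500):
    approx = square_wave_partial_sum(x, n_terms)
    print(f"{n_terms:4d} terms: {approx:.4f} (error {abs(approx - 1.0):.4f})")
```

Nothing in a single sine term "looks like" a flat plateau, yet the sum gets arbitrarily close to one. That is the same kind of gap as between "predicts the next token" and "produces a working checkbox".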
I understand that to be the "emergent abilities" people talk about. There are correlations in the dataset strong enough that the model seems to have an understanding, one you wouldn't obviously expect to get from simply "predicting the next word".
It's still predicting the next word. Somewhere in the gigantic dataset that the LLM was trained on, the phrase "gradient border" appears in the vicinity of CSS code that renders it. Therefore, when you run the model in an inference loop, there's a good chance it outputs that CSS code when you tell it to render a "gradient border".
Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned, unless you have a visual feedback loop in the coding harness.
I'm not dismissing the capability of "predicting the next word", however. The vast amount of training data enables the extremely complex and useful behavior you just described.
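To make "run it in an inference loop" concrete, here is a toy sketch. It is not how Gemini or any real LLM is implemented: a real model scores every token in its vocabulary with a neural network conditioned on the whole context, while this one only counts which word followed which in a tiny made-up corpus. The names `train_bigram` and `generate` and the corpus itself are hypothetical. Only the shape of the loop (predict one token, append it, repeat) carries over.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for every token, which tokens followed it in the training text."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: dict, prompt: str, max_new_tokens: int = 5) -> str:
    """Greedy inference loop: repeatedly append the most frequent follower."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # the last token was never followed by anything in training
        next_token = followers.most_common(1)[0][0]
        tokens.append(next_token)
    return " ".join(tokens)


# Hypothetical training data: "gradient border" sits near one particular CSS hint.
corpus = (
    "gradient border : use border-image with linear-gradient . "
    "gradient border : use border-image with linear-gradient . "
    "solid border : use border-color ."
)
counts = train_bigram(corpus)
print(generate(counts, "gradient border"))
```

With this corpus the loop continues the prompt with the CSS hint that most often followed it, one token at a time.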
LLMs fundamentally work by predicting the next word (token). But that should not be used to diminish their potential capabilities. It's like saying that human brains "just predict (or produce) the next electrical impulse". Fundamentally correct, but says nothing about the potential emergent capabilities of scaled-up systems that work like that.
Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.
>is the LLM is able to represent things in space
It is imitating the text written by humans who can represent things in space.
Sorry you're being downvoted for asking a very reasonable question. I don't think any of the replies here address your question either.
If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also on still images, video and sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but on pixels, the video part (probably) on video frames, etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either: you can find systems that captioned images before the rise of LLMs simply by training on examples of images paired with their textual descriptions.
In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.
Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.
It can’t. It’s like a Redditor, it just repeats what it has seen other people say.
It has read all of Stack Overflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.
I don't want to pretend I can explain LLMs, but the same "math" can be applied to visual and non-visual things. The dot product of two vectors, divided by the product of their lengths, gives you the cosine of the angle between them. This is true in 2 or 3 dimensions. But it's also true in 4, 5, 6...n dimensions, even though we cannot visualize a 4D space. That it's an angle is relevant for you in the space you can comprehend, but for math or a machine it works in any number of dimensions. So it doesn't need to understand anything visually if the math checks out.
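As a small illustration of that point, the sketch below computes the angle from the dot product with exactly the same code in 2 and in 5 dimensions. The helper name `angle_between` is made up for this example, and it assumes both vectors are non-zero and of equal length.

```python
import math


def angle_between(u, v) -> float:
    """Angle in radians between two non-zero vectors of the same dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to guard against floating-point drift just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


# Same function, different numbers of dimensions.
print(math.degrees(angle_between([1, 0], [0, 1])))                    # 2D
print(math.degrees(angle_between([1, 0, 0, 0, 0], [1, 1, 0, 0, 0])))  # 5D
```

The code never needs to "see" anything: the 5D case is no harder for it than the 2D one.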
LLMs are modelled to predict the next token, and are indeed trained to do so on enormous bodies of text. But to be really good at predicting the next token (word) at the end of a long string of text, you must understand what the text means. If I give you the entire text of a long novel and at the end ask you a single "yes/no" question about the plot, you only need to emit a single token, but emitting the correct one implies having understood the plot of the novel. This is what LLMs do. They're generating meaningful, coherent text, which implies understanding and cognition at a level that is much deeper than that of the single token they generate at each forward pass. Internally, the LLM has learned to represent the meaning of the entire prompt text, the concepts it implies and its possible continuations far beyond the horizon of simply outputting the next token.
I do agree bigly. Calling what is basically a superhuman brain inside a computer just a "token predictor" is peak thinkslop.
Predicting a word is the final objective, in that the output of the model is a probability distribution over the next token. However, choosing the right token is more complicated than just regurgitating the training data (you won't encounter an exact example in the training data, so you need to interpolate). This makes the model learn abstract representations of things, which it is able to manipulate before turning the result back into tokens. RL also complicates this because the "fitness" is now some arbitrary metric computed over an entire sequence of tokens.
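For what "the output is a probability distribution over the next token" looks like in practice, here is a minimal sketch. The three-word vocabulary and the scores (`logits`) are invented for illustration; a real model produces one score per entry of a vocabulary with many thousands of tokens. The softmax step and the sampling step are the standard way such scores are turned into a choice.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and scores, standing in for a model's final layer.
vocab = ["gradient", "border", "banana"]
logits = [2.0, 1.0, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:10s} {p:.3f}")

# Sample one next token according to the distribution (seeded for repeatability).
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print("sampled:", next_token)
```

All the interesting work happens upstream, in whatever computes the scores. The last step really is just picking one token from a distribution.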