I initially agreed with a lot of the sentiment that asks "why," but have reframed my opinion. Instead of seeing this as a way to run programs via inference, I'm now seeing this as a way to bootstrap training. Think about the task of classification. If I have an expert system that classifies correctly 80% of the time, now I can embed it into a model and train the model to try to raise the success rate. The lower we can make the cost of training on various tasks, the better it levels the playing field of who can compete in the AI landscape.
Why would that be desirable?
If we take the human brain as an example, it's pretty bad at computation. Multiplying two 10-digit numbers takes forever, despite the enormous size of its neural network. It's not the right tool for the job - a few deterministic logic gates could do that much more efficiently. That same circuit can't do much else, but multiplying, oh boy, it's good at that! Why do we think that artificial neural nets would be the right tool for that job? What's wrong with letting the LLM reach out to an ALU to do the calculation, just like a human would? It's surely going to be quicker and require less energy.
the paper is burying the lede here (i think?)
> The key technical unlock is to restrict lookup heads to head dimension 2, which enables a decoding path where the dominant retrieval/update operations can be computed in log time in the sequence length (for this structured executor regime), rather than by a full prefix-sized attention sweep.
edit: i understand how hullkv works now. very clever.
I don't understand why this strategy is applicable only to "code tokens"
lastly, i'm not sure why wasm is a good target. iirc wasm seems to be really inefficient (not so much in code size as in expressivity). i wonder if that curtails the llm's ability to plan higher-order stuff (since it's always forced to think in the small)
So, what I'm trying to understand, and can't find any clear information about in the article, is how they "compiled" e.g. the Sudoku solver into a Transformer's weights. Did they do it manually? Say, they took the source of a hand-coded Sudoku solver and put it through their code-to-weight compiler, and thus compiled the code into Transformer weights? Or did they go the Good, Old-Fashioned Deep Learning way and train their Transformer to learn a ("100% correct"!) Sudoku solver from examples? And, if the latter, where are the details of the training? What did they train with? What did they train on? How did they train? etc etc.
Very light on details that article is.
This seems a really interesting path for interpretability, especially if a big chunk of a model's behavior occurs pseudo-symbolically. This is an idea I had thought about, integrating tools into the main computation path of a model, but I never imagined that it could be done efficiently with just a vanilla transformer.
Truly, attention is all you need (I guess).
Early thoughts - this is very interesting and quite possibly revolutionary. If they have legitimately emulated a computer with memory reliably inside a transformer - that will open up an entirely new world for research.
I don’t want to say too much too soon, but I am pretty excited about this.
This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.
> This works, but the actual execution happened outside the model. The model specified the computation, then waited for an external system to carry it out.

> Our transformer also emits a program, but instead of pausing for an external tool, it executes that program itself, step by step, within the same transformer.
What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
Why is it good that it's "inside" the model? Just making it more elegant and nice? The tool was already "inside" the overall hybrid system. What's the actual problem?
I'd like to see this combined with reinforcement learning to optimize models to think computationally. Generating ideas with hypothetical results and then running them in the same thought. Their solution sounded like a lot of tokens though.
This seems like it has some potential, but is pretty much useless as it is.
Shame there are no weights released - let alone the "compiler" tool they used to actually synthesize computational primitives into model weights. It seems like a "small model" system that's amenable to low budget experiments, and I would love to see what this approach can be pushed towards.
I disagree with the core premise; it's basically the old neurosymbolic garbage restated. But embedding predefined computational primitives into LLMs could have some uses nonetheless.
Interesting... But why? What is the benefit, other than increasing our understanding of model architectures?
Our brains can also simulate Turing machines, slowly. We automated that with computers that are faster and more reliable. So why not let a model use much faster and more reliable external tools, just as we do?
> the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
IMHO the key point at which this technique has an unfair advantage vs a traditional interpreter is here.
How disruptive is it to have differentiability? To me it would mean that some tweaking-around can happen in an LLM-program at train-time; like changing a constant, or switching from a function call to another function. Can we gradient-descent effectively inside this huge space? How different is it from tool-calling from a pool of learned programs (think github but for LLM programs written in classic languages)?
It makes sense that a next token predictor could execute assembly code. This is fascinating work, especially with the memory implementation.
I really liked the article, but food for thought: is a transformer that offloads computation to Python really that different from Python code being read and then executed by an interpreter?
Both examples are of a system we created to abstract most of the hard work.
I think a more important concept here is that the term "AI" has a lot of built-in assumptions, one of which being that it is (or will be) super intelligent, and so folks like the author here think (correctly) that it's important for the AI to be actually doing the work itself.
If the model is trained to be an interpreter, does that mean the loss should reach 0 for it to be fully trained?
Also, if its execution is purely deterministic, you probably don't need non-linearity in the layers, right?
I couldn't tell from the article whether this works as a language model or not. Can it read and write English or is it just a weird program interpreter? If it switches between modes, how do they interact?
This is brilliant, game changing level.
Hey, also give it access to a dump of its weights and a way to propose updates, so it can see and tinker with its brain directly.
Is their convex hull attention mechanism new and generally usable? I mean, it substantially restricts the shape of the model, so it isn't a universal solution of course, but it does seem to overcome a pretty annoying limitation.
one of the most interesting pieces I've read recently. Not sure I agree with all the statements there (e.g. without execution the system has no comprehension) - but extremely cool
This has a lot of potential. Especially if the compiled "code" can be efficiently shared between models of the same architecture. That would easily overshadow LoRA and finetuning in general.
this is neat but to me seems like the circuitous path to just skipping autoregression, whereas the direct path is to just not do autoregression. get your answers from the one forward pass, and instead of backprop just do lookups and updates as the same operation.
if you understood the article, please correct my understanding -
they created a new training dataset that also contains step-by-step computation traces (multiplying two numbers or playing sudoku) and then trained a transformer on it - as a result, the model performs the computation (multiplying two numbers) "inside" itself instead of calling a calculator (or python)?
++ And they also figured out how to make attention faster?
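if my reading is right, the training examples would be explicit traces of the computation. here's a totally made-up sketch of what a trace generator could look like (the format is my invention, not the article's):

```python
# Hypothetical sketch: serialize long multiplication as a step-by-step
# text trace, the kind of sample a "computation inside the model"
# dataset might contain. Format and keywords are invented for illustration.
def multiplication_trace(a, b):
    lines = [f"COMPUTE {a} * {b}"]
    total = 0
    # one partial product per digit of b, least-significant first
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** place
        total += partial
        lines.append(f"  {a} * {digit} * 10^{place} = {partial}")
        lines.append(f"  running total = {total}")
    lines.append(f"RESULT {total}")
    return "\n".join(lines)

print(multiplication_trace(37, 24))  # ends with "RESULT 888"
```

a model trained on enough of these would see every intermediate step spelled out, rather than having to jump straight from question to answer.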
LLMs are not deterministic per my understanding. A program always produces the same output for the same input and instructions (ignore FP accuracy for now). How is determinism achieved here?
If this works, we might be able to have a special ISA for LLMs and forget about high-level computer languages for humans.
I am talking strictly about computing, not garbage in garbage out IO.
The Percepta stuff would seem to demonstrate a mechanism for implementing "thinking". I don't understand how foundation models implement "thinking", but my intuition is that models are specifically trained for matching on and following procedural patterns. A task in a given domain can be performed through an associated and encoded procedure. The model holds all the linkages, as weights, that allow a procedure to be conditionally and incrementally generated and performed. Does anyone have any insights about how LLM "thinking" is trained and coded?
Besides being a very interesting conceptual exercise, the animated figures in this article are absolutely stunning - best I’ve ever seen.
Is this genius? Or just a new binary executable format? Can't tell.
This sounds so cool but I can’t tell if it’s a practical joke, even after sitting on it for 2-3 hours. Key points where I lose understanding/trust are when a WASM interpreter suddenly appears in the model, and when we’re representing code in weights.
It is unclear to me how this WASM interpreter is / could be deterministic.
Is it possible to do the inverse, then? (Transforming weights back into code)
very cool idea. But the time savings don't hold for every tool call, and it's not clear to me yet whether this is batchable; also, intuitively, for most models that run on a GPU, you'd still want to offload the tool-execution part to the CPU since it's much cheaper...
This is really important work.
The original title is "Can LLMs be computers?"
But the right question is, should they?
This looks like a hack. Yes, being able to interpret webassembly is a general oracle. Still falls short of solving the real problem directly.
the big question is how efficient this is compared to executing assembly on a CPU
I love how this paper describes what actually happens and what the current tradeoffs are.
That having been said, many LLMs are being run on SIMD GPUs, in warps; basically they are just doing a lot of vector multiplications, activation functions and KV self-attention (the expensive step).
The issue is we want the LLMs to be one-way through the layers, whereas Turing-complete programming languages support loops with no well-defined stopping time. You can stick a simple computer into an LLM, but it won't be able to do long loops.
However, for these specific workloads, the need to attend only to the latest state is indeed a huge optimization! Gone is the need for n^2 complexity that dominates the cost, now it is (log n)^2 attention which is far smaller.
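A back-of-envelope comparison of the two costs from this comment, taking n = 1,000,000 tokens as an arbitrary example (these are asymptotic shapes, not measured numbers from the paper):

```python
# Rough cost comparison: full attention over a sequence of n tokens does
# on the order of n^2 key/value comparisons, while a (log n)^2 scheme
# does only ~(log2 n)^2. Illustrative arithmetic only.
import math

n = 1_000_000
full = n ** 2                    # ~1e12 comparisons
structured = math.log2(n) ** 2   # ~397 comparisons

print(full / structured)         # a reduction on the order of 1e9
```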
Very interesting read. Would love to learn more about incorporating deterministic calculations where it's normally non-deterministic.
ooh
what!
This seems way cooler than just computation (which is easy to hand off to a tool, and arguably more predictable that way). The broader point here is that you can have your model switch dynamically to/from a kind of attention that scales with the log of the token count, by only exploring the convex hull in a 2D space. A less capable version of attention, to be sure, but one capable of tracing a program’s execution with text representations of registers and stack - which is a meaningful level of flexibility, and one many humans would find difficult to do reliably!
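A toy sketch of why 2D keys help (this is my reading of the convex-hull idea, not the paper's actual algorithm): for any query q, the key maximizing the dot product <q, k> is always a vertex of the keys' convex hull, so retrieval only ever needs to consult the hull rather than all n keys:

```python
# Illustration: the argmax of <q, k> over a set of 2D keys is always
# attained on the convex hull of those keys. Hull code is Andrew's
# monotone chain; everything here is a toy, not the paper's method.
def convex_hull(points):
    """Return hull vertices of a set of 2D points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

keys = [(1, 1), (2, 5), (5, 2), (3, 3), (0, 4), (4, 0)]
query = (2, 3)
dot = lambda q, k: q[0]*k[0] + q[1]*k[1]

best_all = max(keys, key=lambda k: dot(query, k))      # scan all keys
hull = convex_hull(keys)
best_hull = max(hull, key=lambda k: dot(query, k))     # scan hull only
assert best_all == best_hull  # the winner is always a hull vertex
```

The sketch scans the hull linearly for clarity; on a convex polygon the extreme vertex in a given direction can be found by binary search, which is presumably where the log factor in the sequence length comes from.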
What could you do with an LLM that can go into “focus mode” and generate tokens extremely rapidly? How much more powerful would a reasoning-token-generation phase be that can explore and cull large numbers of paths/hypotheses, so long as they are well defined? Does this have implications for multi-modal models and spatial reasoning?
As the paper suggests:
> These models could be useful in several modes: as a dedicated fast path paired with a slower, more general model; as part of a fast/slow hybrid architecture inside a single system; or as a speculative execution model that proposes tokens quickly while a regular-attention model verifies and accepts them. Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.