Smallest transformer that can add two 10-digit numbers

223 points • by ks2048 • last Thursday at 6:29 PM • 90 comments

Comments

So, what happens when you test it on 11-digit numbers? I don’t mean that as a gotcha or “LOL dumb transformer” snark. More like, does the accuracy start to drop as you add digits? Or instead, maybe it’s the transformer equivalent of a stack overflow and it outputs a picture of a burning spoon or something?

And for that matter, what does it do with 9-digit numbers? Like, is it more accurate with them, or are these little guys mainly good at adding numbers with exactly 10 digits?

Basically, are the failure modes a gentle increase in inaccuracy, or spectacular failure outside their parameters?
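
One way to answer this empirically is to measure exact-match accuracy separately for each operand length. The sketch below is only an illustration: `reference_add` is a hypothetical stand-in for the model under test (here it just calls Python's `+`), and `accuracy_by_length` and `random_n_digit` are names made up for this example, not part of the challenge's code.

```python
import random

def reference_add(a: int, b: int) -> int:
    """Hypothetical stand-in for the model under test; swap in real inference."""
    return a + b

def random_n_digit(n: int, rng: random.Random) -> int:
    """Uniform integer with exactly n decimal digits (no leading zero)."""
    return rng.randint(10 ** (n - 1), 10 ** n - 1)

def accuracy_by_length(add_fn, lengths, trials=1000, seed=0):
    """Exact-match accuracy of add_fn for each operand length in `lengths`."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        correct = 0
        for _ in range(trials):
            a, b = random_n_digit(n, rng), random_n_digit(n, rng)
            if add_fn(a, b) == a + b:
                correct += 1
        results[n] = correct / trials
    return results

if __name__ == "__main__":
    # Sweep lengths on both sides of the 10-digit target.
    for n, acc in accuracy_by_length(reference_add, range(8, 13)).items():
        print(f"{n:2d} digits: {acc:.3f}")
```

A gentle degradation would show up as accuracy sliding down as the length moves away from 10; a hard failure would show up as a cliff at 11.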

alexlitz • today at 2:44 AM

I wrote a blog post about my submission (currently the top handwritten one at 36 parameters): https://alexlitzenberger.com/blog/building_a_minimal_transfo...

xg15 • today at 12:24 PM

> Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.

I think it would be interesting to see challenges where two networks are trained and evaluated on the exact same datasets and the architecture is the same except for the presence of self-attention layers in one network.

So far it seems to me that self-attention really brought new capabilities to a network - essentially, the ability to change the network's functionality in response to the input. It would be interesting to see if there are problems (i.e. datasets) that a "traditional" feedforward network fails to solve, but a transformer network of the same size can solve.

My guess would be: yes there are, and they are the kinds of "variable task" datasets that we see with LLMs, i.e. where part of the input indicates the task itself and part indicates the data for the task.

amelius • today at 12:33 AM

> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.

I wonder why they don't just write the code themselves, so by design the focus can be on the model.
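
The quoted rule can be illustrated with a toy example. This is a minimal sketch, not the challenge's inference code and not a transformer: `forward` is a hypothetical one-layer linear model, and `ADD_WEIGHTS` and `SUB_WEIGHTS` are made-up weight sets. The point is only that the task lives in the weights while the inference code stays the same.

```python
def forward(weights, x):
    """Generic inference: one linear layer, y = W x + b. Task-agnostic."""
    W, b = weights
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Two hypothetical weight sets that run through the exact same code path.
ADD_WEIGHTS = ([[1.0, 1.0]], [0.0])    # y = x0 + x1
SUB_WEIGHTS = ([[1.0, -1.0]], [0.0])   # y = x0 - x1

print(forward(ADD_WEIGHTS, [3.0, 4.0]))  # [7.0]
print(forward(SUB_WEIGHTS, [3.0, 4.0]))  # [-1.0]
```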

E-Reverance • today at 1:27 AM

Not sure how well this fits the rules, but I saw someone on Twitter claim 28 params: https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...

i000 • today at 2:11 AM

Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?

tgv • today at 12:29 PM

In the 90s, there were papers on emulating logical circuits with neurons. They would be bigger than this network, but at least always correct.
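
For readers unfamiliar with the idea, here is a minimal sketch of an adder built from threshold neurons. It is not taken from any specific paper: the two-unit full adder (a majority unit for the carry, plus a unit that subtracts twice the carry to get the sum bit) is a standard textbook construction, and the function names are made up for this example.

```python
def neuron(inputs, weights, threshold):
    """Threshold unit: outputs 1 when the weighted input sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def full_adder(a, b, carry_in):
    """One-bit full adder built from two threshold units."""
    # Carry is the majority of the three input bits.
    carry_out = neuron([a, b, carry_in], [1, 1, 1], 2)
    # Sum bit: the input count minus 2 * carry is 1 exactly when the count is odd.
    sum_bit = neuron([a, b, carry_in, carry_out], [1, 1, 1, -2], 1)
    return sum_bit, carry_out

def add(x, y, bits=34):
    """Ripple-carry addition of two non-negative integers below 2**bits."""
    carry, result = 0, 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << bits)

# 34 bits are enough for any 10-digit operand, since 10**10 < 2**34.
print(add(9999999999, 9999999999))  # 19999999998
```

With integer weights and thresholds there is no rounding involved, so the result is exact for every input in range.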

reerdna • today at 6:22 AM

I couldn't help but laugh out loud at the notion of a "held-out test set" for addition of 10-digit numbers.

delta_p_delta_x • today at 3:10 AM

Very cool, but can I suggest the `add` CPU instruction instead? Supports 64-bit numbers, and it's encoded in hardware, and no need to cross a PCIe interface into a beefy, power-hungry GPU and back again. And chances are it's cross-platform, because basically every ISA since the very first has had `add`.

medi8r • today at 12:59 AM

You can do that in a single matmul of course.

eps • today at 8:56 AM

Got excited that someone made one of those 120v humming coil beauties do the numbers... alas, it's just yet another NN project :-/

bmc7505 • today at 1:35 PM

Fast matrix multiplication would be a more useful benchmark: https://fmm.univ-lille.fr/

anthk • today at 9:15 PM

Now, under T3X and Lisp under 64k:

https://t3x.org/lisp64k/index.html

ks2048 • today at 1:30 AM

So, hand-coded weights can do it with 36 params and 311 for trained weights - did anyone try the former architecture, but starting with random weights and learning?

vicchenai • today at 4:59 AM

The leaderboard framing is clever - forces apples-to-apples comparison on a task where you can verify correctness deterministically. What I find interesting is the architectural constraints: 10-digit addition requires maintaining ~20 digits of working state across the carry chain, which is fundamentally sequential. The fact that tiny transformers can learn this at all (rather than just memorizing) suggests they are finding some form of positional carry representation in their attention patterns. Would love to see ablations on how attention head count vs depth trade off here - my intuition is that carry propagation needs depth more than width.
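
To make the carry chain concrete, here is the schoolbook algorithm written out in Python. It is a sketch of the reference computation only, under the assumption that operands are given as decimal strings; it makes no claim about what the submitted transformers actually learn internally, and `add_digits` is a name made up for this example.

```python
def add_digits(a: str, b: str) -> str:
    """Schoolbook addition on decimal strings, least significant digit first."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, out = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry   # each step depends on the previous carry
        out.append(str(total % 10))
        carry = total // 10
    if carry:
        out.append("1")                     # final carry adds an 11th digit
    return "".join(reversed(out))

# Worst case: a single carry ripples through all ten positions.
print(add_digits("9999999999", "0000000001"))  # 10000000000
```

The loop-carried dependency on `carry` is the sequential part: the top digit of the result cannot be known until every lower position has been resolved.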

prng2021 • today at 6:02 AM

How is anyone predicting timelines for AGI when these systems can’t do basic addition of 2 arbitrary numbers with 100% accuracy?

cantalopes • today at 6:02 AM

Interesting. Is this just a fun competition, or would it also have some practical applications, I wonder?

xyzsparetimexyz • today at 1:11 PM

The ai slop pixel art...

nextlevelwizard • today at 5:51 AM

Here: eval()

You are welcome

1over137 • today at 2:04 AM

Now wrap it all in an Electron app!

computersuck • today at 3:48 AM

this is the dumbest fking thing to do math with

MarcLore • today at 2:10 AM

The gap between 36 hand-coded params and 311 trained params is fascinating and honestly underappreciated. It mirrors something we see repeatedly in ML: gradient descent finds solutions in a fundamentally different region of parameter space than a human engineer would design.

When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.

I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.

utopiah • today at 6:55 AM

"it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." https://en.wikipedia.org/wiki/Law_of_the_instrument

Seems the castle of cards just isn't high enough. /s

munro • today at 2:10 AM

>=99% accuracy wtf?!?

I was initially excited, because it would reveal some sort of required local min capacity, until I saw that. The further revelation that this was all vibe coded, with no arXiv paper, makes me feel I should save my attn for another article.

Sophira • today at 2:51 AM

I get that this is technically interesting, for certain, but the sheer amount of energy and associated global warming risk needed to do something with >=99% accuracy that we've been able to do easily for decades with a guaranteed 100% accuracy seems to me to be wasteful to the extreme.
