This lines up with something I keep coming back to. Sara Hooker's research shows compact models now outperform massive predecessors on many tasks, and scaling laws only reliably predict pre-training loss, not downstream performance (1). A minimal transformer learning 10-digit addition is a neat data point for that thesis. I wrote about the broader implications (2).
The trillion-dollar scaling bet looks increasingly like it's hitting diminishing returns.
(1) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5877662
(2) https://philippdubach.com/posts/the-most-expensive-assumptio...
Looks like a tiny analytic transformer. An RNN is arguably a better choice if you're going to hand-wire an architecture to mechanically do addition. Learning is about discovering the patterns and algorithm from data; wiring a machine to follow a procedure defeats that purpose.
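To illustrate why an RNN is the natural fit here, digit-by-digit addition is exactly a recurrence: the carry is the hidden state passed from one step to the next. A minimal sketch (hypothetical, not code from the repo):

```python
def rnn_style_add(a_digits, b_digits):
    """Add two equal-length little-endian digit lists, RNN-style."""
    out, carry = [], 0  # carry plays the role of the recurrent hidden state
    for a, b in zip(a_digits, b_digits):
        total = a + b + carry
        out.append(total % 10)   # digit emitted at this timestep
        carry = total // 10      # hidden state passed to the next timestep
    if carry:
        out.append(carry)
    return out

# 47 + 85 = 132, written as little-endian digit lists
print(rnn_style_add([7, 4], [5, 8]))  # → [2, 3, 1]
```

A transformer has to recover this sequential carry chain through attention instead of getting it for free from recurrence, which is part of what makes the construction interesting.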
It proves that the algorithm is embeddable in a bigger transformer of ~similar architecture.
Related: https://news.ycombinator.com/item?id=36851494, discussion of https://www.evanmiller.org/attention-is-off-by-one.html (2023).
https://news.ycombinator.com/item?id=47170030
Smallest transformer that can add two 10-digit numbers (github.com/anadim)
230 points | by ks2048 | 1 day ago | 97 comments
I somewhat feel that using floating point arithmetic for what should be a symbol manipulation exercise is cheating. The deserialisation technique is interesting enough that I'm not really upset, though.
> The codex solution reversed the order which makes sense for making carry logic easy, but it is less clean.
That's the approach I'd have gone with. I've long been an advocate of little-endian numerical representations. That said, if there's a maximum number of digits, it's straightforward to implement the circuitry needed to calculate the most-significant digit of the result in one go; and I somehow doubt the AI-generated solution really took advantage of the tricks that little-endian allows.
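The fixed-width trick can be made concrete: with a known maximum width, the carry into position i is simply whether the low-order i digits of a + b overflow 10**i, so every output digit, including the most significant one, can be computed independently rather than by a sequential carry chain. A hedged sketch (my illustration, not the repo's circuit):

```python
def add_fixed_width(a, b, width=10):
    """Little-endian digits of a + b, each computed without a carry loop."""
    digits = []
    for i in range(width + 1):
        # carry into position i, read off directly from the low digits
        carry_in = (a % 10**i + b % 10**i) >= 10**i
        a_i = (a // 10**i) % 10
        b_i = (b // 10**i) % 10
        digits.append((a_i + b_i + carry_in) % 10)
    return digits

# 9,999,999,999 + 1 exercises a full carry propagation
print(add_fixed_width(9_999_999_999, 1))  # → [0]*10 + [1]
```

This is the same idea as carry-lookahead in hardware adders: each position's carry is a function of the inputs alone, at the cost of wider intermediate comparisons.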
> At some point I set claude code on some debugging. To my surprise, I don't recall it actually solving any of the bugs; it seemed much more concerned with "correcting" the funky things I was intentionally doing.
It baffles me that somebody capable of this kind of work would find this surprising. The process that allows LLMs to find bugs in code is the same process that entreats them to "correct" such creativity: their understanding of the world begins and ends at statistical plausibility, and they cannot truly comprehend things (though they can do a very good job of pretending, given sufficient training data).
> It baffles me that somebody capable of this kind of work would find this surprising.
I should be clear that I was not surprised that: 1) it struggled particularly hard with this sort of novel task; 2) it tried to "correct" funky things that I was doing; 3) it did not fix all my bugs. What surprised me was that, IIRC, it did not find even one of the bugs, which were relatively simple, and how persistently it got stuck on #2.
What's the difference between comprehending and understanding in this context?
"Understanding" is a metaphor, used to describe an upper bound on model capability without excess verbiage. "Comprehending" includes the ability to appropriately manipulate the concepts when they're taken out of their ordinary framing context, which in principle a transformer model should be able to mimic a lot better than the systems we have; but in practice the training processes we're using do not teach the models to do this.
symbol manipulation in transformers is fp arithmetic?
That's an implementation detail. The behaviour of trained transformer models remains similar even if you quantise them to 4-bit floats, or make every floating point operation noisy. This model only works if you use double-precision floating point.
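The fragility is easy to demonstrate: a construction that encodes information in differences of around 1e-10 survives in float64 but collapses when rounded to float32, the way quantisation would round it. A small illustration (assumed numbers for the sake of the example, not taken from the repo):

```python
import struct

def to_float32(x):
    # round a Python double to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

signal = 1.0 + 1e-10             # representable and distinct in float64
print(signal > 1.0)              # → True
print(to_float32(signal) > 1.0)  # → False: rounds to exactly 1.0
```

Trained models tolerate this kind of rounding because their computations are distributed and redundant; a hand-constructed analytic solution that packs meaning into the low-order bits of a double has no such slack.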