noosphr • today at 12:09 AM

Yes, and it works in theory.

Less so in practice. You saturate the memory of a B200 with a few dozen tokens once the attention order goes above 4. Training is even worse.
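A rough back-of-the-envelope sketch of why, not a measurement. It assumes a naive implementation that materializes the full order-k score tensor (n^k entries per head for n tokens), bf16 scores at 2 bytes each, and roughly 192 GB of HBM on a B200. The function names and the single-head default are made up for illustration; real models multiply this by heads, layers, and batch size.

```python
# Assumption: ~192 GB of HBM on a B200 (decimal gigabytes).
B200_BYTES = 192 * 10**9


def score_tensor_bytes(n_tokens, order, bytes_per_elem=2, heads=1):
    """Bytes needed to materialize an order-`order` attention score tensor.

    Assumes one score per tuple of `order` token positions, i.e. n**order
    entries per head. Ordinary attention is order 2 (an n x n matrix).
    """
    return heads * (n_tokens ** order) * bytes_per_elem


def max_tokens(order, budget_bytes=B200_BYTES, bytes_per_elem=2, heads=1):
    """Largest sequence length whose score tensor fits in the budget."""
    def fits(n):
        return score_tensor_bytes(n, order, bytes_per_elem, heads) <= budget_bytes

    if not fits(1):
        return 0
    # Grow the upper bound until it no longer fits, then binary search.
    lo, hi = 1, 2
    while fits(hi):
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


for k in range(2, 9):
    print(f"order {k}: at most {max_tokens(k)} tokens for a single head")
```

The limit shrinks quickly as the order grows, and adding heads or keeping activations around for the backward pass pushes it lower still.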

To paraphrase Knuth: high-order polynomials are far more unimaginably large than mere infinity.