In the 90s you didn't have norm layers, residuals, attention, and some more.
So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
I think the attention mechanism is so simple but so revolutionary that people forget it.
Like the best leaps in thinking, once it is made, is is immediately obvious and intuitive.
I think the attention mechanism is so simple but so revolutionary that people forget it.
Like the best leaps in thinking, once it is made, is is immediately obvious and intuitive.