Hacker News

visarga, yesterday at 5:32 PM

Luck. RNNs, Mamba, S4, etc. can do it just as well for a given budget of compute and data. The larger the model, the less the architecture makes a difference. It will learn in any of the 10,000 variations that have been tried, and come within about 10-15% of the best. What you need is a data loop, or a data source of exceptional quality and size; data has more leverage. Architecture games are more about efficiency: one method can be 10x more efficient than another.


Replies

0x3f, yesterday at 5:53 PM

That's not how I read the transformer stuff around the time it was coming out: they had concrete hypotheses that made sense, not just random attempts at striking it lucky. In other words, they called their shots in advance.

I'm not aware that we had notably different data sources before or after transformers, so what confounding event are you suggesting transformers 'lucked' into being contemporaneous with?

Also, why are we seeing diminishing returns if only the data matters? Are we running out of data?
