Hacker News

DeveloperErrata, today at 2:31 AM

Not quite. Most of the recent work on modern RNNs has been addressing this exact limitation. For instance, linear attention yields formulations that can be interpreted equivalently as either a parallel operation or a recurrent one. The trade-off is that these parallelizable RNNs are often "less expressive per param" than their old-school non-parallelizable counterparts, though you could argue that they make up for it in practice by being more powerful per unit of training compute, thanks to much better training efficiency.
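
To make the parallel/recurrent equivalence concrete, here's a minimal sketch. It assumes the simplest possible case: unnormalized causal linear attention with an identity feature map, which is a simplification of my own (real variants typically add feature maps, normalization, gating, or decay). The function names are made up for illustration, and it's plain Python so it runs without any dependencies.

```python
import random


def parallel_form(Q, K, V):
    """Causal linear attention written as O = tril(Q K^T) V.

    Each output row depends only on the inputs, never on another output row,
    so in a real implementation this becomes masked matrix multiplies that
    run over the whole sequence at once. Loops are used here only to avoid
    a dependency on an array library.
    """
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        row = [0.0] * d_v
        for s in range(t + 1):  # causal mask: only positions s <= t
            score = sum(q * k for q, k in zip(Q[t], K[s]))
            for j in range(d_v):
                row[j] += score * V[s][j]
        out.append(row)
    return out


def recurrent_form(Q, K, V):
    """The same computation as an RNN carrying a d_k x d_v matrix state S."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # State update is linear: S_t = S_{t-1} + k_t v_t^T (outer product).
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        # Readout: o_t = q_t^T S_t
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols list of lists with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two equal-shaped matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
    print(max_abs_diff(parallel_form(Q, K, V), recurrent_form(Q, K, V)))
```

The two functions agree up to floating-point rounding. The reason the recurrent version can be unrolled into the parallel one is that the state update is linear in S (no tanh or other nonlinearity wrapped around the previous state), and that same linearity is where the "less expressive per param" cost comes from.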