It's not just the architecture but also the data - the decoder only approach lets you train in ...

GardenLetter27 • today at 8:16 AM • 0 replies • view on HN

It's not just the architecture but also the data - the decoder only approach lets you train in parallel over blocks of text (no RNN serial waiting), that allows you train on much, much more data.

alt Hacker News