The attention residuals paper uses attention across layers for the same token, in addition to the usual case of attention across tokens within the same layer, but it doesn't do anything to address the "lost in too much context" problem. At least the number of layers is currently still low enough that there's probably no equivalent "lost in too many layers" problem yet.
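To make the distinction concrete, here is a minimal sketch (not the paper's actual method; all names and shapes are illustrative) of attending over one token's hidden states across layers, as opposed to the usual attention over tokens within a layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(layer_states, query):
    """Attend over ONE token's states from all previous layers.

    layer_states: (num_layers, d) hidden states of a single token, one per layer.
    query: (d,) the token's current query vector.
    Returns a weighted mix of that token's per-layer states.
    """
    d = layer_states.shape[-1]
    scores = layer_states @ query / np.sqrt(d)   # (num_layers,) one score per layer
    weights = softmax(scores)                    # attention distribution over layers, not tokens
    return weights @ layer_states                # (d,) mixed representation

rng = np.random.default_rng(0)
L, d = 6, 8   # small layer count, so a "lost in too many layers" effect seems unlikely
states = rng.normal(size=(L, d))
q = rng.normal(size=(d,))
out = cross_layer_attention(states, q)
print(out.shape)
```

The axis being softmaxed over is the key difference: here it is the layer axis for a fixed token, whereas standard self-attention softmaxes over the token axis within a fixed layer.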
Seems you are right; I have to re-read a few things.