The attention residuals paper uses attention across layers for the same token, in addition to the usual case of attention across tokens within the same layer, but it doesn't do anything to address the "lost in too much context" problem. At least the number of layers is currently still low enough that there's probably no equivalent "lost in too many layers" problem yet.
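To make the distinction concrete, here is a minimal sketch (not the paper's actual method; all names and shapes are illustrative) of attending over one token's hidden states across layers, as opposed to the usual attention over tokens within a layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(layer_states, query):
    """Attend over ONE token's states from all previous layers.

    layer_states: (num_layers, d) hidden states of a single token, one per layer.
    query: (d,) the token's current query vector.
    Returns a weighted mix of that token's per-layer states.
    """
    d = layer_states.shape[-1]
    scores = layer_states @ query / np.sqrt(d)   # (num_layers,) one score per layer
    weights = softmax(scores)                    # attention distribution over layers, not tokens
    return weights @ layer_states                # (d,) mixed representation

rng = np.random.default_rng(0)
L, d = 6, 8   # small layer count, so a "lost in too many layers" effect seems unlikely
states = rng.normal(size=(L, d))
q = rng.normal(size=(d,))
out = cross_layer_attention(states, q)
print(out.shape)
```

The axis being softmaxed over is the key difference: here it is the layer axis for a fixed token, whereas standard self-attention softmaxes over the token axis within a fixed layer.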
Seems you are right; I have to re-read a few things.