Isn’t the purpose of self-attention exactly to recognize the relevance of some tokens over others?
That may help with tokens being "ignored" while still being in the context window, but it does nothing about the cost and size limits of the context window in the first place.
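To make that concrete, here is a rough sketch of plain scaled dot-product attention in pure Python. The toy vectors and the `attention_weights` helper are made up for illustration and are not taken from any particular model or library. One token ends up with a near-zero weight, yet its score is still computed, so the work stays at n × n.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot-product scores; returns an n x n list."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # One score per (query, key) pair, even for keys that end up ignored.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 2-d token vectors: token 2 is nearly orthogonal to the others.
tokens = [[4.0, 0.0], [4.0, 0.5], [0.0, 0.1], [4.0, -0.5]]
w = attention_weights(tokens, tokens)

n = len(tokens)
score_entries = sum(len(row) for row in w)  # n * n entries were computed

print(f"weight of token 2 as seen from token 0: {w[0][2]:.2e}")
print(f"scores computed: {score_entries} for n = {n}")
```

Token 2 is effectively ignored from token 0's point of view, but the number of scores is still n squared, which is the part attention alone does not fix.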