logoalt Hacker News

mattalextoday at 5:19 PM0 repliesview on HN

Because the size of the attention matrix depends on the number of tokens (this is what makes attention N^2). If you don't care about having a flexible number of input tokens (e.g. in image processing) you can learn a fixed routing matrix. This is known as an MLP mixer https://arxiv.org/pdf/2105.01601 : you have one layer that processes each token in isolation ("vertical MLP") but ignores the inter-token connections, followed by a layer that combines between tokens ("horizontal MLP") that treats the internals of every token identically.