No FFN is blowing my mind. This is pretty much "Attention Is ACTUALLY All You Need". Reminds me of BERT Q&A which would return indices into the input context, but even that had a FFN. Really exciting work.
I guess this had always been bugging me. I get while you need activation/non-linearities, but do you really need the FFN in Transformers? People say that without it you can't do "knowledge/fact" lookups, but you still have the Value part of the attention, and if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that. Deleting the FFN is probably way worse in terms of scaling laws or storing information, but is it an actual architectural dead-end (in the way that deleting activation layer clearly would be since it'd collapse everythig to a linear function).
I guess this had always been bugging me. I get while you need activation/non-linearities, but do you really need the FFN in Transformers? People say that without it you can't do "knowledge/fact" lookups, but you still have the Value part of the attention, and if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that. Deleting the FFN is probably way worse in terms of scaling laws or storing information, but is it an actual architectural dead-end (in the way that deleting activation layer clearly would be since it'd collapse everythig to a linear function).