I used to think it was the quadratic complexity of attention, but I guess that's less of a concern now that there are more hardware-aware attention kernels?
The other one I remember is continual learning, but that may be solved in the near future.
I am not completely confident about it.