If it can be made orthogonal, can you go a step further and diagonalize it? The storage and performance improvement from that would be huge.
I don’t know AI, but weight matrices aren’t square in general, right? My first guess for something like this would be to take the SVD instead, since you can always do that, but I’m sure that’s been tried already.
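For anyone who wants to see the "you can always do that" part concretely, here is a minimal NumPy sketch. The 4x6 matrix `W` is a made-up stand-in for a weight matrix, not anything from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))  # hypothetical non-square weight matrix

# Thin SVD: U is 4x4, s holds the 4 singular values, Vt is 4x6.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Multiplying the factors back together recovers W up to rounding error.
W_rebuilt = U @ np.diag(s) @ Vt
print(U.shape, s.shape, Vt.shape)
print(np.allclose(W, W_rebuilt))
```

So the rectangular shape isn't a problem for SVD, unlike eigendecomposition, which needs a square matrix.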
You can take the output of the matrix LSTM, which is going to be a matrix for each token, and compute its SVD. To get better storage, we'd want U and V to be the same for all tokens, so that we can operate on just the diagonal S matrix. But the LSTM is likely highly nonlinear, so U and V will be vastly different for different tokens.
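A rough way to see the shared-U-and-V problem, as a sketch only: `A` and `B` below are random matrices standing in for the per-token outputs (an assumption, since real LSTM outputs are not random), and `off_diagonal_fraction` is a helper made up for this illustration.

```python
import numpy as np

def off_diagonal_fraction(M):
    """Fraction of the Frobenius norm of square M that lies off the diagonal."""
    off = M - np.diag(np.diag(M))
    return np.linalg.norm(off) / np.linalg.norm(M)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))  # stand-in for the matrix of one token
B = rng.standard_normal((5, 5))  # stand-in for the matrix of another token

# Compute the SVD basis from A only.
U, s, Vt = np.linalg.svd(A)

# In A's own basis, A becomes diagonal; B generally does not.
frac_A = off_diagonal_fraction(U.T @ A @ Vt.T)
frac_B = off_diagonal_fraction(U.T @ B @ Vt.T)
print(frac_A, frac_B)
```

`frac_A` comes out at rounding-error level, while `frac_B` stays large, so storing only a diagonal per token would throw away most of `B`. Whether real per-token matrices are this unrelated is exactly the open question.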