I had no idea that LLMs (or the transformer architecture) were within reach of complexity theory. But if transformers "can be" exponentially more succinct than RNNs, doesn't that mean we're approaching optimality?
> doesn't that mean we're approaching optimality?
No. Transformers are Markov chains [1]. Somewhere on this fascinating site [2] I read that stateful models have an advantage. The author gave an example: a state machine with two states, A and B. From state A it either stays in A (output 0) or moves to B (output 1), with equal probability. From state B it always moves to A and always outputs 1.
For this state machine, a single bit of memory is enough to make the optimal prediction (ones always come in pairs), whereas a Markov chain over the outputs can only approximate that prediction and never reaches optimality.
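Here's a rough Python sketch of that machine, in case it helps. The function names (`even_process`, `one_bit_predictor`, `window_predictor`) are mine, not from either link, and the window predictor is just an illustration of "only sees the last k outputs", assuming k >= 1 and no knowledge of the start state.

```python
import random

def even_process(n, seed=0):
    """Emit n symbols from the two-state machine described above."""
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)      # A -> A, output 0
            else:
                out.append(1)      # A -> B, output 1
                state = "B"
        else:
            out.append(1)          # B -> A, output is always 1
            state = "A"
    return out

def one_bit_predictor(history):
    """P(next = 1) using one bit: the parity of the current run of ones.
    Assumes the sequence started in state A."""
    in_b = False
    for s in history:
        in_b = (s == 1) and not in_b
    return 1.0 if in_b else 0.5

def window_predictor(history, k):
    """P(next = 1) from the last k symbols only (assumes k >= 1).
    Returns None when the window cannot determine the hidden state."""
    window = history[-k:]
    if 0 not in window:
        return None                # all ones: parity of the run is unknown
    trailing = 0
    for s in reversed(window):
        if s == 0:
            break
        trailing += 1
    return 1.0 if trailing % 2 == 1 else 0.5
```

On the history `[0, 1, 1, 1]`, the one-bit predictor knows the next symbol must be 1, while a window of length 3 sees only ones and cannot tell. Making the window longer just pushes the problem to longer runs of ones.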
[1] https://arxiv.org/abs/2410.02724
[2] https://bactra.org/
No. We already have an infinitely more succinct formalism: Turing machines. Succinctness is not necessarily a desirable property; it just tells you where something sits on the capability-tractability tradeoff. Turing machines can express literally anything computable, but in exchange we can't use computers to reason about them in general (Rice's theorem). Regexes are much more limited (famously, they can't even recognize HTML), but in return we get to prove things about them automatically.
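To make "automatically prove things" a bit more concrete, here is a minimal sketch of one such decision procedure: checking whether two DFAs accept the same language by walking their product automaton. Regexes can be compiled to DFAs, so the same check settles regex equivalence, though I've skipped the compilation step. The `(start, accepting, delta)` tuple encoding and the name `dfa_equivalent` are my own assumptions, and both DFAs are assumed to be complete (a transition for every state and symbol).

```python
from collections import deque

def dfa_equivalent(d1, d2, alphabet):
    """Decide whether two complete DFAs accept the same language.
    Each DFA is (start, accepting_set, delta), where
    delta[(state, symbol)] gives the next state."""
    (s1, acc1, t1), (s2, acc2, t2) = d1, d2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if (p in acc1) != (q in acc2):
            return False           # some string is accepted by only one DFA
        for a in alphabet:
            nxt = (t1[(p, a)], t2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True                    # every reachable pair agrees

# "Even number of 1s", once with 2 states and once counting mod 4.
even2 = (0, {0}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0})
even4 = (0, {0, 2}, {(i, c): (i + 1) % 4 if c == "1" else i
                     for i in range(4) for c in "01"})
# "Ends with 1", a different language.
ends1 = (0, {1}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 0, (1, "1"): 1})
```

The search always terminates because there are only finitely many state pairs. No analogous procedure can exist for arbitrary Turing machines, which is the tradeoff being described.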