I looked into the "GRAM" stuff that a sibling comment links to. A few points:
- this gets reinvented/rediscovered constantly under different names
- it can't be trained very well (right now, will change)
- massive theoretical improvements over current models (a sampled token carries only log_2(vocab size) ≈ 17 bits, while the residual stream has thousands of dimensions, so recursion means ~3 OoM more information bandwidth)
- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
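
A rough back-of-envelope for the bandwidth bullet, as a Python sketch. Every number here is an assumption for illustration, not taken from the GRAM material: a vocabulary of 2^17 tokens, a residual stream of 4096 dimensions, and 16 bits per value. It also assumes that "recursivity" means feeding the hidden state back in instead of a single sampled token. It compares raw capacity only, which is an upper bound and not usable information.

```python
import math

def bits_per_token(vocab_size: int) -> float:
    """Maximum information in one sampled token: log2 of the vocabulary size."""
    return math.log2(vocab_size)

def bits_per_hidden_state(d_model: int, bits_per_value: int) -> int:
    """Raw storage of one residual-stream vector (an upper bound, not effective information)."""
    return d_model * bits_per_value

# Assumed, illustrative values (none of these come from the thread).
vocab_size = 131_072   # 2**17, so log2 is exactly 17
d_model = 4096         # "thousands of dimensions"
bits_per_value = 16    # e.g. a 16-bit float per dimension

token_bits = bits_per_token(vocab_size)
state_bits = bits_per_hidden_state(d_model, bits_per_value)
ratio = state_bits / token_bits
orders_of_magnitude = math.log10(ratio)

print(f"token: {token_bits:.0f} bits, hidden state: {state_bits} bits")
print(f"ratio: {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

With these assumptions the ratio lands in the low thousands, which is where the "~3 OoM" figure comes from. Changing `d_model` or `bits_per_value` moves it proportionally.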
I follow this stuff closely, so I think I know what I'm talking about (edited for formatting)
Could you explain how/why GRAM cannot be interpreted or aligned the way current LLMs are? I'm not very familiar with how it works.
> - this gets reinvented/rediscovered constantly under different names
What are the different names? I haven't seen this before.
> - it can't be trained very well (right now, will change)
If you're sure it will change, why are you certain that it hasn't yet? And if it's a proven 5000x boost in reasoning... why aren't they exploring this path more aggressively?
> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.