
solid_fuel · today at 3:19 AM

It's not attention that's the problem; it's how we train networks offline with backprop.

LLMs are the most successful form of neural network we have, and that's because they are token prediction machines. Token predictors are easy to train because we're surrounded by written text: data already structured for next-token prediction is everywhere, free for the taking (especially if you ignore copyright law and robots.txt and crawl the entire web).
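To make the "free training data" point concrete, here's a minimal sketch of how any raw string becomes labeled next-token examples with zero annotation effort. The function name and character-level tokenization are illustrative choices, not anything from the comment:

```python
# Sketch: why raw text is "free" training data for next-token prediction.
# Any string yields (context, target) pairs with no human labeling.
# Character-level "tokens" are used here purely for illustration.

def next_token_pairs(text, context_len=4):
    """Slide a window over text; each position gives one training example."""
    pairs = []
    for i in range(context_len, len(text)):
        context = text[i - context_len:i]  # what the model sees
        target = text[i]                   # what it must predict
        pairs.append((context, target))
    return pairs

pairs = next_token_pairs("hello world")
print(pairs[0])    # ('hell', 'o')
print(len(pairs))  # 7
```

Every character of every crawled page yields one supervised example this way, which is exactly why token prediction scales so easily compared to tasks needing curated labels.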

We can't train an LLM to have a more complex internal thought loop because there's no way to synthesize or acquire training data for that internal process in a form you could run backprop against.

Even "chain of thought" models are reducing complex thoughts to simple token space as they iterate, and that's required because backprop only works when you can compute the error between the network's actual output and a desired output. It can't supervise anything more complicated or recursive than that.
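A small sketch of what that "error you can compute" looks like in the standard token-prediction setup: with a softmax output and cross-entropy loss, the gradient at the output layer is literally (predicted distribution minus one-hot desired token). This only exists because the desired output is a concrete token you can point at; there is no analogous target for an internal thought state. The function below is a toy illustration, not anyone's actual training code:

```python
import numpy as np

def output_delta(logits, target_index):
    """Gradient of cross-entropy loss w.r.t. logits: softmax(z) - onehot(t)."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
    delta = probs.copy()
    delta[target_index] -= 1.0           # subtract the one-hot target
    return delta

logits = np.array([2.0, 0.5, -1.0])      # toy 3-token vocabulary
delta = output_delta(logits, target_index=0)
print(delta)  # negative at the target token, positive elsewhere, sums to ~0
```

The gradient pushes probability mass toward the one known-correct token. If the thing being trained were a recursive internal loop with no ground-truth target per step, there'd be nothing to subtract, which is the comment's point.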