It 100% needs to be online. Imagine you're trying to think about a new tabletop puzzle, and every time a puzzle piece leaves your direct field of view, you no longer know about that puzzle piece.
You can try to keep all of the puzzle pieces within your direct field of view, but that divides your focus. You can hack around it and make your field of view incredibly large, but that can distort your sense of the relationships between things, their physical and cognitive magnitude. Bigger context isn't the answer; there's a fundamental structure and function missing from the overall architecture.
What you need is memory that works as you process and consume information, at the moment of consumption. If you meet a new person, you immediately memorize their face. If you enter a room, it's instantly learned and mapped in your mind. Without that, every time you blinked after meeting someone new, it'd be a total surprise to see what they looked like. You might never learn to recognize and remember faces at all. Or puzzle pieces. Or anything else where the lack of online learning denies you the value of persistent, instant integration into an existing world model.
You can identify problems like this for any modality, including text, audio, tactile feedback, and so on. You absolutely, 100% need online, continuous learning to deal with information effectively at a human level, across all the domains of competence that require generalizing out of distribution.
It's probably not the last problem that needs solving before AGI, but it is definitely one of them, and there might only be a handful left.
Mammals instantly, upon perceiving a novel environment, map it, without even having to consciously make the effort. Our brains operate in a continuous, plastic mode, at least for certain things. Not only that, this mode can be adapted to abstractions, and many of the automatic, reflexive functions that evolved to handle navigation let us simulate the future and predict risk and reward over multiple, arbitrary degrees of abstraction, sometimes in real time.
https://www.nobelprize.org/uploads/2018/06/may-britt-moser-l...
That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Model weights store abilities, not facts - generally.
Unless the fact is very widely used and widely known, with a ton of context around it.
The model can learn the day JFK died because there are millions of scattered examples of that information out in the world, but when you're working on a problem, you might have one concern to 'memorize'.
That's going to require something different from adjusting model weights as we understand them today.
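To make that concrete, here's a toy sketch (made-up model and data, not anyone's actual training setup) of what "memorize one item by adjusting weights" would look like: a few gradient steps on a single (prompt, answer) pair, with a check on how everything else drifts.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy stand-in for a model: maps a "prompt" vector to an "answer" vector.
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

    # The one new fact we want the weights to hold.
    fact_prompt, fact_answer = torch.randn(1, 16), torch.randn(1, 16)

    # Unrelated prompts, to watch what happens to everything else.
    other_prompts = torch.randn(8, 16)
    before = model(other_prompts).detach()

    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    # Hammer the single item in with repeated gradient steps.
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(fact_prompt), fact_answer).backward()
        opt.step()

    after = model(other_prompts).detach()
    print("fact loss:", loss_fn(model(fact_prompt), fact_answer).item())
    print("drift on unrelated inputs:", (after - before).pow(2).mean().item())

The single item gets 'memorized', but every other behaviour shifts too, and that interference compounds with each item you try to store this way; that's part of why per-item weight updates aren't a practical memory mechanism.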
LLMs are not mammals either; it's a helpful analogy in terms of 'what a human might find useful', but it doesn't necessarily apply to actual LLM architecture.
The fact is - we don't have memory sorted out architecturally - it's either 'context or weights' and that's that.
Also, critically: humans do not remember the details of a face. Not remotely. They're able to associate it with a person and a name 'if they see it again', but that's different from some kind of excellent recall. Ask someone to describe the features in detail and they often can't.
You can see that, in this instance, it may be related to a kind of 'soft lookup', i.e. associating an input with other bits of information that 'rise to the fore' as possibly useful.
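A rough sketch of that kind of 'soft lookup' (random toy data and an assumed embedding store, nothing from any real system): a new observation is matched against stored memories by similarity, and the closest associations rise to the fore as graded weights, with no detailed reconstruction of the face itself.

    import numpy as np

    rng = np.random.default_rng(0)

    # Memory store: embeddings of previously seen people, keyed to names.
    names = ["Alice", "Bob", "Carol"]
    memory = rng.normal(size=(3, 32))
    memory /= np.linalg.norm(memory, axis=1, keepdims=True)

    # A new observation: a noisy view of "Bob".
    query = memory[1] + 0.1 * rng.normal(size=32)
    query /= np.linalg.norm(query)

    # Soft lookup: cosine similarity turned into softmax weights over memories.
    scores = memory @ query
    weights = np.exp(5 * scores) / np.exp(5 * scores).sum()

    for name, w in zip(names, weights):
        print(f"{name}: {w:.2f}")

"Bob" dominates the weights: the association is recoverable even though no describable detail of the face is stored anywhere.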
But overall, yes, it's fair to take the position that we'll have to 'learn from context in some way'.