While 100 million tokens sounds like a lot, think about it for a bit and you'll see why it is basically nothing. Try to cram a human lifetime of sounds, smells, video and other sensory data into 100 million tokens. Heck, try to fit the plot of a single video series into that window. It just won't work, it won't scale, and it's laughable compared to contextual memory. I'm not saying that to belittle the authors of the paper, but the reality is that this has very little to do with transient long-term memory.
You don't remember a lifetime of smells. You don't have any memories from huge swaths of time. There are entire years of your life compressed down to vibes and a handful of events you largely misremember.
I think you underestimate just how much information ~100M words is. It's like a 300,000-page novel. That's a 50 foot (~15 meter) thick book.
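For what it's worth, here's the back-of-envelope math behind those figures. The words-per-page and paper-thickness numbers are assumptions I picked as typical values, not anything from the paper:

```python
# Rough sketch of the "300,000 pages / 50 ft book" claim (all figures assumed).
words = 100_000_000
words_per_page = 330           # assumed density of a typical printed page
pages = words / words_per_page
sheets = pages / 2             # each sheet of paper carries two pages
sheet_thickness_mm = 0.1       # assumed thickness of ordinary paper stock
thickness_m = sheets * sheet_thickness_mm / 1000
print(f"{pages:,.0f} pages, ~{thickness_m:.0f} m (~{thickness_m * 3.28:.0f} ft) thick")
# -> about 300,000 pages and on the order of 15 m / 50 ft
```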
Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.
You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at fitting all of English Wikipedia in a single state.
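A quick sanity check on that comparison. The Wikipedia word count and the tokens-per-word ratio below are rough public-estimate assumptions on my part, just to show the orders of magnitude line up:

```python
# Does ~10B tokens plausibly cover English Wikipedia? (assumed ballpark numbers)
base_capacity_tokens = 100_000_000            # the reference figure from this thread
scaled_capacity = base_capacity_tokens * 100  # "100x bigger" -> 10B tokens
wikipedia_words = 4_500_000_000               # rough estimate of English Wikipedia's word count
tokens_per_word = 1.3                         # common rule-of-thumb tokenizer ratio
wikipedia_tokens = wikipedia_words * tokens_per_word
print(f"capacity ~{scaled_capacity/1e9:.0f}B tokens vs Wikipedia ~{wikipedia_tokens/1e9:.1f}B tokens")
# -> ~10B vs ~5.9B, so it fits with room to spare under these assumptions
```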