Hacker News

in-silico yesterday at 8:38 PM

While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.

A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length) and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode roughly 100M tokens' worth of information (300M × 0.7 ≈ 210M bits, divided by 2.1 bits per token).
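For a concrete feel of the construction being referenced, here is a toy Hebbian associative matrix in Python (a minimal sketch; the dimensions, pair count, and bipolar coding are illustrative assumptions, not from the comment):

    import numpy as np

    # Toy Hebbian associative memory: store key/value pairs as summed
    # outer products, recall with one matrix multiply plus a threshold.
    # All sizes here are made-up illustration values.
    rng = np.random.default_rng(0)
    d, n_pairs = 64, 5
    keys = rng.choice([-1.0, 1.0], size=(n_pairs, d))
    values = rng.choice([-1.0, 1.0], size=(n_pairs, d))

    # Hebbian write: each pair adds outer(value, key) to the weight matrix.
    W = sum(np.outer(v, k) for k, v in zip(keys, values)) / d

    # Read: W @ key yields the stored value plus crosstalk from the other
    # pairs; well under capacity, sign-thresholding recovers it exactly.
    recalled = np.sign(W @ keys[0])
    print((recalled == values[0]).mean())  # ~1.0 this far under capacity

Capacity analyses of exactly this kind of outer-product memory are where per-parameter bit figures like the 0.7 above come from.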

Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.


Replies

RandomBK today at 4:40 AM

> context with 2.1 bits of entropy per token

Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that, sometimes full words; with multimodal input, even more. If KV cache embeddings are storing not just the raw tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
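For a rough sense of the gap being raised here, converting a per-letter estimate into a per-token figure (the characters-per-token average is an assumption for illustration, not from the thread):

    # Back-of-envelope conversion from per-letter to per-token entropy.
    # Both constants are assumptions for illustration.
    bits_per_letter = 1.5   # classic Shannon-style estimate for English text
    chars_per_token = 4.0   # rough BPE tokenizer average (assumption)
    print(bits_per_letter * chars_per_token)  # ~6 bits/token, ~3x the 2.1 figure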

usernametaken29 yesterday at 9:33 PM

While 100 million tokens sounds like a lot, think about it for a bit and you'll see it is basically nothing. Try to cram a human lifetime of sounds, smells, video, and other sensory data into 100 million tokens. Heck, try to fit the video of a single TV series into that window. It just won't work, it won't scale, and it is laughable compared to contextual memory. I'm not saying that to belittle the authors of the paper, but the reality is that this has very little to do with genuine long-term memory.
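To put numbers on that intuition, a crude scale comparison (every constant here is an illustrative assumption, not a measured figure):

    # Rough scale comparison; all constants are illustrative assumptions.
    seconds_per_year = 3.15e7
    video_bits_per_second = 1e6   # ~1 Mbps, heavily compressed video
    lifetime_bits = 80 * seconds_per_year * video_bits_per_second  # ~2.5e15
    window_bits = 100e6 * 2.1     # 100M tokens at 2.1 bits/token ~= 2.1e8
    print(lifetime_bits / window_bits)  # ~1e7: a ten-million-fold gap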
