> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter
The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound on the entropy) got down to 1.75 nats (about 2.5 bits). I took 2.1 because 2.5 is only an upper bound and I wanted the estimate to end up as a round number.
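The conversion behind that parenthetical is just a factor of ln 2 (the 1.75 figure is the one quoted above; a quick check in plain Python):

```python
import math

# Cross-entropy is reported in nats; 1 nat = 1/ln(2) ≈ 1.44 bits.
loss_nats = 1.75
loss_bits = loss_nats / math.log(2)
print(loss_bits)  # ~2.52, i.e. the ~2.5 bits mentioned above
```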
> If KV cache embeddings are storing not just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. As with the data processing inequality, applying a deterministic function can never increase entropy (H(f(X)) ≤ H(X)), so no entropy is actually added.
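As a toy illustration (a made-up four-token alphabet and an arbitrary many-to-one token-to-"concept" map, nothing to do with a real model), pushing a distribution through a deterministic function can only lose entropy:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log2(p)).sum())

p_x = [0.4, 0.3, 0.2, 0.1]                # distribution over 4 hypothetical tokens
f = {0: "A", 1: "A", 2: "B", 3: "C"}      # deterministic token -> "concept" map

p_fx = {}
for token, prob in enumerate(p_x):        # induced distribution of f(X)
    p_fx[f[token]] = p_fx.get(f[token], 0.0) + prob

print(entropy_bits(p_x))            # H(X)    ≈ 1.85 bits
print(entropy_bits(p_fx.values()))  # H(f(X)) ≈ 1.16 bits -- never exceeds H(X)
```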
Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
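A minimal sketch of that point, with a toy single-layer setup (the embedding and the W_k/W_v projections here are random stand-ins, not any particular model): given the same tokens, the key/value entries come out bit-identical, so storing the tokens alone is enough.

```python
import torch

torch.manual_seed(0)
vocab_size, d_model = 100, 16
embed = torch.nn.Embedding(vocab_size, d_model)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

def kv_for(token_ids):
    """Recompute the keys/values for a prefix purely from its token ids."""
    x = embed(token_ids)
    return W_k(x), W_v(x)

tokens = torch.tensor([5, 42, 7, 7, 13])
with torch.no_grad():
    k1, v1 = kv_for(tokens)
    k2, v2 = kv_for(tokens)  # recomputed later from the same tokens

print(torch.equal(k1, k2) and torch.equal(v1, v2))  # True: nothing beyond the tokens is needed
```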