> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter
The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound on the entropy) got down to 1.75 nats (about 2.5 bits). I took 2.1 because 2.5 is only an upper bound and I wanted the estimate to end up as a round number.
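The conversion behind that parenthetical is just a factor of ln 2 (the 1.75 figure is the one quoted above; a quick check in plain Python):

```python
import math

# Cross-entropy is reported in nats; 1 nat = 1/ln(2) ≈ 1.44 bits.
loss_nats = 1.75
loss_bits = loss_nats / math.log(2)
print(loss_bits)  # ~2.52, i.e. the ~2.5 bits mentioned above
```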
> If KV cache embeddings are storing not just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. As with the data processing inequality, applying a deterministic function can never increase entropy (H(f(X)) ≤ H(X)), so no entropy is actually added.
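As a toy illustration (a made-up four-token alphabet and an arbitrary many-to-one token-to-"concept" map, nothing to do with a real model), pushing a distribution through a deterministic function can only lose entropy:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log2(p)).sum())

p_x = [0.4, 0.3, 0.2, 0.1]                # distribution over 4 hypothetical tokens
f = {0: "A", 1: "A", 2: "B", 3: "C"}      # deterministic token -> "concept" map

p_fx = {}
for token, prob in enumerate(p_x):        # induced distribution of f(X)
    p_fx[f[token]] = p_fx.get(f[token], 0.0) + prob

print(entropy_bits(p_x))            # H(X)    ≈ 1.85 bits
print(entropy_bits(p_fx.values()))  # H(f(X)) ≈ 1.16 bits -- never exceeds H(X)
```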
Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
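A minimal sketch of that point, with a toy single-layer setup (the embedding and the W_k/W_v projections here are random stand-ins, not any particular model): given the same tokens, the key/value entries come out bit-identical, so storing the tokens alone is enough.

```python
import torch

torch.manual_seed(0)
vocab_size, d_model = 100, 16
embed = torch.nn.Embedding(vocab_size, d_model)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

def kv_for(token_ids):
    """Recompute the keys/values for a prefix purely from its token ids."""
    x = embed(token_ids)
    return W_k(x), W_v(x)

tokens = torch.tensor([5, 42, 7, 7, 13])
with torch.no_grad():
    k1, v1 = kv_for(tokens)
    k2, v2 = kv_for(tokens)  # recomputed later from the same tokens

print(torch.equal(k1, k2) and torch.equal(v1, v2))  # True: nothing beyond the tokens is needed
```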