>This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.
I'm not strong on transformer internals, but this is something that can be verified empirically, without needing to know how transformers work.
Use any LLM through an API. Send 1 input token and ask for 10k output tokens. Then send 1 input token (a different one, to avoid the cache) and ask for 20k output tokens. If the cost and the time to compute are exactly double, then my theory holds.
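Here is a rough sketch of that experiment, assuming an OpenAI-style chat completions endpoint; the model name, prompts, and token counts are placeholders, and in practice you'd want a prompt that keeps the model generating up to the limit.

```python
# Rough sketch of the timing experiment above. Model name, prompts, and
# token counts are placeholder assumptions, not a definitive benchmark.
import time
from openai import OpenAI

client = OpenAI()

def timed_generation(prompt: str, n_output_tokens: int):
    """Send a tiny prompt and ask for a fixed number of output tokens."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=n_output_tokens,
    )
    elapsed = time.perf_counter() - start
    return elapsed, resp.usage.completion_tokens

# Two different single-token prompts so the second call can't hit the prompt cache.
t10k, n10k = timed_generation("a", 10_000)
t20k, n20k = timed_generation("b", 20_000)

print(f"10k request: {t10k:.1f}s for {n10k} output tokens")
print(f"20k request: {t20k:.1f}s for {n20k} output tokens")
print(f"time ratio: {t20k / t10k:.2f} (2.00 would mean cost is purely linear in output length)")
```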
>No, they are not in practice. There are pure engineering considerations here. How do you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.
I was a bit loose with "virtually free", so here is a more precise statement: GPU compute is orders of magnitude more expensive than RAM, and the cost of caching inputs is tied to RAM, not to the GPU. To take the largest price component, capital: an H100 costs about $25k, while 1 GB of RAM costs about $10. The cost component of cached inputs is therefore negligible.
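To make that concrete, here is a back-of-the-envelope calculation. The model shape (layers, KV heads, head dimension) and the prices are illustrative assumptions, not measurements of any particular provider.

```python
# Back-of-the-envelope KV-cache cost; every number here is an assumption.
N_LAYERS = 80          # assumed transformer depth (70B-class model)
N_KV_HEADS = 8         # assumed KV heads (grouped-query attention)
HEAD_DIM = 128         # assumed head dimension
BYTES_PER_VALUE = 2    # fp16/bf16

# K and V, per layer, per token
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")

RAM_PRICE_PER_GB = 10.0   # $/GB, capital cost assumption
H100_PRICE = 25_000.0     # $, capital cost assumption

tokens_per_dollar_of_ram = (1e9 / kv_bytes_per_token) / RAM_PRICE_PER_GB
full_context_gb = 128_000 * kv_bytes_per_token / 1e9  # a 128k-token session

print(f"cached tokens per $1 of RAM: ~{tokens_per_dollar_of_ram:,.0f}")
print(f"RAM to hold a 128k-token cache: ~{full_context_gb:.0f} GB "
      f"(~${full_context_gb * RAM_PRICE_PER_GB:.0f} of RAM vs a ${H100_PRICE:,.0f} GPU)")
```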
>Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.
As I said, sure, it's not free, but these are negligible costs compared to the GPU capex. It's also interesting that the API provider charges the same whether the inference state stays cached for 1 ms, 5 minutes, or 1 hour. So clearly this is not optimally priced yet.
If cached inputs from API calls become your primary cost, it makes sense to move to an API that charges less for cached inputs (if you haven't already done that), then look into APIs that let you control when to cache and how long to hold it, and finally into renting GPUs and self-hosting an open-weights model.
To give a concrete example, suppose we are building a feature that stops upon hitting an ambiguous output token. The approach: generate one output token at a time, check the logprobs, continue if the probability of the top token is >90%, and halt otherwise (see the sketch below). If we generate 1M output tokens this way through an API, we pay for roughly 1M^2/2 ≈ 5×10^11 cached input tokens, since token i re-bills the i-1 tokens before it as input. If we self-host, the compute time is almost identical to just generating 1M output tokens in one go. Doing this through an API is almost pure profit for the provider; it's just not a use case that has been optimized for. We are in the early days of any kind of deeply technical parametrization: everyone is either prompting all the way down or hacking on models directly, with not much in between.
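A minimal sketch of that loop, assuming a legacy-completions-style endpoint that returns per-token logprobs; the model name, prompt handling, and 90% threshold are placeholders.

```python
# Minimal sketch of the one-token-at-a-time loop, assuming a
# legacy-completions-style endpoint with per-token logprobs.
# Model name and threshold are placeholder assumptions.
import math
from openai import OpenAI

client = OpenAI()

def generate_until_ambiguous(prompt: str, threshold: float = 0.9, max_steps: int = 1000) -> str:
    """Append one token at a time; halt when the top token's probability drops to or below threshold."""
    text = prompt
    for _ in range(max_steps):
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",  # placeholder: any completions model exposing logprobs
            prompt=text,
            max_tokens=1,
            logprobs=1,
            temperature=0,  # greedy, so the sampled token is the top token
        )
        choice = resp.choices[0]
        token = choice.text
        if not token:  # model emitted EOS
            break
        top_logprob = choice.logprobs.token_logprobs[0]
        if math.exp(top_logprob) <= threshold:
            break  # ambiguous token: halt here
        text += token
        # Note: every iteration re-sends the whole prefix, so the billed
        # (cached) input tokens grow roughly quadratically with output length.
    return text
```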