
TZubiri · yesterday at 9:02 PM

That cost is proportional to how long the cache is held. Currently the cache is not application-controlled; it works like a CPU cache.

If you hit the cache 1 ns after the entry was written, you get charged the same as if it had been held for 5 minutes or an hour.
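As a toy illustration of that flat pricing (all numbers below are made up, not any provider's actual rates): the charge for a cached read is a flat per-token rate, independent of how long the entry has been resident, even though the provider's holding cost grows with time.

```python
# Toy model of prompt-cache pricing. All rates are hypothetical.
CACHED_READ_PER_MTOK = 0.30   # flat price per million cached input tokens
FRESH_READ_PER_MTOK = 3.00    # price per million uncached input tokens

def read_cost(tokens: int, cached: bool, seconds_held: float) -> float:
    """Cost of reading a prompt prefix. seconds_held is deliberately
    unused: a hit 1 ns after the write is billed the same as a hit
    an hour later."""
    rate = CACHED_READ_PER_MTOK if cached else FRESH_READ_PER_MTOK
    return tokens / 1_000_000 * rate

# Same charge regardless of hold duration.
assert read_cost(100_000, cached=True, seconds_held=1e-9) == \
       read_cost(100_000, cached=True, seconds_held=3600)
```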

Also, with LLM APIs, I'm almost certain that the state (the KV cache) is offloaded to RAM and later reloaded into GPU memory. If you are renting a GPU yourself, you can keep that state resident in GPU memory. And if you only hold it for very short periods, as in my example of generating one output token at a time with programmatic logic in between, an API is currently prohibitively expensive and you must self-host.
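A minimal sketch of that self-hosted pattern, assuming a Hugging Face transformers model (the model name, prompt, and loop bounds are placeholders): the KV cache stays on the GPU between single-token steps, so each step only pays for one new token instead of re-prefilling the whole prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in for whatever model you self-host
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda").eval()

input_ids = tok("Once upon a time", return_tensors="pt").input_ids.to("cuda")
past = None  # KV cache; stays resident in GPU memory between calls

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values  # reuse the cache instead of re-prefilling
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        # ... arbitrary programmatic logic between tokens goes here ...

        if next_id.item() == tok.eos_token_id:
            break
        input_ids = next_id  # feed back only the single new token
```

With an API you'd pay (and wait) for the prompt prefix on every one-token call; here the prefill happens once and each later step is a single forward pass over one token.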