Hacker News

lostmsu · yesterday at 8:52 PM

> To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

This is wrong. Current models still use some full-attention layers AFAIK, and in those layers the per-token computational cost grows linearly with the token's position in the sequence.
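A minimal sketch of why this is the case, assuming plain causal full attention (constants are illustrative, not any real model's dimensions): at step t the new query attends to all t cached keys and values, so per-token FLOPs grow linearly with position, and the cumulative cost over a sequence is therefore quadratic.

```python
# Sketch: per-token cost of full (causal) attention grows linearly with
# position, so the cumulative cost over a sequence is quadratic.
# d_model and the FLOP constant are illustrative assumptions.

def attention_flops_per_token(position, d_model=4096):
    # At step t, the new query attends to all t cached keys/values:
    # roughly 2*t*d_model multiply-adds for the scores plus the same
    # again for the weighted sum over values.
    return 4 * position * d_model

def total_attention_flops(n_tokens, d_model=4096):
    return sum(attention_flops_per_token(t, d_model)
               for t in range(1, n_tokens + 1))

# Per-token cost is linear in position...
assert attention_flops_per_token(2000) == 2 * attention_flops_per_token(1000)
# ...so total cost over the sequence is ~quadratic in its length.
assert total_attention_flops(2000) / total_attention_flops(1000) > 3.9
```

So the millionth output token costs far more to generate than the first, even though the per-token cost curve is only linear.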


Replies

TZubiri · yesterday at 9:09 PM

I have seen exactly one model that charges more for longer contexts:

https://ai.google.dev/gemini-api/docs/pricing

Gemini 1M context window

That said, the cost increase isn't very significant: approximately 2x at the longer end of the context window.

This is in stark contrast with the quadratic scaling claimed by the article.
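To make the contrast concrete, here is a toy comparison (all numbers and the 200k threshold are made-up assumptions, not Gemini's actual rates) between a tiered price that doubles past some context threshold, as described above, and a hypothetical price that tracked quadratic total attention cost:

```python
# Toy comparison: a ~2x tiered price vs. a price that tracked quadratic
# total cost. The base rate and 200k threshold are illustrative only.

def tiered_cost(n_tokens, base=1.0, threshold=200_000):
    # Per-token price doubles past the threshold (the ~2x observation).
    cheap = min(n_tokens, threshold)
    expensive = max(n_tokens - threshold, 0)
    return base * cheap + 2 * base * expensive

def quadratic_cost(n_tokens, base=1.0, ref=200_000):
    # If price tracked total attention FLOPs, cost would grow ~n^2.
    # Normalized so both schemes agree at the threshold.
    return base * n_tokens ** 2 / ref

# At 1M tokens, the tiered scheme stays under 2x the flat rate...
assert tiered_cost(1_000_000) < 2 * 1_000_000
# ...while a quadratic scheme would already be 5x at the same point.
assert quadratic_cost(1_000_000) == 5 * 1_000_000
```

If providers were really passing through quadratic compute costs, long-context pricing would climb much faster than a one-time 2x step.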
