> To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.
This is wrong. Current models still use some full-attention layers AFAIK, and in those layers the compute needed for each new token grows linearly with that token's position in the sequence, which makes the total cost of a long generation quadratic.
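To make that concrete, here's a back-of-the-envelope sketch of decode-time attention FLOPs. The hidden size is made up and the constants are rough; only the growth rate matters:

```python
# Rough decode-time cost of one full (non-local) attention layer.
# D_MODEL is a hypothetical hidden size; only the scaling matters.

D_MODEL = 4096


def attention_flops_for_token(position: int, d_model: int = D_MODEL) -> int:
    """Approximate FLOPs to attend one new token over all cached tokens:
    one Q.K dot product plus one attention-weighted V sum per cached token,
    i.e. roughly 4 * d_model FLOPs each -- linear in the token's position."""
    return 4 * d_model * position


def total_generation_flops(n_tokens: int) -> int:
    """Summing a linearly growing per-token cost gives a quadratic total."""
    return sum(attention_flops_for_token(p) for p in range(1, n_tokens + 1))


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: last token ~{attention_flops_for_token(n):.1e} FLOPs,"
          f" total ~{total_generation_flops(n):.1e} FLOPs")
```

Going from 10k to 100k tokens makes the last token ~10x more expensive but the whole generation ~100x more expensive, which is the distinction the quoted claim glosses over.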
I have seen exactly one model that charges more for longer contexts: Gemini, with its 1M-token context window:
https://ai.google.dev/gemini-api/docs/pricing
That said, the price increase isn't very significant: approximately 2x at the longer end of the context window. That is in stark contrast with the quadratic growth the article claims.
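As a rough sanity check (treating 128k tokens as a hypothetical tier boundary, which is roughly where Gemini's pricing has stepped up), here's what quadratic or linear cost scaling would predict versus the observed ~2x price step:

```python
# Hypothetical sanity check: if price tracked compute, how much more should
# the long end of a 1M-token context cost than the short pricing tier?
SHORT_CTX = 128_000   # hypothetical pricing-tier boundary (tokens)
LONG_CTX = 1_000_000  # Gemini's advertised context window (tokens)

quadratic_ratio = (LONG_CTX / SHORT_CTX) ** 2  # if total cost ~ n^2
linear_ratio = LONG_CTX / SHORT_CTX            # if total cost ~ n
observed_ratio = 2.0                           # the ~2x price step above

print(f"quadratic predicts ~{quadratic_ratio:.0f}x, "
      f"linear ~{linear_ratio:.1f}x, observed ~{observed_ratio:.0f}x")
```

If those tiers are roughly right, even the linear prediction (~8x) overshoots the observed ~2x, let alone the quadratic one (~61x), so the pricing seems to track the underlying compute only loosely.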