logoalt Hacker News

zozbot234today at 5:21 AM1 replyview on HN

This effect is likely even larger when you consider that the raw cost per inferred token grows linearly with context, rather than being constant. So longer tasks performed with higher-context models will cost quadratically more. The computational cost also grows super-linearly with model parameter size: a 20B-active model is more than four times the cost of a 5B-active model.


Replies

tibbartoday at 5:28 AM

Doesn't context cacheing mostly eliminate this problem? (I suppose for enough context the 90% discount is eventually a lot anyway)

show 1 reply