Hacker News

datadrivenangel yesterday at 12:02 AM

In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and cost are what make it tough, though: it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.


Replies

ElectricalUnion today at 11:38 AM

You need the rest of the RAM for the context. If you don't want to end up with a toy context or a lossy quantized context, it's pretty easy to end up spending 50+ GB just for the KV cache, per simultaneous inference slot.
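A back-of-envelope sketch of where that 50+ GB comes from. The model dimensions below (62 layers, 16 KV heads, head dim 128, fp16 cache, 128k-token context) are illustrative assumptions, not any specific model's actual config; the formula itself is the standard KV-cache accounting for a transformer:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys AND values (factor of 2) are cached for every layer,
    # every KV head, every token in the context window.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative numbers, not a real model config:
size = kv_cache_bytes(layers=62, kv_heads=16, head_dim=128, seq_len=131072)
print(f"{size / 2**30:.1f} GiB")  # → 62.0 GiB per inference slot
```

Each concurrent request gets its own slot, so serving even a handful of simultaneous users at full context multiplies this figure accordingly.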

digitaltrees yesterday at 1:00 AM

I wonder if it really needs to be worse. I am playing with the idea of fine-tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training "taste" into a model rather than breadth.
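One way to start on that idea is to turn your own codebase into prompt/completion pairs for a supervised fine-tune. A minimal sketch, assuming a hypothetical JSONL format and prompt wording; real fine-tuning pipelines (LoRA and friends) expect pairs shaped roughly like this, but the exact schema depends on your training framework:

```python
import json
from pathlib import Path

def repo_to_examples(root, exts=(".py",)):
    """Collect source files under `root` as training examples.

    The prompt template below is an illustrative placeholder; a real
    "taste" dataset would use more varied, task-specific prompts.
    """
    examples = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            code = path.read_text(errors="ignore")
            examples.append({
                "prompt": f"Write {path.name} in this project's house style.",
                "completion": code,
            })
    return examples

if __name__ == "__main__":
    # Hypothetical output file consumed by a downstream fine-tuning job.
    with open("train.jsonl", "w") as f:
        for ex in repo_to_examples("."):
            f.write(json.dumps(ex) + "\n")
```

The bet the comment describes is that a few thousand examples like these, drawn from one consistent codebase, teach the model house conventions that a breadth-trained model has to be prompted into every time.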
