thesz • yesterday at 9:29 PM • 2 replies

Multiply "inference + backward pass (~2x inference cost) + activations (VRAM overhead)" by the batch size (thousands) to get the actual RAM and compute cost. An optimizer like Adam adds only two or three model-sized copies of overhead.

And last but not least: for inference you need only one hidden layer's activations kept in RAM, but to compute the gradient for even a single sample you need all of them (61 for DeepSeek models) kept in RAM.
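A rough back-of-the-envelope sketch of the two claims above, in Python. The layer count (61) comes from the comment; the token count, hidden size, and bytes per value are illustrative assumptions, and the model is deliberately crude: it counts one hidden-state tensor per layer and ignores attention intermediates, KV cache, and tricks such as activation checkpointing.

```python
def activation_bytes(layers_kept, tokens, hidden, bytes_per_value=2):
    """Bytes needed to hold one hidden-state tensor per kept layer."""
    return layers_kept * tokens * hidden * bytes_per_value


def model_sized_copies(optimizer_copies=2):
    """Weights + gradients + optimizer state, counted in model-sized copies.

    optimizer_copies=2 assumes Adam's two moment estimates; a third copy
    is sometimes kept as full-precision master weights (assumption).
    """
    return 1 + 1 + optimizer_copies


LAYERS = 61      # from the comment above
TOKENS = 4096    # hypothetical sequence length for one sample
HIDDEN = 7168    # hypothetical hidden size

inference = activation_bytes(1, TOKENS, HIDDEN)       # one layer at a time
training = activation_bytes(LAYERS, TOKENS, HIDDEN)   # all layers kept for backprop
ratio = training // inference

print(f"inference activations: {inference / 2**20:.0f} MiB")
print(f"training activations:  {training / 2**20:.0f} MiB ({ratio}x)")
print(f"model-sized copies while training with Adam: {model_sized_copies()}")
```

Under these assumptions the activation cost for one sample scales with the number of layers, while the optimizer overhead stays a small constant multiple of the model size.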

Replies

xyhopguy • yesterday at 11:30 PM

Microbatch size is a hyperparameter; it can be set to 1 and work just as effectively. With gradient accumulation it's even equivalent. Large batch sizes are used to increase parallelism, and sometimes to reduce variance in the loss signal (at the cost of increased bias).
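A minimal sketch of the gradient accumulation point, using a toy one-parameter model `y ≈ w * x` with mean squared error. The model, data, and function names are all made up for illustration; the idea is only that summing per-sample gradients over microbatches of size 1 and dividing by the total count gives the same gradient as one full batch, while holding one sample at a time.

```python
def grad_full_batch(w, xs, ys):
    """Gradient of mean squared error over the whole batch at once."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def grad_accumulated(w, xs, ys, micro=1):
    """Same gradient, accumulated over microbatches of size `micro`."""
    n = len(xs)
    total = 0.0
    for i in range(0, n, micro):
        mx, my = xs[i:i + micro], ys[i:i + micro]
        # Accumulate the sum (not the mean) of per-sample gradients.
        total += sum(2 * (w * x - y) * x for x, y in zip(mx, my))
    # Normalize once by the full batch size.
    return total / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

g_full = grad_full_batch(0.5, xs, ys)
g_acc = grad_accumulated(0.5, xs, ys, micro=1)
print(g_full, g_acc)
```

The two values agree up to floating-point rounding. This holds for losses that are a plain average over samples; layers that mix statistics across the batch (batch normalization, for example) would break the equivalence.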

Batch size is frequently limited by compute bottlenecks well before memory.

galaxyLogic • yesterday at 11:14 PM

Does the difference in the size of the inputs needed for inference vs. training matter?
