NitpickLawyer · yesterday at 9:01 AM

A good rule of thumb is to think of one parameter as one unit of storage. The "default" unit of storage these days is bf16 (i.e. 16 bits per weight), so an 80B model is ~160GB of weights. Then you have quantisation, usually to 8-bit or 4-bit, meaning each weight is "stored" in 8 or 4 bits. The same 80B model is then ~80GB in fp8 and ~40GB in fp4/int4.
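To make the arithmetic concrete, here's a back-of-the-envelope sketch in Python (decimal GB, weights only; real checkpoint files vary a bit):

    # Rule-of-thumb weight size: params * bits_per_weight / 8 bytes.
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for bits in (16, 8, 4):  # bf16, fp8, fp4/int4
        print(f"80B @ {bits}-bit: ~{weights_gb(80, bits):.0f} GB")
    # 80B @ 16-bit: ~160 GB
    # 80B @ 8-bit:  ~80 GB
    # 80B @ 4-bit:  ~40 GB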

But in practice you need a bit more than that: you also need some space for the context, the KV cache, potentially a model graph, etc.

So in practice you'll need roughly 20-50% more RAM than this rule of thumb suggests.
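Folding that overhead in is just a multiplier. A minimal sketch, where the 1.2-1.5 factor is the rule of thumb above rather than anything measured:

    # Practical RAM: weights plus ~20-50% for context/KV cache, buffers, etc.
    def practical_ram_gb(weights_gb, overhead=1.35):
        return weights_gb * overhead

    print(practical_ram_gb(40, 1.2))    # ~48 GB: 4-bit weights, light overhead
    print(practical_ram_gb(160, 1.25))  # ~200 GB: bf16 weights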

For this model, that works out to anywhere from ~50GB (4-bit, tight) to ~200GB (bf16, comfortable) of RAM. But it also depends on how you run it. With MoE models, you can selectively load some experts (parts of the model) into VRAM while offloading the rest to RAM. Or you could run it fully on CPU+RAM, since the active parameter count is low (~3B). That should work pretty well even on older systems (DDR4).
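And a hand-wavy sketch of the MoE split idea. The expert count and shared-weight fraction below are made-up placeholders, not this model's actual architecture:

    # Toy VRAM/RAM split for MoE offloading: keep the shared/attention
    # weights plus a few experts in VRAM, push the rest to system RAM.
    # All proportions below are illustrative assumptions.
    total_gb = 40.0                       # e.g. 80B model at 4-bit
    shared_frac = 0.15                    # assumed non-expert share
    n_experts, experts_in_vram = 64, 8    # assumed count/placement

    expert_gb = total_gb * (1 - shared_frac) / n_experts
    vram_gb = total_gb * shared_frac + experts_in_vram * expert_gb
    ram_gb = total_gb - vram_gb
    print(f"VRAM ~{vram_gb:.1f} GB, RAM ~{ram_gb:.1f} GB")
    # VRAM ~10.2 GB, RAM ~29.8 GB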


Replies

johntash · yesterday at 11:28 PM

Can you explain how context fits into this picture, by any chance? I sort of understand the VRAM requirement for the model itself, but it seems like larger context windows increase the RAM requirement by a lot more?

theanonymousone · yesterday at 10:17 AM

But RAM+VRAM can never be less than the size of the total (not just the active) model, right?
