
endymi0n · today at 9:47 AM

One. Trillion. Even on native int4 that’s… half a terabyte of VRAM?!

Technical awe aside at this marvel that cracks the 50th percentile on HLE, the snarky part of me says there’s only half the danger in giving away something nobody can run at home anyway…
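
As a rough sanity check of that number, here is a back-of-envelope sketch. It assumes a flat 4 bits per parameter and ignores KV cache, activations, and quantization metadata such as scales, all of which add overhead.

    # Weight memory for a 1-trillion-parameter model at 4-bit (back-of-envelope).
    params = 1_000_000_000_000        # 1 trillion parameters
    bytes_per_param = 4 / 8           # int4 = 0.5 bytes per parameter

    weight_bytes = params * bytes_per_param
    print(f"{weight_bytes / 1e9:.0f} GB")   # ~500 GB, i.e. roughly half a terabyte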


Replies

johndough · today at 11:58 AM

The model absolutely can be run at home. There is even a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/

The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).

The next step up is an old server with lots of RAM and many memory channels, maybe with a GPU thrown in for faster prompt processing (low double-digit tokens/second).

At the high end, there are servers with multiple high-VRAM GPUs, or several chained Macs or Strix Halo mini PCs.

The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16 GB at 4 bits per parameter. That leaves only the question of how to get those 16 GB to the processor as fast as possible.
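
A rough way to see where those hardware tiers land is to treat token generation as memory-bandwidth bound: every new token has to pull the ~16 GB of active-expert weights past the processor. The sketch below uses ballpark bandwidth figures (assumptions, not benchmarks) and ignores KV cache traffic, experts that stay resident in cache, and compute-bound prompt processing.

    # Rough tokens/second for memory-bandwidth-bound MoE decoding.
    # Assumption: each generated token streams all active-expert weights once
    # (32B active parameters at 4 bits/parameter, about 16 GB per token).
    ACTIVE_PARAMS = 32e9
    BYTES_PER_PARAM = 0.5
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM   # ~16 GB

    # Ballpark sustained read bandwidths in GB/s (assumed, not measured).
    tiers = {
        "fast NVMe SSD":                     7,
        "many-channel DDR4/DDR5 server":   200,
        "Mac / Strix Halo unified memory": 400,
        "multi-GPU HBM server":           3000,
    }

    for name, gb_per_s in tiers.items():
        tok_per_s = gb_per_s * 1e9 / bytes_per_token
        print(f"{name:32s} ~{tok_per_s:7.2f} tokens/s")

The output lines up roughly with the tiers above: under one token per second when streaming from SSD, low double digits from a many-channel RAM server, and well over a hundred from GPU-class memory.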

mrinterweb · today at 6:23 PM

VRAM is the new moat, and controlling pricing and access to VRAM is part of it. Very few hobbyists will be able to run models of this size. I appreciate the spirit of making the weights open, but realistically, running it locally is impractical for >99.999% of users.

wongarsu · today at 11:34 AM

Which conveniently fits on one 8xH100 machine, with 100-200 GB left over for overhead, KV cache, etc.
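
Quick fit check, assuming 80 GB per H100 and roughly 500 GB of 4-bit weights (ignoring parallelism overhead and activation memory):

    # Does ~500 GB of 4-bit weights fit on an 8x H100 (80 GB) node?
    gpus, vram_per_gpu_gb = 8, 80
    weights_gb = 1e12 * 0.5 / 1e9      # ~500 GB at 4 bits per parameter

    total_vram_gb = gpus * vram_per_gpu_gb       # 640 GB
    headroom_gb = total_vram_gb - weights_gb     # ~140 GB for KV cache, overhead
    print(total_vram_gb, headroom_gb)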

the_sleaze_ · today at 5:50 PM

$3,998.99 for 500 GB of RAM on Amazon.

"Good Luck" - Kimi <Taken voice>

Davidzheng · today at 11:19 AM

That's what intelligence takes. Most of intelligence is just compute.