The model absolutely can be run at home. There's even a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/
The cheapest way is to stream the weights from a fast SSD, but it will be quite slow (one token every few seconds).
The next step up is an old server with lots of RAM and many memory channels, maybe with a GPU thrown in for faster prompt processing (low double-digit tokens/second).
At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
The key enabler here is that these models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16GB at 4 bits per parameter. That only leaves the question of how to get those 16GB to the processor as fast as possible.
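As a rough back-of-envelope: decode speed is approximately memory bandwidth divided by the bytes that have to be read per token. A minimal sketch, with ballpark bandwidth figures assumed purely for illustration (your hardware will differ):

```python
# Back-of-envelope decode speed for a MoE model with ~32B active parameters:
# each new token needs the active weights (~16 GB at 4 bits/param) streamed to
# the compute units, so throughput ~= bandwidth / bytes_per_token.
active_params = 32e9
bits_per_param = 4
bytes_per_token = active_params * bits_per_param / 8   # ~16 GB

# Illustrative, order-of-magnitude bandwidth figures (GB/s), not measurements.
bandwidth_gbps = {
    "NVMe SSD (streamed)":    7,     # fast sequential read
    "8-channel DDR4 server":  200,   # aggregate across channels
    "Apple M-series Ultra":   800,   # unified memory
    "High-end GPU (HBM)":     3000,
}

for device, gbps in bandwidth_gbps.items():
    tokens_per_s = gbps * 1e9 / bytes_per_token
    print(f"{device:24s} ~{tokens_per_s:6.1f} tokens/s")
```

Which lines up with the tiers above: sub-1 token/s from SSD, low double digits from a many-channel server, and much more once the active weights fit in VRAM or fast unified memory.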
It's often pointed out in the first sentence of a comment that a model can be run at home, and then (maybe) towards the end of the comment it's mentioned that it's quantized.
Back when 4K movies needed expensive hardware, no one was saying they could play 4K on a home system and then mentioning later that they'd actually scaled down the resolution to make it possible.
The degree of quality loss is not often characterized, which makes sense: it's not easy to fully quantify quality loss with a few simple benchmarks.
By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?
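For what it's worth, the number people usually reach for is perplexity on held-out text (llama.cpp ships a perplexity tool for exactly this). A minimal sketch of the math, with made-up per-token log-probabilities standing in for real model output:

```python
# Sketch: compare perplexity of a full-precision and a quantized model on the
# same text. Perplexity = exp(-mean log-probability of the observed tokens).
# The log-prob values below are hypothetical; real ones come from your runtime.
import math

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

fp16_logprobs = [-1.9, -2.1, -1.7, -2.3, -2.0]   # hypothetical
q4_logprobs   = [-2.0, -2.3, -1.8, -2.5, -2.1]   # hypothetical

print("fp16:", round(perplexity(fp16_logprobs), 2))
print("q4:  ", round(perplexity(q4_logprobs), 2))
```

A small perplexity gap doesn't guarantee the quantized model holds up on reasoning or long-context tasks, though, which is exactly why "how much was lost" stays fuzzy.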
> The model absolutely can be run at home. There even is a big community around running large models locally
IMO, 1T parameters with 32B active is a different scale from what most people are talking about when they say local LLMs. Totally agree there will be people messing with this, but the real value in local LLMs is that you can actually use them and get value from them on standard consumer hardware. I don't think that's really possible with this model.
I'd take "running at home" to mean running on reasonably available consumer hardware, which your setup is not. You can obviously build custom, but who's actually going to do that? OP's point is valid
>The model absolutely can be run at home.
There is a huge difference between "look I got it to answer the prompt: '1+1='"
and actually using it for anything of value.
I remember early on when people bought Macs (or some marketing team was shoveling them), proposing that people could reasonably run the 70B+ models on them.
They were talking about 'look it gave an answer', not 'look this is useful'.
While it was a bit obvious that an 'integrated GPU' is not Nvidia VRAM, we did have one Mac laptop at work that validated this.
It's cool these models are out in the open, but it's going to be a decade before people are running them at a useful level locally.
You can run AI models on unified/shared memory specifically on Windows, not Linux (unfortunately). It uses the same memory-sharing system that Microsoft originally built for gaming, for when a game runs out of VRAM. If you:
- have an i5 or better (or equivalent) CPU manufactured within the last 5-7 years
- have an Nvidia consumer gaming GPU (RTX 3000 series or better) with at least 8 GB VRAM
- have at least 32 GB system RAM (tested with DDR4 on my end)
- build llama-cpp yourself with every compiler optimization flag possible (rough sketch after this list)
- pair it with a MoE model compatible with your unified memory amount
- and configure MoE offload to the CPU to reduce memory pressure on the GPU
then you can honestly get to about 85-90% of cloud AI capability totally on-device, depending on what program you use to interface with the model.
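Here's a rough sketch of the build-and-offload steps, driven from Python purely for illustration. The checkout path, model filename, and the expert-offload flag (--override-tensor with a tensor-name pattern) are assumptions based on recent llama.cpp builds; flag names and tensor names vary by version and model, so check llama-server --help on your build.

```python
# Rough sketch only: build llama.cpp with the CUDA backend, then serve a
# quantized MoE model with dense layers on the GPU and expert tensors kept in
# system RAM. Paths, model name, and the --override-tensor pattern are
# assumptions; adjust to your checkout, your GGUF file, and your build's --help.
import subprocess

LLAMA_DIR = "llama.cpp"                  # path to your llama.cpp clone (assumption)
MODEL = "models/some-moe-q4_k_m.gguf"    # hypothetical 4-bit MoE quantization

# 1) Configure and build with CUDA and release optimizations.
subprocess.run(["cmake", "-B", "build", "-DGGML_CUDA=ON",
                "-DCMAKE_BUILD_TYPE=Release"], cwd=LLAMA_DIR, check=True)
subprocess.run(["cmake", "--build", "build", "--config", "Release", "-j"],
               cwd=LLAMA_DIR, check=True)

# 2) Serve: offload all layers to the GPU, but route the expert FFN tensors
#    (the bulk of a MoE model) to CPU/system RAM so an 8 GB card isn't swamped.
#    On MSVC builds the binary may land in build/bin/Release instead.
subprocess.run([
    f"{LLAMA_DIR}/build/bin/llama-server",
    "-m", MODEL,
    "-ngl", "999",                                 # as many layers on the GPU as fit
    "-c", "8192",                                  # context size
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # keep expert weights in system RAM
], check=True)
```

The reason this works at all is the MoE point from earlier in the thread: the expert tensors are most of the weights, but only a few experts are touched per token, so parking them in system RAM costs far less speed than you'd expect.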
And here's the shocking idea: those system specs can be met by an off-the-shelf gaming computer from, for example, Best Buy or Costco, right now. You can literally buy a CyberPower or iBuyPower model, again for example, download the source, run the compilation, and have that level of AI inference available to you.
Now, the reason it won't work on Linux is that the Linux kernel and Linux distros both leave that unified-memory capability up to the GPU driver to implement, which Nvidia hasn't done yet. You can code something like it into your own source, but it's still super unstable and flaky from what I've read.
(In fact, that lack of unified memory tech on Linux is probably why everyone feels the need to build all these data centers everywhere.)