Hacker News

dabockster · today at 7:02 PM

You can run AI models on unified/shared memory specifically on Windows, not Linux (unfortunately). It uses the same memory-sharing system that Microsoft originally built for gaming, so a game could spill over into system RAM when it ran out of VRAM. If you:

- have an i5 or better or equivalent manufactured within the last 5-7 years

- have an nvidia consumer gaming GPU (RTX 3000 series or better) with at least 8 GB vram

- have at least 32 GB system ram (tested with DDR4 on my end)

- build llama.cpp yourself with every compiler optimization flag possible

- pair it with a MoE model compatible with your unified memory amount

- and configure MoE offload to the CPU to reduce memory pressure on the GPU
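The build-and-offload steps above might look roughly like this. This is a sketch, not a verified recipe: the model filename is a placeholder, and the exact flag names (`--n-cpu-moe` in particular) depend on how recent your llama.cpp checkout is, so check `--help` on your build:

```shell
# Build llama.cpp from source with CUDA support and native CPU
# optimizations, in Release mode.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Run a MoE model with all layers offloaded to the GPU (-ngl 99), but
# keep the expert tensors of the first N layers in system RAM to reduce
# VRAM pressure. Tune N to your VRAM; "your-moe-model.gguf" is a
# placeholder for whatever MoE model fits your unified memory budget.
build/bin/llama-server -m your-moe-model.gguf -ngl 99 --n-cpu-moe 20
```

Older builds expose the same idea through `--override-tensor` (e.g. routing expert tensors to `CPU` by regex) instead of a dedicated MoE flag.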

then you can honestly get to about 85-90% of cloud AI capability totally on-device, depending on what program you interface with the model.

And here's the shocking part: those system specs can be met by an off-the-shelf gaming computer from, for example, Best Buy or Costco right now. You can literally buy a CyberPower or iBuyPower machine, download the source, run the build, and have that level of AI inference available to you.

Now, the reason it won't work on Linux is that the Linux kernel and the distros both leave that unified memory capability up to the GPU driver to implement, and Nvidia hasn't done it yet. You can hack something together at the source level, but from what I've read it's still super unstable and flaky.

(In fact, that lack of unified memory tech on Linux is probably why everyone feels the need to build all these data centers everywhere.)