How many people on Hacker News can run a 397b param model at home? Probably like 20-30.
You can rent a cloud H200 with 141GB VRAM in a server with 256GB of system RAM for $3-4/hr.
I’ve mentioned this as an option in other discussions, but if you don’t care that much about tok/sec, 4x Xeon E7-8890 v4s with 1TB of DDR3 in a supermicro X10QBi will run a 397b model for <$2k (probably closer to $1500). Power use is pretty high per token but the entry price cannot be beat.
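Rough break-even math for that build against the cloud rental mentioned upthread (illustrative only; the $3.50/hr midpoint is my assumption, and the Xeon box's power bill is ignored, so the real break-even comes later):

```python
# Break-even: used quad-Xeon build vs. renting a cloud H200 box.
# Uses the ~$2k build cost and $3-4/hr rental rate quoted above.
# Power costs for the Xeon box are ignored, so this underestimates
# the true break-even point.
build_cost = 2000          # dollars, upper bound quoted for the build
rental_rate = 3.50         # dollars/hour, midpoint of the $3-4/hr range

breakeven_hours = build_cost / rental_rate
print(f"Break-even vs. renting: ~{breakeven_hours:.0f} hours")  # ~571 hours
```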
Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago; I know they've added more interesting distribution mechanisms recently that I haven't had a chance to test yet.
The point is that open weights puts inference on the open market, so if your model is actually good and providers want to serve it, competition will drive costs down and inference speeds up. Like Cerebras running Qwen 3 235B Instruct at 1.4k tps for cheaper than Claude Haiku (let that tps number sink in for a second. For reference, Claude Opus runs ~30-40 tps, Claude Haiku at ~60. More than an order of magnitude difference). As a company developing models, it means you can't easily capture the inference margins, even though I believe you get a small kickback from the providers.
So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.
It only has 17b active params; it's a mixture-of-experts model. So probably a lot more people than you realize!
I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.
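Quick sanity check on why a 397B-param model barely fits in 256GB of unified memory (rough illustrative numbers, assuming a ~4-bit quantization like the mxfp4 quant mentioned elsewhere in the thread):

```python
# Weight footprint of a 397B-param model at ~4-bit quantization
# (~0.5 bytes/param). The weights alone land just under 200 GB,
# leaving limited headroom on a 256GB machine for the KV cache,
# activations, and the OS -- hence "barely, but sustainably".
params = 397e9
bytes_per_param = 0.5      # ~4-bit quantization

weights_gb = params * bytes_per_param / 1e9
print(f"Weights at 4-bit: ~{weights_gb:.0f} GB")  # ~199 GB
```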
I'm running it on dual DGX Sparks.
The 397B model can be run at home with the weights stored on an SSD (or on 2 SSDs, for double throughput).
Probably too slow for chat, but usable as a coding assistant.
Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 tps prefill and 20 tps decode.
AMD Threadripper Pro 9965WX, 256GB DDR5-5600, RTX 4090.
It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.
This is like saying that open source is not important because I don't have a machine to run it on right now. Of course it is important. We don't have any state-of-the-art language models that are open source, but some are still open weight. Better than nothing, and the only way to secure some kind of privacy and control over your own AI use. It is my goal to run these large models locally eventually; if they all go away, that is not even a possibility...