I’ve mentioned this as an option in other discussions, but if you don’t care that much about tok/sec, 4x Xeon E7-8890 v4s with 1TB of DDR3 in a Supermicro X10QBi will run a 397b model for <$2k (probably closer to $1500). Power use per token is pretty high, but the entry price can’t be beat.
Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec; a model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago. I know they’ve added more interesting distribution mechanisms recently, but I haven’t had a chance to test them yet.
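For anyone wanting to reproduce this, a rough sketch of the kind of invocation I mean, using llama.cpp’s basic `--numa` mode (the model filename and thread count are placeholders for your own setup):

```shell
# Drop the page cache first; llama.cpp's docs recommend this before NUMA
# runs so the mmap'd model pages get placed fresh across nodes.
echo 3 | sudo tee /proc/sys/vm/drop_caches

# --numa distribute spreads execution evenly over all NUMA nodes.
# -t 96 assumes 4x 24-core sockets; match it to your physical core count.
./llama-cli \
  -m ./model-q8_0.gguf \
  --numa distribute \
  -t 96 \
  -p "Hello"
```

On a quad-socket box like this, cross-node memory traffic is the bottleneck, so it’s worth comparing `--numa distribute` against plain `numactl --interleave=all` for your particular model.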