Waiting for official support in llama.cpp. There is a fork that can run a lightly quantized DeepSeek V4 Flash (Q2 expert layers) entirely in 128GB of RAM, without streaming weights from disk.
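For intuition on why that lands around the 128GB mark, here's a back-of-envelope RAM estimate. The parameter count and effective bits/weight below are assumptions for illustration only, not DeepSeek V4 Flash's actual numbers:

```python
# Back-of-envelope RAM estimate for a quantized MoE checkpoint.
# All model numbers below are assumptions for illustration only;
# the real parameter count and quant mix may differ.

def model_ram_gb(params_billions: float, bits_per_weight: float,
                 runtime_overhead: float = 1.10) -> float:
    """Approximate resident size of a quantized model in GB.

    params_billions  -- total parameter count, in billions
    bits_per_weight  -- effective bits/weight for the whole quant mix
                        (Q2 experts plus higher-precision attention and
                        embedding layers typically land above 2 bpw)
    runtime_overhead -- KV cache and runtime buffers, ~10% here
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * runtime_overhead / 1e9

# Hypothetical ~350B-parameter MoE at ~2.6 effective bits/weight:
print(f"{model_ram_gb(350, 2.6):.0f} GB")  # ~125 GB: fits in 128GB, nowhere near 48GB
```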
Ouch. Can't run that on my M4 mac with 48GB RAM.