Waiting for official support in llama.cpp. There is a fork that can run a lightly quantized DeepSeek V4 Flash (Q2 expert layers) entirely in 128GB of RAM, without streaming weights from disk.
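For intuition on why that lands around the 128GB mark, here's a back-of-envelope RAM estimate. The parameter count and effective bits/weight below are assumptions for illustration only, not DeepSeek V4 Flash's actual numbers:

```python
# Back-of-envelope RAM estimate for a quantized MoE checkpoint.
# All model numbers below are assumptions for illustration only;
# the real parameter count and quant mix may differ.

def model_ram_gb(params_billions: float, bits_per_weight: float,
                 runtime_overhead: float = 1.10) -> float:
    """Approximate resident size of a quantized model in GB.

    params_billions  -- total parameter count, in billions
    bits_per_weight  -- effective bits/weight for the whole quant mix
                        (Q2 experts plus higher-precision attention and
                        embedding layers typically land above 2 bpw)
    runtime_overhead -- KV cache and runtime buffers, ~10% here
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * runtime_overhead / 1e9

# Hypothetical ~350B-parameter MoE at ~2.6 effective bits/weight:
print(f"{model_ram_gb(350, 2.6):.0f} GB")  # ~125 GB: fits in 128GB, nowhere near 48GB
```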
Ouch. Can't run that on my M4 mac with 48GB RAM.