Love this, even if I can't use it atm (not got the h/w - only 96GB on an M2 Max). I get that the general computing public will find it unusable or worse. Reminds me of how home computers were mere toys before they became personal computers (PCs).

On my h/w the only passable combo atm is pi agent + llama.cpp + the nemotron cascade-2 model: up to 1M context, and the hybrid arch doesn't crash & burn O(N^2)-style at the 10K-50K-100K context depths code agents use. Was on a plane without Internet the other day, and it brought a smile to my face that I could run pi agent (with llama.cpp serving) - just about usable at 30-40 tok/s. Afaik the usual API speeds are about double that, 60-80 tok/s.

Sensors showed ~60W draw while running inference, so the battery probably wouldn't last more than ~3h. The model being only 30B leaves plenty of space for KV caches and other programs, even at a generous 8-bit quant. And with only 3B params active at a time (MoE, A3B), that's about the most this ageing M2 Max can carry, it seems.
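For anyone who wants to sanity-check the memory budget, here's a rough back-of-envelope sketch. The architecture numbers (layers, KV heads, head dim) are my own placeholder assumptions, not the actual model's spec:

```python
# Back-of-envelope memory budget: 30B model at 8-bit quant on a 96 GB machine.
# All architecture figures below are illustrative assumptions.

def weights_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB for a quantized model."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(ctx_len: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Full-attention KV cache: 2 (K and V) * layers * heads * dim * ctx * bytes."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

model = weights_gb(30, 8)  # ~30 GB of weights at 8-bit
# Hypothetical architecture numbers, just to show the shape of the math:
kv = kv_cache_gb(ctx_len=100_000, layers=48, kv_heads=8, head_dim=128)

print(f"weights ~{model:.0f} GB, 100K-token KV cache ~{kv:.1f} GB")
```

On these assumptions, weights plus a 100K-token cache stay well under 96 GB. A hybrid arch should need even less than this formula suggests, since its linear-attention layers don't keep a full per-token KV cache.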
It should work with 96GB, especially with a limited context. But the M2 Max is a bit slower, yes.
> even if I can't use it atm (not got the h/w - only 96GB on an M2 Max).
Not sure if it works differently on macOS, but with CUDA + DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf I can fit it within 96GB of VRAM, together with the context. So in theory you should be able to as well - unless macOS reserves several GB of RAM/VRAM for the OS/display by default.