Yep, that happens with coding agents sending a very large system prompt, and also when later tool calls feed it large files or diffs. But with the M3 Ultra the prefill speed is almost 500 t/s, which is well into the very usable zone. With an M3 Max you need a bit more patience, but it works well, and since the model emits its thinking process, if you use the pi agent you don't just wait: you read the uncensored chain of thought. I posted a video on X yesterday using it with my M3 Max. It spills tokens at a decent speed.
Given how small the KV cache for this model seems to be at small contexts, can you clarify how the engine behaves if you run increasingly larger batches on your prosumer hardware (128 GB of RAM)? Does it eventually become compute-limited?
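For intuition on where that crossover might land, here's a back-of-envelope sketch. All the hardware numbers (peak FLOPS, bandwidth) are rough assumptions for illustration, not measurements of any specific Mac:

```python
# Back-of-envelope: at what batch size does decode flip from
# memory-bound (streaming weights) to compute-bound?
# PEAK_FLOPS and MEM_BW below are assumed placeholder values.

PEAK_FLOPS = 28e12      # assumed fp16 throughput, FLOP/s
MEM_BW = 800e9          # assumed memory bandwidth, bytes/s
BYTES_PER_PARAM = 2     # fp16 weights

def crossover_batch(peak_flops=PEAK_FLOPS, mem_bw=MEM_BW,
                    bytes_per_param=BYTES_PER_PARAM):
    """Batch size where compute time equals weight-streaming time.

    Per decode step, each of P parameters is read once
    (P * bytes_per_param bytes) and contributes ~2 FLOPs per sequence
    in the batch (2 * P * B FLOPs total). Setting
    2*P*B / peak_flops == P*bytes_per_param / mem_bw
    and solving for B gives the crossover; P cancels out.
    """
    return peak_flops * bytes_per_param / (2 * mem_bw)

print(f"compute-bound above batch ~{crossover_batch():.0f}")
```

Under these assumed numbers the flip happens around batch 35; on real hardware the crossover moves with quantization width and achievable (not peak) bandwidth.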
Also, can the engine support transparent mmap use to fetch weights from disk on demand, at least for pure-CPU inference? (GPU inference might be harder, since it's not clear how page faults would interact with a running shader.)
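For the CPU path, the basic mechanism is simple enough to sketch. This is a minimal illustration, not the engine's actual loader; the raw fp16 file layout is a hypothetical assumption:

```python
# Minimal sketch of on-demand weight fetching via mmap (CPU path).
# Assumes a hypothetical file containing raw contiguous fp16 weights.
import mmap

import numpy as np


def open_weights(path):
    """Map a weight file into the address space without reading it.

    Pages are faulted in from disk only when a tensor slice is
    actually touched, so cold weights cost no RAM until first use,
    and the OS page cache can evict them under memory pressure.
    """
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # np.frombuffer is a zero-copy view over the mapping: indexing
    # into it is what triggers the page faults.
    return np.frombuffer(mm, dtype=np.float16)
```

A real engine would additionally need per-tensor offsets from a header (as safetensors/GGUF do) and `madvise` hints to control readahead, but the eviction-friendly behavior comes for free from the page cache.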
If the latter test is successful, the next step would be testing Macs with more limited RAM: first with simple requests (which would be quite slow), then with larger batches (which might be more worthwhile if the cost of fetching weights from storage can be partially amortized, leaving the run bottlenecked by other factors).