Very impressive. One thing that seems odd to me is that is at like 4 minutes before it starts a resp...

layoric • yesterday at 11:33 PM • 2 replies • view on HN

Very impressive. One thing that seems odd to me is that is at like 4 minutes before it starts a response for large input? I don't use mac hardware for LLMs, but that is quite surprising and would seem to be a pretty large stumbling block for practical usage.

Edit: Caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.

Replies

antirez • today at 5:35 AM

Yep that happens with coding agents sending a very large system prompt. And also when later tool calling feed it large files or diffs. But with the M3 ultra the prefill speed is almost 500 t/s that is quite into the very usable zone. With M3 max you need a bit more patience but it works well and as it emits the think process if you use the pi agent you don't wait: you read non censored chain of though. I posted a video on X yesterday using it with my m3 max. It spills tokens at a decent speed.

➕ show 1 reply

MrBuddyCasino • today at 11:04 AM

Prefill is faster on the M5s, the older generations are a bit weak.

alt Hacker News

Replies