Hacker News

Yukonv today at 7:03 AM

Good to see Ollama catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5, as their graphs point out. What has really been a game changer for my workflow is https://omlx.ai/, which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed and more time is spent on generation than waiting for a 50k+ context window to process.
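The idea behind SSD KV cold caching can be sketched roughly as follows. This is a hypothetical illustration, not omlx's actual implementation: the `prefill` stand-in, the cache directory, and the on-disk format are all assumptions; the point is just that a prompt's KV cache is keyed and persisted so a later session can reload it from disk instead of re-running the expensive prefill pass.

```python
# Hypothetical sketch of SSD "cold" KV caching. Not omlx's real internals:
# the prefill stand-in, cache path, and pickle format are all assumptions.
import os
import pickle
import hashlib
import numpy as np

CACHE_DIR = "/tmp/kv_cache"  # hypothetical on-disk cache location


def cache_path(prompt: str) -> str:
    # Key the cache file by a hash of the prompt prefix.
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    return os.path.join(CACHE_DIR, f"{digest}.pkl")


def prefill(prompt: str) -> dict:
    # Stand-in for the expensive prefill pass: a real engine would run the
    # model over the whole prompt and produce per-layer K/V tensors.
    tokens = prompt.split()
    return {"keys": np.random.rand(len(tokens), 8),
            "values": np.random.rand(len(tokens), 8)}


def get_kv_cache(prompt: str) -> dict:
    # Load the KV cache from SSD if present; otherwise prefill and persist.
    path = cache_path(prompt)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cold-cache hit: prefill is skipped
    os.makedirs(CACHE_DIR, exist_ok=True)
    kv = prefill(prompt)
    with open(path, "wb") as f:
        pickle.dump(kv, f)
    return kv
```

With this shape, the second session that opens the same long prompt pays only SSD read latency rather than a full prefill over the 50k+ tokens.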


Replies

davesque today at 8:41 PM

Yeah, omlx seems to me like the front-runner right now for running MLX models locally in agent workflows (which depend heavily on caching).