logoalt Hacker News

LuxBennutoday at 5:01 AM3 repliesview on HN

Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path


Replies

yg1112today at 5:50 PM

The key difference is that MLX's array model assumes unified memory from the ground up. llama.cpp's Metal backend works fine but carries abstractions from the discrete GPU world — explicit buffer synchronization, command buffer boundaries — that are unnecessary when CPU and GPU share the same address space. You'll notice the gap most at large context lengths where KV cache pressure is highest.

show 1 reply
goldenarmtoday at 9:07 AM

How many tokens per second?

zozbot234today at 8:37 AM

They initially messed up this launch and overwrote some of the GGUF models in their library, making them non-downloadable on platforms other than Apple Silicon. Hopefully that gets fixed.