Hacker News

yg1112 · today at 5:50 PM

The key difference is that MLX's array model assumes unified memory from the ground up. llama.cpp's Metal backend works fine but carries abstractions from the discrete GPU world — explicit buffer synchronization, command buffer boundaries — that are unnecessary when CPU and GPU share the same address space. You'll notice the gap most at large context lengths where KV cache pressure is highest.
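To make the contrast concrete, here's a toy sketch in plain Python (no real GPU; every class and method name here is hypothetical, not MLX or llama.cpp API) of the two memory models: the discrete-GPU pattern where every kernel launch is bracketed by explicit host/device copies, versus the unified-memory pattern where "CPU" and "GPU" operate on the same allocation:

```python
class DiscreteGPU:
    """Discrete-GPU style: separate memory pools, explicit transfers."""
    def __init__(self):
        self.vram = {}                         # device-side copy of data

    def upload(self, name, host_data):
        self.vram[name] = list(host_data)      # copy: host -> device

    def kernel_double(self, name):
        self.vram[name] = [x * 2 for x in self.vram[name]]

    def download(self, name):
        return list(self.vram[name])           # copy: device -> host


class UnifiedMemory:
    """Unified-memory style: one allocation shared by CPU and GPU."""
    def __init__(self, data):
        self.buf = data                        # single shared buffer

    def gpu_double(self):
        for i in range(len(self.buf)):         # "GPU" mutates in place
            self.buf[i] *= 2

    def cpu_read(self):
        return self.buf                        # no copy: same memory


# Discrete path: two copies bracket the kernel launch.
gpu = DiscreteGPU()
gpu.upload("kv", [1, 2, 3])
gpu.kernel_double("kv")
result_discrete = gpu.download("kv")

# Unified path: zero copies; the CPU sees the GPU's writes directly.
shared = UnifiedMemory([1, 2, 3])
shared.gpu_double()
result_unified = shared.cpu_read()

assert result_discrete == result_unified == [2, 4, 6]
```

In actual MLX the unified model surfaces as stream selection rather than data movement: the same `mx.array` can be consumed by ops dispatched to `mx.cpu` or `mx.gpu` via the `stream` argument, with no transfer step in between. At large context lengths the KV cache dominates memory traffic, which is why avoiding those copies matters most there.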


Replies

lioeters · today at 8:38 PM

Insightful comment, thanks!