If you're on a Mac, use the MLX backend versions, which are considerably faster than the GGML-based versions (including llama.cpp), and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.
I was comparing various models on an M5 Pro with 48GB RAM, MLX vs GGUF, and found that the MLX models have a higher time to first token (sometimes by an order of magnitude), while tokens/sec and memory usage are about the same as GGUF.
Gemma 3 27B q4:
* MLX: 16.7 t/s, 1220ms ttft
* GGUF: 16.4 t/s, 760ms ttft
Gemma 4 31B q8:
* MLX: 8.3 t/s, 25000ms ttft
* GGUF: 8.4 t/s, 1140ms ttft
Gemma 4 A4B q8:
* MLX: 52 t/s, 1790ms ttft
* GGUF: 51 t/s, 380ms ttft
All comparisons were done in LM Studio, with the latest versions of everything.
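For anyone who wants to reproduce numbers like these outside LM Studio's UI: here's a rough sketch of measuring TTFT and tokens/sec against a local OpenAI-compatible streaming endpoint (LM Studio serves one at `http://localhost:1234/v1` by default; the model name and prompt below are placeholders, and counting one token per SSE chunk is an approximation).

```python
import json
import time
import urllib.request

def compute_stats(start, token_times):
    """Derive TTFT (ms) and decode tokens/sec from a request start time
    and the arrival timestamps of each streamed token."""
    if not token_times:
        return float("nan"), float("nan")
    ttft_ms = (token_times[0] - start) * 1000.0
    decode_secs = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_secs if decode_secs > 0 else float("nan")
    return ttft_ms, tps

def stream_stats(base_url, model, prompt, max_tokens=256):
    """Send a streaming chat completion and record when each chunk arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    token_times = []
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            delta = chunk["choices"][0].get("delta", {}).get("content")
            if delta:
                token_times.append(time.perf_counter())
    return compute_stats(start, token_times)

if __name__ == "__main__":
    # Placeholder model name -- substitute whatever LM Studio has loaded.
    ttft_ms, tps = stream_stats("http://localhost:1234/v1", "gemma-3-27b", "Hello")
    print(f"TTFT: {ttft_ms:.0f} ms, {tps:.1f} t/s")
```

Note that TTFT measured this way includes network and server overhead, and it scales heavily with prompt length, so use the same prompt for every backend you compare.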