mft_ • yesterday at 5:44 PM • 2 replies

The 27B model is dense, so it is relatively slow. The 35B-A3B model is marginally weaker, but being MoE it is much faster: roughly 4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
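
To put the two runs side by side, here is a minimal sketch that computes the speedup from the figures quoted above. The dictionary layout and the `speedup` helper are illustrative names, not part of llama-bench, and the comparison is only rough because the two models were run at different quantizations (Q6_K_XL vs Q4_K_XL).

```python
# Throughput in tokens/second, copied from the llama-bench numbers quoted above.
# The structure and names here are illustrative, not llama-bench output format.
results = {
    "Qwen3.6-35B-A3B (Q6_K_XL)": {"pp512": 858, "tg128": 43},
    "Qwen3.6-27B (Q4_K_XL)": {"pp512": 103, "tg128": 8},
}


def speedup(fast, slow):
    """Return the per-test throughput ratio fast/slow."""
    return {test: fast[test] / slow[test] for test in fast}


ratios = speedup(
    results["Qwen3.6-35B-A3B (Q6_K_XL)"],
    results["Qwen3.6-27B (Q4_K_XL)"],
)

for test, ratio in ratios.items():
    print(f"{test}: {ratio:.1f}x")
```

On these numbers the MoE model comes out about 8.3x faster at prompt processing and about 5.4x faster at token generation, which sits within the rough 4-8x range mentioned above.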

Replies

stebalien • yesterday at 10:05 PM

Have you tried enabling MTP (multi-token prediction)? Those numbers are similar to what I was getting on my Strix Halo box, but configuring and enabling MTP doubled the TG speed of the 27B model (18-20 t/s now).

pixelesque • yesterday at 6:25 PM

Thanks for the info.
