But you aren't actually trying out the model. You quantized it beyond what is generally considered acceptable and reduced the number of active experts per token, which these models are not designed for.
Even worse, the GitHub repo advertises:
> Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
Hiding the fact that the active parameter count is _not_ actually 17B once you cut the number of experts down.
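A back-of-the-envelope sketch of why, with entirely made-up numbers (the real Qwen3.5-397B-A17B shared/expert split and stock top-k are not public in this thread, so all figures below are assumptions): in an MoE, active params per token are roughly the shared params (attention, embeddings) plus top-k times the per-expert share, so cutting the routed experts cuts the "A17B" figure proportionally.

```c
#include <stdio.h>

/* Illustrative only: all numbers are assumptions, not the real
 * Qwen3.5-397B-A17B config. Shows how active params shrink as
 * the number of routed experts per token (top-k) is reduced. */
int main(void) {
    double shared_params = 5e9;  /* assumed: attention, embeddings, router */
    double expert_params = 12e9; /* assumed: expert share at the stock top-k */
    int    stock_top_k   = 8;    /* assumed stock routed experts per token */

    for (int k = stock_top_k; k >= 2; k /= 2) {
        double active = shared_params + expert_params * k / stock_top_k;
        printf("top-k = %d -> ~%.1fB active params\n", k, active / 1e9);
    }
    return 0;
}
```

Under these assumed numbers, halving top-k drops active params from ~17B to ~11B, so the advertised "A17B" no longer describes what is actually running.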