I'm surprised no one else has mentioned low power mode.
With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's a night and day experience.
I almost always keep my laptop on low power mode.
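To put a number on "nearly offset": a rough sketch using only the two throughput figures above (80 t/s on high power, 38 t/s on low power). The function names are made up for illustration, and the 2.0x speedup in the usage lines is an assumed value, not a measurement.

```python
def required_speedup(high_power_tps: float, low_power_tps: float) -> float:
    """Speedup factor needed on low power to match high-power throughput."""
    return high_power_tps / low_power_tps


def effective_tps(base_tps: float, speedup: float) -> float:
    """Throughput after applying a given speculative-decoding speedup factor."""
    return base_tps * speedup


if __name__ == "__main__":
    # Figures quoted in the comment above.
    high_tps, low_tps = 80.0, 38.0
    print(f"speedup needed to break even: {required_speedup(high_tps, low_tps):.2f}x")

    # Assumed speedup for illustration only; real gains depend on the
    # model, the draft method, and how often drafted tokens are accepted.
    assumed_speedup = 2.0
    print(f"low power with {assumed_speedup}x: {effective_tps(low_tps, assumed_speedup):.1f} t/s")
```

So speculative decoding would need to deliver a bit over 2x on low power to fully close the gap; anything short of that only narrows it.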
Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).
Will give this a try later. Enjoy working with A3B Coder, but the heat coming out of my 32GB M5 is a lot. This might be the trick - Thanks!
It is a less efficient use of the GPU and uses more electricity overall, no?
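One way to settle the electricity question is to compare energy per token rather than speed: joules per token is just average power draw divided by tokens per second. A minimal sketch follows. Nobody in this thread posted wattage, so the power values below are hypothetical placeholders to be replaced with your own measurements; only the t/s figures come from the comment above.

```python
def joules_per_token(avg_watts: float, tokens_per_second: float) -> float:
    """Energy cost of one generated token, in joules (watts = joules/second)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_watts / tokens_per_second


if __name__ == "__main__":
    # HYPOTHETICAL power draws, not measured. Substitute real readings.
    hypothetical_high_watts = 60.0
    hypothetical_low_watts = 20.0

    # Throughput figures quoted earlier in the thread.
    high = joules_per_token(hypothetical_high_watts, 80.0)
    low = joules_per_token(hypothetical_low_watts, 38.0)
    print(f"high power: {high:.3f} J/token, low power: {low:.3f} J/token")
```

Whichever mode comes out with the lower J/token on your machine is the one using less electricity for the same amount of output.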
Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.