jasonjmcghee • today at 4:22 AM • 4 replies • view on HN

I'm surprised no one else has mentioned low power mode.

With no speculative decoding, in high power mode I get 80 t/s on 35B A3B, and the laptop gets hot and the fans spin up. In low power mode I get 38 t/s: no fans, and the laptop stays cool to warm.

If you don't currently use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and the experience is night and day.

I almost always keep my laptop on low power mode.
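
For a rough sense of scale, here is a back-of-the-envelope sketch in Python using the two throughput figures above. The 1.8x speculative decoding speedup is a purely hypothetical placeholder, not a measurement; the real gain depends on the model, the draft method, and the acceptance rate.

```python
def required_speedup(high_tps: float, low_tps: float) -> float:
    """Speedup low power mode would need to match high power throughput."""
    return high_tps / low_tps


def throughput_with_speedup(base_tps: float, speedup: float) -> float:
    """Throughput after applying a multiplicative speedup to a baseline."""
    return base_tps * speedup


HIGH_POWER_TPS = 80.0  # from the comment: high power, no speculative decoding
LOW_POWER_TPS = 38.0   # from the comment: low power, no speculative decoding
ASSUMED_SPEEDUP = 1.8  # hypothetical value for illustration only

needed = required_speedup(HIGH_POWER_TPS, LOW_POWER_TPS)
low_power_spec = throughput_with_speedup(LOW_POWER_TPS, ASSUMED_SPEEDUP)

print(f"speedup needed to fully close the gap: {needed:.2f}x")
print(f"low power at assumed {ASSUMED_SPEEDUP}x: {low_power_spec:.1f} t/s")
```

So fully closing the gap would take a bit more than a 2x speedup, which is why "nearly offset" is the right framing.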

Replies

html5cat • today at 7:07 AM

Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.

anon373839 • today at 5:05 AM

Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).

c16 • today at 8:08 AM

Will give this a try later. I enjoy working with A3B Coder, but the heat coming out of my 32 GB M5 is a lot. This might be the trick. Thanks!

mycall • today at 6:52 AM

Isn't it a less efficient use of the GPU, and doesn't it use more electricity overall?
