I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using th...

domh • today at 11:06 AM • 8 replies • view on HN

I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0 and it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world". Is this the best that's currently achievable with my hardware or is there something that can be configured to get better results?

Replies

zozbot234 • today at 11:18 AM

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.

➕ show 2 replies

functional_dev • today at 2:39 PM

I did not know, that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...

➕ show 1 reply

Octoth0rpe • today at 11:11 AM

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not an unsurprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).

fooker • today at 1:42 PM

Avoid reasoning models in any situation where you have low tokens/second

EagnaIonat • today at 2:47 PM

When MLX comes out you will see a huge difference. I currently moved to LMStudio as it currently supports MLX.

kylehotchkiss • today at 5:34 PM

I made my M2 Max generate a biryani recipe for me last night with 64gb ram and the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing, my laptop got quite warm, but for a 35b model on nearly 4 year old hardware, I see the light. This is the future.

xienze • today at 11:31 AM

Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.

➕ show 1 reply

hbbio • today at 11:17 AM

[dead]

alt Hacker News

Replies