I've tried to use a local LLM on an M4 Pro machine and it's quite painful. Not surprised that people into LLMs would pay for tokens instead of trying to force their poor MacBooks to do it.
Qwen 3.5 35B-A3B and 27B have changed the game for me. I expect we'll see something comparable to Sonnet 4.6 running locally sometime this year.
I’m super happy with it for embedding, image recog, and semantic video segmentation tasks.
What are the other specs and how's your setup look? You need a minimum of 24GB of RAM for it to run 16GB or less models.
Local LLM inference is all about memory bandwidth, and an M4 pro only has about the same as a Strix Halo or DGX Spark. That's why the older ultras are popular with the local LLM crowd.