I have spent a HUGE amount of time over the last two years experimenting with local models.
A few lessons learned:
1. Small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
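To make the "embedded information extraction" sweet spot concrete, here is a minimal sketch of how such a pipeline tends to look: prompt a small local model for JSON-only output, then parse its reply defensively. The prompt template, field names, and helper functions here are hypothetical illustrations, not the commenter's actual setup; the model call itself (e.g. via an Ollama endpoint) is omitted and stubbed with a canned reply.

```python
import json

# Hypothetical extraction prompt for a small local model (names and
# fields are made up for illustration). The real call would go to a
# local inference server such as Ollama; it is omitted here.
EXTRACTION_PROMPT = """Extract the fields below from the text.
Reply with a single JSON object only, no commentary.
Fields: name, date, amount
Text: {text}"""

def build_prompt(text: str) -> str:
    """Fill the extraction template with the input text."""
    return EXTRACTION_PROMPT.format(text=text)

def parse_reply(reply: str) -> dict:
    """Parse the model's reply, tolerating stray prose around the JSON.

    Small models often wrap JSON in chatter, so we slice from the
    first '{' to the last '}' before parsing.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in reply")
    return json.loads(reply[start:end + 1])

# Canned reply standing in for a real model response:
canned = 'Sure! {"name": "Acme Corp", "date": "2024-03-01", "amount": "$120"}'
print(parse_reply(canned)["name"])  # → Acme Corp
```

The defensive slice-then-parse step is the part that matters in practice: small local models are good at this task but not perfectly reliable at emitting bare JSON, so the surrounding application should expect and strip chatter.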
What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?
What kind of hardware did you use? I suppose that an 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.
I've been really interested in the difference between Qwen3.5 9b and 14b for information extraction. Is there a discernible difference in quality or capability?
I'd love to know how you fit smaller models into your workflow. I have an M4 MacBook Pro w/ 128GB RAM, and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.