How you can do dev in 2026 using 64k context and without sub agents?
The benchmark seemed fine until I saw that.
If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.
If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.
For anything else local, including writing some automation scripts and such, it works great.
why does everyone imply you need a $10k laptop which then starts burning when you run Qwen 3.6? Get any other system with enough VRAM for a third of the price. Framework Desktop (Strix Halo 128GB) still costs under 4k nowadays, is nearly silent even on 100% GPU + CPU. (also it gets only slightly 'warm', but with a desktop you don't care anyway, I guess).
I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
I just burned 20 minutes because I wanted to play hex minesweeper: https://hexabomb.pgpln.app
Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...
Help me improve local model performance with petsitter!
It basically exploits the face that time can be traded for intelligence with local models
Has anyone managed to cleanly integrate Web search into local models (run with llama.cpp)? The biggest limitation of the class of models that fit into one or two consumer GPUs is that they lack world knowledge, but presumably this can be remedied by enabling access to use the Internet.
How does llama.cpp use the GPU efficiently as opposed to MLX?
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
Any chance to run this on a RTX 3090 and 64GB of regular RAM with decent context size?
I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.
I honestly don't get the hostility against local models in this thread (and in some other threads recently).
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
Tried looking at it, but needs a much beefier machine than I have RN.
Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote AOI spend.
In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.
What model fits on 36GB RAM mac?
Checkout details on what this runs on for local AI here: https://tokenstead.ai/models/qwen3-6-27b
Lost count of number of times I read this or similar:
For me it’s the first local model that actually makes sense as a general intelligence.
3.5 122B is much better. 27 B is bad at Long context and Svelte
hi guys... I run specialized quants on my 24gb air.. (I specialize in 3-bit quants that punch above their weight).. try out my version of 3.6-27b I think you be impressed https://huggingface.co/macwhisperer/Qwen3.6-27B-SuperDense
Running this model on a 48 GB memory MacBook Pro when offline, it performs its tasks, but of course, it’s slower than Claude or Codex.
Best way to make your M series macbook pro feel like a good old fashion intel macbook pro. Run a local model.
On AMD R9700, I'm getting ~90 t/s with 35b MTP variant and ~40t/s with dense 27b MTP
I'm using Qwen and Gemma 4 locally and it's pretty good stuff, not frontier level but gets the job done.
Its feasible but that laptop is not available for most devs.
I do have access for a 64 gb ram mac mini but most people don't.
Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
also for those with only 16gb-- try this model https://huggingface.co/macwhisperer/Gemma4-12B-SuperDense its exceptional!
qwen 3.6 27b and qen35b a3b work like magic, if we get dpark speculative decoding versions of these models it will further improve the throughput
What's the minimum requirement for a Nvidia card to run it? For let's say 10 t/s.
Yup, been rocking theQwen3.6-35B-A3B-MTP-GGUF locally with 88tk/s it's amazing.
>Real work
This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".
Unless your work is to create thousands wordpress tremplates to sell - this is not a "real work".
Give it a repository (any kind of OSS project will do for an example) and a github issue requesting a knew feature or describing a confirmed bug. (you can and probably should write a prompt for LLM shough, don't just provide the issue itself)
And then whatch it go.
And then judge the result and it's quality.
Sorry, but from my experience 27B is just useless. You do get a result and some times it does work, but most of the times it is not event on junior dev level. And it takes it a lot of time to do the thing, unless you have an extremely expensive machine.
If I have 10k to spend, what should I buy for the best local model experience?
> I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.
> https://sleepingrobots.com/dreams/stop-using-ollama/
I had faced roadblocks while integrating with openclaw using ollama (Was trying to experiment with `qwen3-vl:2b`). I was tracking the issue back to openclaw at that time, I didn't even consider investigating ollama.
I attached a threads post here where I'm talking to meta ai to expand on both scenarios (not to use ollama, but llama.cpp & my take on the why this is the way it is - ie. commercial gains)
FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?
Qwen is so good a model.
I see OpenCode mentioned in the article, and I would strongly warn against using it for local development because it disrespects caching (the content of the first turn / system prompt is NOT stable). I use Pi which works much better.
Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
Very capable lora adapters are surfacing but it seems they are very niche.
When reading the comments, it seems that in the AI race to zero, Apple was already at the finish line. as predicted.
So it will be no surprise that there will be a time where everyone will be able to run a local model, say GLM 5.2 locally on their machine. Like it or not.
Hmm, i used it and it can't get past a simple coding test that 5.5 passes with light reasoning
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
goat
Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
[flagged]
[flagged]
[flagged]
[dead]
Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. Bit while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.