I see that going around, and either the test cases are too simplistic or I'm doing something wrong. I have a server with a 3090 in it, enough to run qwen3.6, but I haven't had much luck using it with either codex or oh-my-pi. They work, but the model gets really slow at ~64k context and its attention degrades quickly. Sometimes you'll execute a prompt, the model will load a test file, and then say something like "I was presented with a test file but no command. What should I do with it?".
So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good at exploring the codebase and coming up with plans. For now you need to pair it with a model capable of ingesting the whole context and producing a detailed plan, and even then the implementation might take 10x the time it'd take sonnet or Gemini 3 to crunch through the plan.
EDIT:
My setup is really as simple as possible. I run ollama on a remote server on my local network. On my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, and the model then becomes available to the agent harnesses. I'm not sure anymore how I set the context, but I think it was directly in oh-my-pi. So server config- and quantization-wise, it's the defaults.
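Concretely, the client side is just something like this (the server address is made up, adjust for your network):

export OLLAMA_HOST=http://192.168.1.50:11434
ollama pull qwen3.6:27b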
I can see that, and I don't know your setup, but there are people pushing >70 t/s with MTP on a single 3090, and still >50 t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k with turboquant and the like should be possible for you. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.
EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090
EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk
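EDIT-3: if you do end up on llama.cpp, speculative decoding with a small draft model is one way to chase those t/s numbers. A rough sketch, not a recipe (the GGUF names here are placeholders, and whether the MTP-style path works depends on your model and build):

llama-server -m Qwen3.6-27B-Q4_K_M.gguf -md Qwen3.6-0.6B-Q4_K_M.gguf -ngl 999 -ngld 999 --draft-max 16 --draft-min 4 -c 65536 --flash-attn on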
This link [1] features some good insight on how to adapt your usage to smaller models which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.
[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...
Yeah. Context size matters a lot. With OpenCode dumping like 10k tokens into the system prompt, it takes like 4 rounds before it has to compact at, say, 64k. It's not really worth running at anything below 100k, and even then the models aren't all that useful.
They're also pretty terrible at summarization. Pretty much every time, some file read or write in the middle of the task would cross the context limit, and the summary would mark it as completed. I think keeping the first prompt as well as the last few turns intact would improve this quite a lot, but at low context sizes that's pretty much the whole context ...
When you run ollama serve, make sure you override the context size to about 32K. Also, I give the model a short, useful README.md about the code it is writing or modifying, and a Makefile with handy targets for the agent to use. I usually use Claude Code with qwen3.6.
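On a recent ollama you can do the override with an env var when starting the server (older versions needed a Modelfile with `PARAMETER num_ctx` instead):

OLLAMA_CONTEXT_LENGTH=32768 ollama serve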
I also go outside for fresh air while I wait for a session to run.
You're not sharing what quantization you're using. In my experience, anything below Q8 and smaller than ~30B tends to be basically useless locally, at least for what you'd typically use codex et al. for; I'm sure it works for very simple prompts.
But as soon as you go below Q8, the models get stuck in repetition loops, get the tool-calling syntax wrong, or just start outputting gibberish after a short while.
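If you're on ollama, it's worth checking what you actually pulled, since the default tags are usually Q4 unless you ask for something else:

ollama show qwen3.6:27b

The model details it prints include a quantization line (Q4_K_M, Q8_0, ...), so at least you know what you're testing.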
I see your updated post. Switch over to llamacpp and look up recommended quants and settings. A good place for this info is on /r/localllama
Qwen3.6 supports 266k context out of the box. Try using a q8 KV cache to fit more of it in VRAM.
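With llama.cpp that's just the cache-type flags; note that a quantized KV cache needs flash attention enabled. Something like (the GGUF name is whatever quant you grabbed):

llama-server -m Qwen3.6-27B-Q4_K_M.gguf -c 131072 -ngl 999 --flash-attn on -ctk q8_0 -ctv q8_0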
I agree for planning it's not there yet. But I wouldn't be surprised if something came out that was in a similar weight class.
Try oh-my-openagent plan mode.
I have an old Supermicro X10DRU-i server with two Tesla V100s (48 GB VRAM) and 128 GB RAM, and I have been running qwen3.6-27B with a lot of success. I would say its performance on my use case (modifying and extending a ~70 kloc C++ code base) has been excellent. I have no benchmarks, but it seems comparable to claude sonnet 4.6 in capability. I run it with llama.cpp:
llama-server -m Qwen3.6-27B-Q8_0.gguf -c 131072 --tensor-split 0.4,0.6 --batch-size 256 --cont-batching --flash-attn on -ngl 999 --threads 16 --jinja
I regularly get ~22 tok/s when context utilization is below ~65k, but it does slow down to ~13 tok/s when the context is nearly full (lots of swapping to RAM). I have been using the qwen-code harness, though, since it is far more token efficient than claude-code, which injects massive prompts that chew up the context window. I plan on trying it with pi next.
I'm keeping my ~$20/mo claude subscription for the planning prompts, and then I hand off to qwen for implementation. It's been working well so far.