ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
With Qwen3.5 35B A3B at Q4 I've got a 200k context running at 62.98 tokens per second on a local RTX 5080 (16GB).
Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I've always restricted myself to models smaller than my VRAM.
What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32GB VRAM and 64GB system RAM.
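For anyone else wondering how a >20GB MoE model can run on a 16GB card: the usual trick with llama.cpp is to keep the dense/attention layers on the GPU and push the MoE expert tensors (only a few of which are active per token) into system RAM. The sketch below is an assumption about the kind of invocation involved, not the commenter's actual command; the model filename is made up, and flag spellings vary between llama.cpp versions, so check `llama-server --help` on your build.

```shell
# Sketch: serve a MoE GGUF larger than VRAM with llama.cpp.
# -ngl 99 offloads all layers to the GPU, then --override-tensor (-ot)
# matches the MoE expert weight tensors by regex and pins them to CPU RAM.
# Quantizing the KV cache (q8_0) reduces VRAM used by the long context.
llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 200000 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

The tradeoff is that expert tensors are read over PCIe/system RAM each step, so throughput depends heavily on RAM bandwidth; with only ~3B active parameters per token (the "A3B" part), the hit is much smaller than it would be for a dense 35B model.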
Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.
That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.
Any resources for configuring the local setup?
My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.
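A local LLM server can slot into a compose stack the same way. This is only a sketch under assumptions: the image tag, model path, and port are illustrative, and the NVIDIA device reservation requires the NVIDIA Container Toolkit to be set up in your WSL distro.

```yaml
# Hypothetical compose service for a llama.cpp server (verify the image tag
# against the project's published container packages before using).
services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    command: -m /models/model.gguf -c 32768 -ngl 99 --host 0.0.0.0 --port 8080
    volumes:
      - ./models:/models          # put your GGUF file here
    ports:
      - "8080:8080"               # exposes an OpenAI-compatible HTTP API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Once it's up, anything else in the stack can talk to it at `http://llm:8080` on the compose network.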
Not really breakthroughs, more like bugfixes for their broken first batch.
Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!