Hacker News

Maxious yesterday at 9:47 AM

ICYMI: Unsloth has had some major breakthroughs today with the Qwen3.5 local models: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

With the Qwen3.5 35B A3B at Q4, I've got 200k context running at 62.98 tokens per second on a local RTX 5080 (16GB).
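For anyone wondering how a quant bigger than VRAM can still serve a huge context from a 16GB card: the usual llama.cpp trick is to keep the always-active weights on the GPU, push the MoE expert tensors to system RAM with a tensor override, and quantize the KV cache. A hedged sketch only; the model filename and exact flags are my assumptions, not Maxious's actual command:

```shell
# Sketch, not a verified setup: assumes a recent llama.cpp build and a
# hypothetical Qwen3.5-35B-A3B Q4 GGUF filename.
#   -c 200000              : 200k context window
#   -ngl 99                : offload all layers to the GPU where possible
#   -ot ".ffn_.*_exps.=CPU": keep the MoE expert tensors in system RAM
#   -fa, -ctk/-ctv q8_0    : flash attention + 8-bit KV cache to shrink
#                            the memory cost of 200k of context
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf -c 200000 -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" -fa -ctk q8_0 -ctv q8_0
```

With an A3B model only ~3B parameters are active per token, so the experts living in RAM costs far less throughput than it would for a dense 35B.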


Replies

danielhanchen yesterday at 11:54 AM

Oh, I didn't expect this to be on HN haha. But yes: for our new Qwen3.5 benchmarks, we devised a slightly different approach to quantization, which we plan to roll out to all new models from now on!

Kayou yesterday at 9:53 AM

Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had.

roxolotl yesterday at 12:24 PM

What method are you using to do that? I’ve been playing with llama.cpp a lot lately, trying to figure out the cleanest options for getting a solid context window with 32GB of VRAM and 64GB of system RAM.

mirekrusin yesterday at 11:12 AM

2x RTX 4090, Q8, 256k context, 110 t/s

cpburns2009 yesterday at 1:37 PM

Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.
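(That error usually just means the build predates the architecture being merged; support for new model families lands in llama.cpp master first. Generic steps for a source build; whether "qwen35moe" is merged yet is something to check in the llama.cpp repo, not something I'm asserting:)

```shell
# Rebuild llama.cpp from the latest master, where new architectures
# land first. Generic recipe only; verify actual Qwen3.5 support in the
# repo's release notes before relying on it.
git pull origin master
cmake -B build -DGGML_CUDA=ON     # enable CUDA; drop this flag for CPU-only
cmake --build build --config Release -j
./build/bin/llama-server --version  # confirm you're running the new build
```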

RS-232 yesterday at 12:23 PM

That’s intriguing. I have the same card; maybe I should give it a go. I’m curious about your CPU/RAM/storage capacity as well.

Any resources for configuring the local setup?

My entire home media stack is a single compose file in a WSL distro, so it would be cool if a local LLM worked the same way.
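llama.cpp does publish server container images, so it should slot into a compose stack like any other service. A hedged sketch of the equivalent `docker run` (image tag, model path, and flags are assumptions lifted from the llama.cpp Docker docs pattern, not a verified setup):

```shell
# Sketch only: translates directly into one more service in a compose
# file. Assumes the CUDA server image from llama.cpp's Docker docs and a
# hypothetical model path; check the docs for current image tags.
docker run --gpus all -p 8080:8080 \
  -v /srv/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080
```

The server exposes an OpenAI-compatible HTTP API, so the rest of the stack can talk to it like any hosted endpoint.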

jychang yesterday at 9:51 AM

Not really breakthroughs, more like bugfixes for their broken first batch.
