logoalt Hacker News

Stagnantyesterday at 7:50 PM2 repliesview on HN

I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs.


Replies

amdiviayesterday at 8:11 PM

110+ Tok/s as another data point on the RTX 5090 (Gemma 4 31B QAT + MTP at UD-Q4_K_XL) (at peak used 27 GB of vram)

The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)

lmedinasyesterday at 9:07 PM

the problem of that setup is that it will run out of context pretty quick. So for coding agent it will limit your workflow very fast.