lettergram • yesterday at 6:04 PM

We actually found that Mistral Small 4, quantized to 4-bit, was comparable to Qwen 3.6 27B and is roughly the same size. At least in our experience, on our use cases, quantizing the Mistral model worked far better than trying to quantize the Qwen family.

Fully agree with your point though: Mistral in general is far behind where I'd expect, and Qwen in particular is crushing it at the smaller sizes.

Personally, I'd consider anything 20B params and above a "medium" model, with small being <20B and large >100B. Obviously we can get to the huge 1-2T param models, but frankly the trade-off is kinda insane: a 1-2% accuracy improvement on many metrics for the speed hit.

Replies

rhdunn • yesterday at 8:22 PM

It's all relative. For local use I'd classify it by hardware (VRAM size) using FP8 or Q6 quantization:

1. tiny <2-3B -- easily runnable on lower-spec hardware

2. small 4-8B -- runnable on 8GB GPUs

3. medium 9-12B -- runnable on 12GB GPUs

4. large 13-24B -- runnable on 16GB (for the lower end models) and 24GB GPUs

5. very large 25-32B -- runnable on 32GB GPUs

6. huge >32B -- not easily runnable on consumer GPUs without compromising performance (offloading layers to the CPU/RAM), quality (heavy quantization, esp. at <= Q4), or price (investing in multi-GPU setups and/or server-grade hardware).

You could possibly split huge down further, as 70B models (e.g. Llama 3) are easier to get working than >120B models, and 1T models are completely intractable.
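
As a rough illustration of the sizing above, here is a minimal sketch that estimates the weights-only memory footprint for a given parameter count and maps it to the tiers in the list. It reads the list's sizes as parameter counts in billions and treats each tier's upper bound as an inclusive cutoff, which is my assumption. The names `weight_gb`, `tier`, and `BITS_PER_WEIGHT` are hypothetical, and the bits-per-weight values for Q6 and Q4 are approximations, since real quantized formats carry some per-block overhead.

```python
# Approximate bits per weight. FP16/FP8 are exact; the Q6 and Q4 values are
# rough assumptions, since real formats store extra per-block scale data.
BITS_PER_WEIGHT = {"fp16": 16.0, "fp8": 8.0, "q6": 6.5, "q4": 4.5}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real VRAM
    usage will be higher than this number.
    """
    return params_billion * bits_per_weight / 8


def tier(params_billion: float) -> str:
    """Map a parameter count (in billions) to the tiers listed above.

    Upper bounds are treated as inclusive cutoffs (an assumption; the
    list leaves small gaps between some tiers).
    """
    bounds = [
        (3, "tiny"),
        (8, "small"),
        (12, "medium"),
        (24, "large"),
        (32, "very large"),
    ]
    for upper, name in bounds:
        if params_billion <= upper:
            return name
    return "huge"


if __name__ == "__main__":
    for size in (7, 12, 27, 70):
        fp8 = weight_gb(size, BITS_PER_WEIGHT["fp8"])
        q6 = weight_gb(size, BITS_PER_WEIGHT["q6"])
        print(f"{size}B: {tier(size)}, ~{fp8:.1f} GB at FP8, ~{q6:.1f} GB at Q6")
```

At FP8 the weights take about one byte per parameter, which is why the parameter count in billions lines up so closely with the GPU size in GB; the headroom that is left goes to context (KV cache) and runtime overhead.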
