Hacker News

api — yesterday at 11:57 AM

Read the headline and thought it rescaled LLMs down for your hardware. That would be fascinating but would degrade performance.

Any work on that? Say I have 64GB of memory and I want to run a 256B-parameter model. At 4-bit quantization that's 128 gigs and usually works well; 2 bits usually degrades it too much. But what if you could lose data instead of precision? It would probably require a fine-tuning run afterward, so very compute intensive.
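The arithmetic in the comment checks out; here's a quick back-of-envelope sketch (approximate, since real quant formats like GGUF carry per-block scale/metadata overhead, and the KV cache and activations need memory on top of the weights):

```python
# Rough weight-storage footprint at various quantization levels.
# Ignores quantization-format overhead and runtime memory (KV cache,
# activations), so treat these as lower bounds.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"256B model @ {bits:>2}-bit: ~{weight_memory_gb(256, bits):.0f} GB")
```

At 4 bits a 256B model needs ~128 GB for weights alone, and only the 2-bit case (~64 GB) squeezes into 64GB of memory, which is why dropping parameters rather than precision is the interesting alternative here.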


Replies

riidom — yesterday at 8:10 PM

LM Studio has an option on model load that I believe does what you're describing here: "K Cache Quantization Type" (and similar for "V"). It's marked as experimental, and the docs say the effect is basically hard to predict. Never tried it myself, though.