Hacker News

api — yesterday at 11:57 AM

Read the headline and thought it rescaled LLMs down for your hardware. That would be fascinating but would degrade performance.

Any work on that? Say I have 64GB of memory and I want to run a 256B-parameter model. At 4-bit quantization that's 128 gigs and usually works well; 2 bits usually degrades it too much. But what if you could lose data instead of precision? It would probably require a fine-tuning run afterward, so very compute intensive.
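The arithmetic in the comment checks out; here's a quick back-of-envelope sketch (approximate, since real quant formats like GGUF carry per-block scale/metadata overhead, and the KV cache and activations need memory on top of the weights):

```python
# Rough weight-storage footprint at various quantization levels.
# Ignores quantization-format overhead and runtime memory (KV cache,
# activations), so treat these as lower bounds.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"256B model @ {bits:>2}-bit: ~{weight_memory_gb(256, bits):.0f} GB")
```

At 4 bits a 256B model needs ~128 GB for weights alone, and only the 2-bit case (~64 GB) squeezes into 64GB of memory, which is why dropping parameters rather than precision is the interesting alternative here.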


Replies

riidom — yesterday at 8:10 PM

LM Studio has an option on model load that I believe does what you're describing here: "K Cache Quantization Type" (and similar for "V"). It's marked as experimental, and the docs say the effect is basically hard to predict. Never tried it myself, though.