Hacker News

Right-sizes LLM models to your system's RAM, CPU, and GPU

253 points by bilsbie | yesterday at 11:15 PM | 61 comments

Comments

BloondAndDoom today at 5:55 AM

This is pretty cool and useful, but I wish it were a website. I don't like the idea of running an executable for something that could perfectly well be done as a website. (Other than some minor features; tbh you could even enable CORS and still check the installed models from a web browser.)

Sounds like a fun personal project though.

omneity today at 8:38 AM

This is a great project. FYI, all you need are an LLM's size and your memory amount & bandwidth to know whether it fits and the tok/s you'll get.

It’s a simple formula:

llm_size = number of params * size_of_param

So a 32B model at 4-bit needs a minimum of 16 GB of RAM to load.

Then you calculate

tok_per_s = memory_bandwidth / llm_size

An RTX 3090 has ~960 GB/s of memory bandwidth, so a 32B model (16 GB of VRAM) will produce 960/16 = 60 tok/s.

For an MoE, the speed is mostly determined by the number of active params, not the total LLM size.

Add a ~10% margin to those figures to account for a number of details, but that's roughly it. RAM use also increases with context window size.
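The two back-of-the-envelope formulas above can be sketched in a few lines of Python (a rough estimate that ignores KV-cache and runtime overhead; the function names are made up for illustration):

```python
def llm_size_gb(n_params_billion: float, bits_per_param: float) -> float:
    """llm_size = number of params * size of each param.
    1B params at 8 bits = 1 GB, so divide bits by 8."""
    return n_params_billion * bits_per_param / 8

def tok_per_s(memory_bandwidth_gbps: float, size_gb: float) -> float:
    """Each token requires reading roughly the whole model from memory,
    so throughput is bandwidth divided by model size."""
    return memory_bandwidth_gbps / size_gb

size = llm_size_gb(32, 4)       # 32B model at 4-bit -> 16.0 GB
speed = tok_per_s(960, size)    # ~960 GB/s card -> 60.0 tok/s
print(f"{size:.0f} GB, {speed:.0f} tok/s")
```

For an MoE you would pass the active parameter count to `tok_per_s` but the total count to `llm_size_gb`, since all weights must fit in memory while only the active ones are read per token.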

mittermayr today at 9:55 AM

This is visually fantastic, but while trying it out, it told me I can't run Qwen 3.5 on my machine, even though it's running in the background right now, coding. So I'm not sure what the true value of a tool like this is, other than a first glimpse, perhaps. Also, with unsloth providing custom adjustments, some models listed as undoable become doable, and those aren't in the tool. Again, not trying to be harsh; it's just a really hard thing to do properly. And like many similar tools, the maintainer here will eventually struggle with the fact that models are popping up left and right faster than anyone can keep up with.

kamranjon today at 4:02 AM

This is a great idea, but the models seem pretty outdated: it's recommending things like Qwen 2.5 and StarCoder 2 as perfect matches for my M4 MacBook Pro with 128 GB of memory.

lacoolj today at 7:53 PM

As a few others have noted already - this should just be a website, not a CLI tool. We can easily enter our CPU, RAM, GPU specs into a form to get this info.

est today at 5:11 AM

Why do I need to download & run it just to check it out?

Can I just submit my gear specs in some dropdowns to find out?

0xbadcafebee today at 2:46 PM

This probably catches ~85% of cases, and you could possibly do better. For example, some AMD iGPUs are not covered by ROCm, so you instead rely on Vulkan support. In that case you can sometimes pass driver arguments to let the driver use system RAM to expand VRAM, or to specify the "correct" VRAM amount (on iGPUs, system RAM and VRAM are physically the same thing). You then carefully choose how much system RAM to give up, balancing the two to avoid either an OOM on one side or too little VRAM on the other. Do this and you can pick models that wouldn't otherwise load. It's especially useful with layer offloading and quantized MoE weights.

minchok today at 8:58 PM

Thanks, it is helpful and easy to use!

andsoitis today at 4:29 AM

Claude is pretty good at making model recommendations if you input your system specs.

windex today at 5:58 AM

What I do is ask Claude or Codex to run models on Ollama, test them sequentially on a bunch of tasks, and rate the outputs. Thirty minutes later I have a fit. It even tested the abliterated models.

castral today at 4:14 AM

I wish there were more support for AMD GPUs on Intel Macs. I saw some people on GitHub getting llama.cpp working with them. Could that be added here in the future if the backend supports it?

ff00 today at 6:35 AM

Found this website (not tested): https://www.caniusellm.com/

asimovDev today at 7:20 AM

As someone who's very uneducated when it comes to LLMs, I'm excited about this. I'm still struggling to understand the correlation between system resources and context, e.g. how much memory I need for N tokens of context.

I've recently been using local models for coding agents, mostly because I got tired of waiting for Gemini to free up, constantly retrying to get some compute time on the servers for my prompt, like being a university student in the '90s waiting for your turn to compile your program on the university computer. I tried Mistral's Vibe and it would easily run out of context on a small project (not even 1k lines, but multiple files and headers) at 16k or so, so I cranked it up to the maximum supported in LM Studio, but I wasn't sure whether I was slowing it to a halt by doing that (it did take about 10 minutes for my prompt, "rewrite this C codebase into C++", to finish).
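The context-to-memory relationship the comment above asks about can be estimated with the standard KV-cache formula: two tensors (K and V) per layer, each of size `n_kv_heads * head_dim` per token. A sketch, using a hypothetical Llama-3-8B-like configuration (the numbers are illustrative assumptions, not from the thread):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Memory for the KV cache: 2 tensors (K and V) per layer,
    each storing n_kv_heads * head_dim values per token, at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Hypothetical config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
gib = kv_cache_bytes(32, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB for a 32k-token context")  # 4.0 GiB
```

This is memory on top of the weights themselves, which is why a model that "fits" at a short context can OOM when you max out the context window.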

manmal today at 5:30 AM

Slightly tangential: I'm test-driving an MLX Q4 variant of Qwen3.5 32B (MoE, 3B active), and it's surprisingly capable. It's not Opus, ofc. I'm using it for image labeling (food ingredients) and I'm continuously blown away by how well it does. Quite fast, too, and parallelizable with vLLM.

That’s on an M2 Max Studio with just 32GB. I got this machine refurbed (though it turned out totally new) for €1k.

railka today at 2:01 PM

Congratulations on the launch! It's useful for Ollama users, for example; LM Studio already has built-in hints in its interface.

AndrewAndrewsen today at 8:58 AM

Awesome project! I recently ran a (semi-)crowdsourced quality benchmark for models ≤20B.

How do you benchmark them? It would be awesome to implement that on the page as well. I will link to this project from https://mlemarena.top/

sneilan1 today at 5:08 AM

This is exactly what I needed. I've been thinking about making this tool. For running and experimenting with local models this is invaluable.

throwaway2027 today at 12:00 PM

More params at a lower quant, or fewer params at a higher quant?

dotancohen today at 4:30 AM

In the screenshots, each model has a use case of General, Chat, or Coding. What might be the difference between General and Chat?

fwipsy today at 4:06 AM

Personally I would have found a website where you enter your hardware specs more useful.

api today at 11:57 AM

Read the headline and thought it rescaled LLMs down to fit your hardware. That would be fascinating, though it would degrade performance.

Any work on that? Say I have 64 GB of memory and want to run a 256B-parameter model. At 4-bit quantization that's 128 GB and usually works well; 2-bit usually degrades it too much. But what if you could drop parameters instead of precision? That would probably require a fine-tuning run afterward, so it'd be very compute-intensive.
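The arithmetic behind the trade-off above is just params × bits. A quick sketch of where the quantization levels land for the hypothetical 256B model (weight memory only, ignoring KV cache and overhead):

```python
def model_bytes(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory: params * bits / 8 bytes."""
    return n_params * bits_per_param / 8

for bits in (16, 8, 4, 2):
    gb = model_bytes(256e9, bits) / 1e9
    print(f"{bits}-bit: {gb:.0f} GB")
# 4-bit -> 128 GB, 2-bit -> 64 GB: only 2-bit squeezes into 64 GB of memory,
# hence the appeal of pruning parameters rather than cutting precision further.
```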

esafak today at 5:18 AM

I think you could make a GitHub Pages site out of this.
