logoalt Hacker News

kgeistyesterday at 6:25 PM7 repliesview on HN

Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get with each day. If you remove enough abstractions and code directly to the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.


Replies

Aurornisyesterday at 10:10 PM

> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.

show 2 replies
egeskotoday at 9:55 AM

I'll add to this: What if chips were designed for the model? What would happen if we moved from digital to analog (vectors are not represented as bits, but instead as voltages)? Could the compute heavy matrix multiplications be done via op-amps? And could this analog approach be way more efficient than the limitations of bit representation?

xtractoyesterday at 7:37 PM

This takes me to the famous FizzBuzz High performance codegolf answer [1]. If we could implement optimizations like that for the inferences, maybe we could increase the speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

show 1 reply
mirsadmyesterday at 7:27 PM

I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.

show 1 reply
joshmarlowyesterday at 6:58 PM

Another suggestion for optimizing local inference - the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't - so if your parser can handle the quirks of the specific model, then you get higher-performing functionality.

didipyesterday at 9:17 PM

What if PyTorch is extended to have a pluggable compiler? For M GPU types and N models, if the backend allows, run a specialized compiler?

p_stuart82today at 1:30 AM

this feels closer to ATLAS/FFTW than a model runner. the generated kernel ages out, the tuning harness is the bit you actually want to keep.