Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

176 points • by yu3zhou4 • yesterday at 7:38 PM • 16 comments

Comments

Author here. In my opinion the README is the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

tom-wal • today at 8:10 AM

I feel like I learned twice as much in 10 minutes reading this as I did reading LLM for Dummies. Thank you

xuanlin314 • today at 2:10 AM

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

dwa3592 • yesterday at 10:11 PM

Very nice job on the README.

>>Physically, LLM is a file which contains a lot of float numbers.

aka atoms of the LLM.

GoldenJade • today at 2:56 AM

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

smy_smy • today at 4:13 PM

interesting!

juancn • yesterday at 9:42 PM

Looks interesting, it reminds me of the first llama.cpp, but better documented.

nazgulsenpai • yesterday at 8:41 PM

I love the documentation formatted in lessons. I can't wait to read through it.

cookiengineer • yesterday at 10:26 PM

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

sylware • today at 9:49 AM

I am looking for a plain and simple LLM inference implementation in C, and/or in x86_64 assembly, and/or in AMD GPU RDNA assembly.

Anybody?

einpoklum • yesterday at 10:13 PM

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
