tgrowazay • last Saturday at 11:59 PM • 5 replies

LLM speed is roughly <memory_bandwidth> / <model_size> tok/s.

DDR4 tops out at about 27 GB/s

DDR5 can do around 40 GB/s

So for a 70B model at 8-bit quant (roughly 70 GB of weights to read per token), you will get around 0.3-0.5 tokens per second using RAM alone.
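
For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic. The function name and parameters are made up for illustration, and it assumes decoding is purely bandwidth-bound, with every weight read once per token. Real throughput will be lower.

```python
def estimate_tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper-bound estimate: memory bandwidth divided by model size.

    Assumes every weight is read once per generated token and that
    nothing else (compute, caches, PCIe) is the bottleneck.
    """
    # 1 billion params at 8 bits each is 1 GB, so scale by bits / 8.
    model_size_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures taken from the comment above (GB/s).
for name, bandwidth in [("DDR4", 27.0), ("DDR5", 40.0)]:
    rate = estimate_tokens_per_second(bandwidth, params_billion=70, bits_per_weight=8)
    print(f"{name}: about {rate:.2f} tok/s")
```

Dropping to a 4-bit quant halves the bytes read per token, so the same estimate doubles.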

Replies

someguy2026 • yesterday at 12:17 AM

DRAM speed is one thing, but you should also account for the data rate of the PCIe bus (and/or VRAM speed). But yes, holding it "lukewarm" in DRAM rather than on NVMe storage is obviously faster.

uf00lme • yesterday at 12:47 AM

Channels matter a lot: quad-channel DDR4 is going to beat dual-channel DDR5 most of the time.
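
A quick sketch of why, assuming the usual theoretical peak formula (channels × transfer rate × 8 bytes per 64-bit channel). The DDR4-3200 and DDR5-4800 speed grades are my own picks for illustration, not figures from this thread, and sustained bandwidth is lower than these peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel is 64 bits wide and moves one bus-width
    word per transfer; real sustained bandwidth is lower.
    """
    bytes_per_transfer = bus_width_bits / 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Illustrative speed grades (assumptions, not from the thread).
quad_ddr4 = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=3200)
dual_ddr5 = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=4800)

print(f"quad-channel DDR4-3200: {quad_ddr4:.1f} GB/s")
print(f"dual-channel DDR5-4800: {dual_ddr5:.1f} GB/s")
```

With these grades the quad-channel DDR4 setup comes out ahead; dual-channel DDR5 only catches up at around 6400 MT/s.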

xaskasdf • yesterday at 1:28 AM

yeah, actually, I'm bottlenecked af since my mobo got pcie3 only :(

vlovich123 • yesterday at 12:13 AM

Faster than the 0.2 tok/s this approach manages

zozbot234 • yesterday at 12:28 AM

Should be active param size, not model size.
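
To illustrate the difference, a minimal sketch assuming a bandwidth-bound decode where only the active parameters are read per token. The 12B-active mixture-of-experts model here is hypothetical, chosen only to show the effect, and the 40 GB/s figure is the DDR5 number from the parent comment.

```python
def tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Bandwidth-bound estimate using only the parameters read per token."""
    # 1 billion params at 8 bits each is 1 GB read per token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


bandwidth = 40.0  # GB/s, DDR5 figure from the parent comment

# Dense 70B model: all 70B parameters are active for every token.
dense = tokens_per_second(bandwidth, active_params_billion=70, bits_per_weight=8)

# Hypothetical MoE model with 12B active parameters per token.
moe = tokens_per_second(bandwidth, active_params_billion=12, bits_per_weight=8)

print(f"dense 70B: about {dense:.2f} tok/s")
print(f"MoE, 12B active: about {moe:.2f} tok/s")
```

For a dense model the two sizes are the same, so the original formula still holds there.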
