jiggawatts • today at 4:50 AM • 1 reply

> Until there is some drastic new hardware

For inference, a 10x improvement over a setup based on NVIDIA server GPUs is already possible, but volume production and the rest of the supply chain will take a while to catch up.

During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.
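To make the read-only point concrete, here is a rough back-of-envelope sketch. It assumes a dense model where generating one token at batch size 1 means streaming every weight once, and it ignores KV cache traffic and compute time. The function name and all numbers are hypothetical, chosen only for illustration; they are not measurements of any real HBM or HBF part.

```python
def max_decode_tokens_per_sec(weight_bytes: float, read_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed when each generated token
    requires one full read pass over the (static) weights."""
    return read_bandwidth_bytes_per_sec / weight_bytes


# Hypothetical example: 70e9 parameters stored at 1 byte each,
# served from memory with an assumed 3.5e12 bytes/s of read bandwidth.
weight_bytes = 70e9
read_bandwidth = 3.5e12

print(max_decode_tokens_per_sec(weight_bytes, read_bandwidth))
```

Since the weights are only ever read, the quantity that matters in this estimate is read bandwidth, not write speed or write endurance.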

NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.

New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.

There are also algorithmic improvements, like the recently announced Google TurboQuant.

Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.

Replies

zozbot234 • today at 5:08 AM

> Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

Isn't reading from flash significantly more power-intensive than reading DRAM? Anyway, the overhead of keeping weights in memory becomes negligible at scale because you're running large batches and sharding a single model over a large number of GPUs. (And that needs the crazy fast networking to make it work; you get too much latency otherwise.)
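The batching and sharding argument can be sketched as simple arithmetic. This is a simplified model, assuming one full pass over the weights per decode step that yields one token for every sequence in the batch, and an even split of weights across devices. The function names and figures are hypothetical.

```python
def weight_bytes_read_per_token(weight_bytes: float, batch_size: int) -> float:
    """Weight traffic attributed to each generated token when one pass over
    the weights serves every sequence in the batch."""
    return weight_bytes / batch_size


def weight_bytes_per_device(weight_bytes: float, num_devices: int) -> float:
    """Weights each device must hold if the model is sharded evenly."""
    return weight_bytes / num_devices


weight_bytes = 70e9  # hypothetical: 70e9 parameters at 1 byte each

for batch_size in (1, 16, 256):
    print(batch_size, weight_bytes_read_per_token(weight_bytes, batch_size))

print(weight_bytes_per_device(weight_bytes, 8))
```

In this model the per-token cost of reading weights falls in proportion to batch size, which is the sense in which the overhead becomes negligible at scale. The sketch does not model the interconnect traffic between shards, which is the part that needs fast networking.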
