Hacker News

yorwb — yesterday at 3:03 PM

There are ways to shard the model that require a lot of off-chip bandwidth, but there are also ways that don't. The only data that needs to be passed between layers is the residual stream, which requires much less bandwidth than the layer weights and KV cache, and you already need about that much bandwidth to get input tokens in and output tokens out. So putting different layers on different chips isn't that terrible.
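A back-of-envelope calculation shows why the residual stream is cheap to pass between chips. The dimensions below are illustrative assumptions for a 70B-class dense transformer, not any specific vendor's configuration:

```python
# Compare bytes that must cross a chip boundary per token (the residual
# stream) vs. bytes that stay resident on-chip (one layer's weights).
# All dimensions are assumed, illustrative values for a 70B-class model.

d_model = 8192        # hidden size (assumed)
bytes_per_val = 2     # fp16/bf16 activations and weights

# Residual stream crossing a chip boundary: one vector per token.
residual_bytes_per_token = d_model * bytes_per_val  # 16 KiB

# Weights of a single transformer block: roughly 12 * d_model^2
# parameters for a standard dense attention + MLP layer.
layer_weight_bytes = 12 * d_model**2 * bytes_per_val  # ~1.5 GiB

ratio = layer_weight_bytes / residual_bytes_per_token
print(f"residual per token:  {residual_bytes_per_token / 1024:.0f} KiB")
print(f"one layer's weights: {layer_weight_bytes / 2**30:.2f} GiB")
print(f"weights are ~{ratio:,.0f}x larger than one token's residual")
```

Under these assumptions, shipping a 16 KiB activation vector between chips is roughly five orders of magnitude cheaper than moving a layer's weights would be, which is the core of the argument above.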

Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology.


Replies

LtdJorge — yesterday at 10:32 PM

Yeah, but SRAM is stupidly fast, even compared to DRAM. Going to a different chip, even over a custom interconnect, takes a "lifetime", especially since you don't have L4 caching tricks to hide the latency — you're already at the SRAM level.
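To put rough numbers on that "lifetime": the latencies below are assumed orders of magnitude for illustration, not measured figures for any particular chip or interconnect:

```python
# Illustrative latency gap between an on-die SRAM access and a
# chip-to-chip interconnect hop. Both figures are assumed, rough
# orders of magnitude, not benchmarks of any real hardware.

sram_access_ns = 1.0         # on-die SRAM access, assumed ~1 ns
interconnect_hop_ns = 500.0  # cross-chip hop, assumed ~500 ns

accesses_forgone = interconnect_hop_ns / sram_access_ns
print(f"one cross-chip hop costs roughly {accesses_forgone:.0f} "
      f"SRAM accesses' worth of time")
```

Even with these generous assumptions, a single cross-chip hop costs hundreds of SRAM-access times — which is why sharding only works well when, as the parent comment notes, the data crossing the boundary (the residual stream) is small.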