I wanna see an inference chip where the weights are part of the rom of the chip.
There would be 1 multiplier per weight (and since they're constant, the whole thing turns into a bunch of simple adders), and the total pipelined system throughput would be one token per clock cycle.
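To make the "constant multipliers turn into adders" point concrete, here is a minimal sketch, assuming unsigned integer weights. The function names are made up for illustration. A multiply by a fixed weight becomes one shifted copy of the input per set bit, and shifting by a constant is only wiring in hardware.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Return the positions of the set bits in a non-negative constant weight."""
    if weight < 0:
        raise ValueError("this sketch handles non-negative weights only")
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


def multiply_by_constant(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using only shifts and adds."""
    total = 0
    for bit in shift_add_terms(weight):
        total += x << bit  # a shift by a constant costs no logic in hardware
    return total


# Weight 74 = 0b1001010 has three set bits, so it needs three shifted
# copies of x and two adders to combine them.
print(shift_add_terms(74))          # [1, 3, 6]
print(multiply_by_constant(5, 74))  # 370
```

A real synthesis flow would likely go further (signed-digit recoding, sharing sub-sums between weights), but the per-weight cost already depends on the weight's bit pattern rather than on a full multiplier.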
That means you can probably have millions of users simultaneously using a single bit of silicon, with perhaps 500 million tokens per second coming out the output bus.
Downside is this chip would be huuuuge - a whole wafer.
Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.
Due to the speed the industry moves, you'd want to race from model weights to production super fast, make 50 wafers, use them for a year, then bin them when that model is obsolete.
By the way, you've seen Cerebras? It's not gone as far as what you described - loads of cores and RAM but you still load up the weights onto it as software and they need to be streamed into the chip for large models - but it is a whole wafer.
>> I wanna see an inference chip where the weights are part of the rom of the chip.
I've been wondering about that for a while now. For a lot of tasks putting weights in ROM is probably OK. OTOH:
>> There would be 1 multiplier per weight...
I'm not sure that is a good idea. Maybe if it's quantized down to 2 bits... Otherwise maybe a small ROM near each multiplier (or row of them, or whatever) so the multipliers could handle N distinct matrix operations without having to move the data from far away.
Another fun thought is to have a row of MAC units on DRAM so a DRAM row would be a vector. Row size might be 64 Kbit, or 8K weights if they're 8-bit. This also keeps the weights and calcs on the same chip. I'm not sure this would put enough multipliers on one chip though. Systolic arrays can have tens or hundreds of thousands of multipliers, each doing one op per clock cycle.
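The row arithmetic above, plus what one "row of MAC units" would compute, as a rough sketch. `row_mac` is a hypothetical name, and a plain Python list stands in for a DRAM row.

```python
ROW_BITS = 64 * 1024   # 64 Kbit DRAM row, as in the comment above
WEIGHT_BITS = 8        # 8-bit weights

weights_per_row = ROW_BITS // WEIGHT_BITS


def row_mac(row_weights: list[int], activations: list[int]) -> int:
    """Multiply-accumulate one row of weights against an activation vector."""
    if len(row_weights) != len(activations):
        raise ValueError("row and activation vector must have the same length")
    return sum(w * a for w, a in zip(row_weights, activations))


print(weights_per_row)                # 8192
print(row_mac([1, 2, 3], [4, 5, 6]))  # 32
```

Each row activation would then yield one dot product, so the MAC count per chip is bounded by how many rows can be open and computing at once.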
You don't need a single wafer, you can split the model into many smaller different chips and connect inputs/outputs.
Skip VHDL and directly go for GDSII / OASIS. Try to find similar vectors so you get re-usable blocks.
You can dynamically calibrate a chip by fine tuning output.
This may be extreme, or completely stupid, but why are we not using genetics to "grow" chips in a chemical soup yet? Like Verilog/VHDL, don't we have some language to express circuits using gene sequences?
> "Downside is this chip would be huuuuge - a whole wafer."
Why don't we have chips like that? If a CPU the size of a postage stamp can do x amount of performance, imagine how much performance you could get if you used an entire wafer of chips running in parallel. Obviously it would only suit certain use cases, like you couldn't fit an entire wafer in a phone, but still.
One token per clock cycle at 1B parameters would imply 2 ExaFLOPS (assuming 2 FLOPs per parameter per token and a clock of about 1 GHz), consuming about 10 kW.
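A back-of-the-envelope check of that figure. The 1 GHz clock and the 2 FLOPs per parameter (one multiply, one add) are assumptions that make the numbers line up; they are not stated in the comment. The energy per FLOP is simply what the 10 kW estimate would imply.

```python
PARAMS = 1_000_000_000      # 1B parameters, from the comment
FLOPS_PER_PARAM = 2         # assumption: one multiply and one add per weight per token
CLOCK_HZ = 1_000_000_000    # assumption: 1 GHz clock, one token per cycle
POWER_W = 10_000            # 10 kW, from the comment

flops_per_token = PARAMS * FLOPS_PER_PARAM
flops_per_second = flops_per_token * CLOCK_HZ
joules_per_flop = POWER_W / flops_per_second

print(flops_per_second / 1e18)   # throughput in ExaFLOPS
print(joules_per_flop * 1e15)    # implied energy in femtojoules per FLOP
```

Under these assumptions the throughput comes out at 2 ExaFLOPS, and the power figure implies roughly 5 fJ per operation.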
I've also been thinking about this. The forward pass of a transformer model also involves some heavier operations, though: normalization, reciprocals, exponentiations and other non-linearities (GeLU, SiLU), which may (though typically don't) involve learned weights as operands.
Supposedly memristors would be ideal for this (and it would be reprogrammable), but then again, memristors seem to be the carbon nanotubes of the computing world.
> weights [as] part of the rom of the chip
Not really that: you are pointing to Compute-In-Memory (CIM) - techniques where the data (here, a multiplier value) is part of the processor (here, the multiplying circuit).
The problem of "fetch and process" is bypassed completely at the architectural level: the data is already where the processing happens - it's not moved, so there is no fetch latency.
A firmware upgrade would mean flashing a huge BIN file.
How would the pipelining work when the next token depends on the last token?
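One possible answer, going by the "millions of users" remark earlier in the thread: a single sequence can only have one token in flight, but different users' sequences are independent, so they can be interleaved through the pipeline stages. Below is a toy simulation under that assumption. `simulate` and its parameters are hypothetical, and it ignores per-user state such as attention caches.

```python
from collections import deque


def simulate(depth: int, users: int, cycles: int) -> tuple[int, int]:
    """Return (total tokens emitted, tokens emitted for user 0)."""
    in_flight = deque()           # (finish_cycle, user) for tokens inside the pipeline
    ready = deque(range(users))   # users whose previous token has finished
    total = 0
    user0 = 0
    for cycle in range(cycles):
        # Retire the token leaving the last stage; its user may now issue again.
        if in_flight and in_flight[0][0] == cycle:
            _, user = in_flight.popleft()
            total += 1
            if user == 0:
                user0 += 1
            ready.append(user)
        # Issue at most one new token per cycle into the first stage.
        if ready:
            in_flight.append((cycle + depth, ready.popleft()))
    return total, user0


print(simulate(depth=4, users=1, cycles=40))  # (9, 9)
print(simulate(depth=4, users=4, cycles=40))  # (36, 9)
```

With one user the pipeline sits mostly empty and emits a token every `depth` cycles. With at least `depth` users it emits one token per cycle in aggregate once it has filled, while each individual user still waits `depth` cycles or more per token.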
> "Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights."
Brain science people “love” traumatic brain injury cases because it can help explore what happens when bits of the “brain wafer” get damaged. We’ve learned a lot from such things.
I wonder if people are intentionally “destroying” parts of the model weights to learn more about what happens? Like could you strategically wipe a gig of the model so it’s “all zeros” and see what happens?
I have to wonder
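The "wipe some weights and see what happens" experiment is easy to try on a toy scale. This sketch zeroes a random fraction of the weights in a single random linear layer and measures how far the output moves. It is only an illustration with made-up names, and says nothing about how a trained network would behave.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relative_error(fraction_zeroed: float, size: int = 64, seed: int = 0) -> float:
    """Relative L2 change in the output after zeroing a random fraction of weights."""
    rng = random.Random(seed)
    matrix = [[rng.gauss(0, 1) for _ in range(size)] for _ in range(size)]
    vector = [rng.gauss(0, 1) for _ in range(size)]
    # Each weight is independently "faulted" to zero with the given probability.
    damaged = [[0.0 if rng.random() < fraction_zeroed else w for w in row]
               for row in matrix]
    clean_out = matvec(matrix, vector)
    damaged_out = matvec(damaged, vector)
    num = sum((a - b) ** 2 for a, b in zip(clean_out, damaged_out)) ** 0.5
    den = sum(a ** 2 for a in clean_out) ** 0.5
    return num / den


for fraction in (0.0, 0.001, 0.01, 0.1):
    print(fraction, relative_error(fraction))
```

For a random layer like this the error grows with the fraction zeroed. Whether a real model tolerates it would depend on which weights are hit, which is the interesting part of the question.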
this appeared some time ago, https://taalas.com/, but I'm sure there's others thinking these same thoughts. this would be best for small models imo, nothing frontier because that changes too fast