knorker • today at 3:30 PM

I'd expect it to require too much RAM bandwidth to be feasible.

RAM is really slow at silicon speeds. Very little is reachable in one clock cycle, unless the clock cycle is abysmally slow.

Replies

danbruc • today at 3:38 PM

No RAM. Instead of a general-purpose multiplier that multiplies an input by a weight stored in RAM, just have a multiplier that hardcodes the weight. In some sense, replace each weight with a specialized multiplier and wire them together with accumulators and activation functions in between, plus some registers for pipelining. If one goes for four-bit quantization, one could have sixteen optimized multipliers, one for each possible weight, and then one just selects and connects them according to the model weights and structure.
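To make the "one multiplier per possible weight" idea concrete, here is a rough software sketch, not a hardware description. It assumes the sixteen 4-bit weights are the signed values -8 to 7 (the comment does not say signed or unsigned), and the names `make_const_multiplier` and `MULTIPLIERS` are made up for illustration. Each constant multiplier reduces to a fixed set of shifts and adds, which is roughly what a synthesizer could turn into wiring and a few adders.

```python
def make_const_multiplier(weight):
    """Return a function that multiplies by one fixed weight using only shifts and adds.

    Assumption: weights are signed 4-bit values in the range -8..7.
    """
    if not -8 <= weight <= 7:
        raise ValueError("weight must fit in 4 signed bits")
    magnitude = abs(weight)
    # The set bits of the magnitude decide which shifted copies of the input are summed.
    shifts = [bit for bit in range(4) if (magnitude >> bit) & 1]
    sign = -1 if weight < 0 else 1

    def multiply(x):
        # No general multiplication here: a fixed pattern of shifts, adds and a sign flip.
        return sign * sum(x << s for s in shifts)

    return multiply


# One specialized multiplier for each of the sixteen possible weights.
MULTIPLIERS = {w: make_const_multiplier(w) for w in range(-8, 8)}
```

Building a model would then amount to picking `MULTIPLIERS[w]` for every weight `w` and connecting the outputs, with no weight ever fetched from memory.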

Example: if you have a neuron with 16 inputs, each 8 bits wide, and a 4-bit weight per input, you will have 16 specialized multipliers, each scaling its input by the corresponding weight. The 16 scaled inputs then feed into an adder tree and finally an activation function.
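The 16-input neuron described above could be modeled in software roughly like this. It is only a behavioral sketch under stated assumptions: inputs are unsigned 8-bit values, weights are signed 4-bit values (-8 to 7), and the activation is a ReLU, none of which the comment specifies. The names `adder_tree` and `make_neuron` are hypothetical.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, the way a hardware adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def make_neuron(weights):
    """Build a neuron whose 16 weights are fixed at construction time.

    Assumptions: signed 4-bit weights, unsigned 8-bit inputs, ReLU activation.
    """
    if len(weights) != 16 or any(not -8 <= w <= 7 for w in weights):
        raise ValueError("expected 16 signed 4-bit weights")
    # Each closure stands in for one specialized multiplier with its weight baked in.
    scalers = [lambda x, w=w: x * w for w in weights]

    def neuron(inputs):
        if len(inputs) != 16 or any(not 0 <= x <= 255 for x in inputs):
            raise ValueError("expected 16 unsigned 8-bit inputs")
        scaled = [scale(x) for scale, x in zip(scalers, inputs)]
        total = adder_tree(scaled)  # four levels of adders for 16 inputs
        return max(0, total)        # ReLU, chosen here only for illustration

    return neuron


weights = [3, -2, 7, 0, 1, -8, 5, 2, -1, 4, 6, -3, 0, 1, 2, -4]
neuron = make_neuron(weights)
result = neuron(list(range(16)))
```

With 16 inputs the tree is four adder levels deep, which is where the pipelining registers mentioned earlier would presumably go.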
