No RAM. Instead of having a general-purpose multiplier that multiplies an input by a weight stored in RAM, just have a multiplier that hardcodes the weight. In some sense, replace each weight with a specialized multiplier and wire them together with accumulators and activation functions in between, plus some registers for pipelining. If one goes for four-bit quantization, one could have sixteen optimized multipliers, one for each possible weight, and then one just selects and connects them according to the model weights and structure.
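A rough software sketch of the "sixteen optimized multipliers" idea, in Python rather than a hardware description language. It assumes the four-bit weights are signed two's-complement values from -8 to 7, which the comment above does not specify, and the names `make_const_multiplier` and `MULTIPLIERS` are made up for illustration. Each multiplier is reduced to the shifts and adds its constant needs, which is roughly what a synthesized constant multiplier boils down to.

```python
def make_const_multiplier(weight):
    """Build a multiplier for one fixed weight using only shifts and adds.

    Assumption: weight is a signed 4-bit value in the range -8..7.
    """
    magnitude = abs(weight)
    # Bit positions that are set in the constant; each one becomes a shifted copy of x.
    shifts = [bit for bit in range(magnitude.bit_length()) if (magnitude >> bit) & 1]

    def multiply(x):
        total = 0
        for s in shifts:
            total += x << s  # in hardware a fixed shift is just wiring
        return -total if weight < 0 else total

    return multiply


# One specialized multiplier per possible 4-bit weight (hypothetical lookup table).
MULTIPLIERS = {w: make_const_multiplier(w) for w in range(-8, 8)}

# "Selecting" a multiplier for a model weight of 5 and applying it to an input.
scale_by_5 = MULTIPLIERS[5]
print(scale_by_5(12) == 5 * 12)
```

Note that the weight never appears as data at run time; it only determines which shifted copies of the input get summed.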
Example. If you have a neuron with 16 inputs, each 8 bits wide, and a 4-bit weight per input, you will have 16 specialized multipliers, each scaling its input by the corresponding weight. The 16 scaled inputs then feed into an adder tree and finally an activation function.
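The 16-input neuron could be modeled in software roughly like this. It is only a behavioral sketch: the weights and inputs are arbitrary example values, ReLU is an assumed choice of activation function (the example above does not name one), and `adder_tree` and `build_neuron` are hypothetical names.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, the way a balanced adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def build_neuron(weights):
    """Bake the weights into per-input multipliers; nothing is looked up at run time."""
    # The default argument w=w freezes each weight into its own closure.
    multipliers = [lambda x, w=w: w * x for w in weights]

    def neuron(inputs):
        if len(inputs) != len(multipliers):
            raise ValueError("expected one input per hardcoded weight")
        scaled = [m(x) for m, x in zip(multipliers, inputs)]
        return max(0, adder_tree(scaled))  # ReLU, an assumed activation

    return neuron


# Arbitrary example values: 16 signed 4-bit weights and 16 signed 8-bit inputs.
weights = [3, -2, 7, 0, -8, 1, 5, -1, 2, 4, -3, 6, -5, 1, 0, -7]
inputs = [10, -20, 30, 127, -128, 5, 0, 64, -64, 8, 16, -32, 1, -1, 99, 3]

neuron = build_neuron(weights)
print(neuron(inputs))
```

With 16 inputs the tree has four levels of adders. Assuming signed values, an 8-bit input times a 4-bit weight fits in 12 bits, and the sum of 16 such products fits in 16 bits, so the adders can grow by one bit per level.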