> Things like continuing education so your model knows about the latest NPM packages or world news are super important, but seem like they would require new chips.
They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so.
The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA.
Either way, 99.999% of the knowledge is static. You only need to cover the remaining 0.001%, either by retrieving it at inference time (RAG) or by applying a small low-rank update on top of the frozen weights (LoRA), and both can live on a much smaller (thus cheaper) disposable chip.
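To make the LoRA half of that concrete, here's a rough sketch in plain Python. It is not how any real chip would do it; the names (`lora_forward`, `param_counts`) and all the sizes are made up for illustration. The point is just that the big matrix `W` is never touched, and only the small `A` and `B` matrices would need to be swapped out.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W @ x + (alpha / r) * B @ (A @ x). W is never modified."""
    r = len(A)  # adapter rank
    base = matvec(W, x)              # frozen path ("expensive chip")
    delta = matvec(B, matvec(A, x))  # low-rank correction ("cheap chip")
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def param_counts(d_out, d_in, r):
    """Parameters in the frozen matrix vs. in the low-rank adapter."""
    return d_out * d_in, r * (d_in + d_out)

# Toy sizes, chosen only for illustration.
random.seed(0)
d_out, d_in, r = 8, 8, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0] * d_in

# With B at zero, the output is exactly the frozen model's output.
assert lora_forward(W, A, B, x) == matvec(W, x)

# For one hypothetical 4096x4096 layer with rank 8:
base, adapter = param_counts(4096, 4096, 8)
print(base, adapter)  # 16777216 65536
```

In that last example the adapter is about 0.4% of the layer's parameters, so nowhere near 0.001%, but it gives a feel for how small the replaceable part can be relative to the fixed part.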
The better solution would be to make part of the chip cluster something like an FPGA, which can be reprogrammed.
Text-to-speech or diagnostic equipment, where the core model is relatively small and never changes, seems like the ideal application. You might be able to fit something in the 25-30B parameter range at 2nm to 14A, but it would still need a way to update.
Large models are simply out of the question, in my opinion. If you need 400+ different chip designs, it’ll cost billions of dollars in tape-outs before you even make the first chip.