Releasing this on the same day as Taalas's 16,000 token-per-second acceleration for the roughly comparable Llama 8B model must hurt!
I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.
Just tried this. Holy fuck.
I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.
This is a whole new paradigm of AI.
This is exceptionally fast (almost instant). What's the catch? The answer was there before I lifted the return key!
This is crazy!
Nothing to do with each other. This is a general optimization. Taalas's product is an ASIC that runs a tiny 8B model out of on-chip SRAM.
But I wonder how Taalas's product can scale. Making a custom chip for one single tiny model is very different from serving models with trillions of parameters to a billion users.
Roughly 53B transistors for every 8B params. For a 2T-param model, you'd need about 13 trillion transistors, assuming linear scaling. One chip uses 2.5 kW of power? That's roughly four H100 GPUs. How does it draw so much power?
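A minimal back-of-envelope sketch of that extrapolation. The 53B-transistor and 8B-parameter figures are from the comment above; the linear scaling to 2T parameters is an assumption, not a vendor spec.

```python
# Back-of-envelope: transistors per parameter, extrapolated linearly (assumption).
transistors_per_param = 53e9 / 8e9            # ~6.6 transistors per parameter
target_params = 2e12                          # hypothetical 2T-parameter model
needed_transistors = transistors_per_param * target_params

print(f"{transistors_per_param:.2f} transistors per parameter")
print(f"~{needed_transistors / 1e12:.1f} trillion transistors for a 2T-param model")
# -> ~13.3 trillion transistors, matching the rough figure above
```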
If you assume the frontier model is around 1.5 trillion parameters, you'd need an entire N5 wafer to run it. And if you then need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do only if you know you're going to use this exact model, unchanged, for years.
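A rough sanity check on the "entire N5 wafer" claim. The density and wafer-area numbers are my assumptions (published N5 logic density is roughly 130-140 MTr/mm², and an SRAM-heavy design would land elsewhere), not figures from Taalas.

```python
import math

# Ratio from the thread above, scaled to an assumed 1.5T-parameter frontier model.
transistors_per_param = 53e9 / 8e9
frontier_params = 1.5e12
needed_transistors = transistors_per_param * frontier_params   # ~9.9 trillion

n5_density_per_mm2 = 138e6               # assumed N5 logic density (MTr/mm^2 * 1e6)
wafer_area_mm2 = math.pi * 150**2        # full 300 mm wafer, ignoring edge loss

needed_area_mm2 = needed_transistors / n5_density_per_mm2
print(f"~{needed_area_mm2 / 1e3:.0f}k mm^2 needed vs ~{wafer_area_mm2 / 1e3:.0f}k mm^2 per wafer")
# -> roughly 72k mm^2 needed vs ~71k mm^2 available, i.e. about one full wafer
```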
Very interesting tech for edge inference, though. Robots and self-driving cars could make use of these in the distant future if the power draw comes down drastically. A 2.4 kW chip running inside a robot is not realistic; maybe a 150 W chip.