nickpinkston • yesterday at 6:21 PM • 8 replies

This is very cool to see - seems like soooo much efficiency waiting to be unlocked at the chip level.

What's everyone think of Taalas?

They're actually burning the LLM model into the silicon, with some onboard memory for fine-tuning. They claim huge cost / latency wins.

Super fast demo live at: https://chatjimmy.ai/

https://taalas.com/

https://www.reddit.com/r/singularity/comments/1r9frzk/taalas...

Replies

jsenn • today at 12:33 PM

Their demo is almost unbelievably fast, but as I understand it, the limitation of Taalas's strategy is the KV cache. It grows with context length, so it either needs to be stored in SRAM (small) or streamed in (slow). Even for a tiny model like the Llama 8B they have in their demo, the KV cache will be ~64 KB per token at 8-bit quantization, so at a 1,000-token sequence length you are already at ~64 MB of SRAM for a single user. This is probably why their demo only lets you generate 1,000 tokens: they can't go beyond that without slowing down inference.
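For anyone who wants to check the ~64 KB per token figure, here is a rough sketch of the arithmetic. The layer count, KV-head count, and head dimension are assumptions taken from the public Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128); nothing here is confirmed about what Taalas actually runs.

```python
def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bits=8):
    """Bytes of KV cache added per generated token.

    Defaults are assumed from the public Llama 3 8B config, not from Taalas.
    """
    # One key vector and one value vector per KV head, per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim
    return elements * bits // 8


per_token = kv_cache_bytes_per_token()
total_1k = per_token * 1_000

print(f"{per_token / 1024:.0f} KiB per token")          # 64 KiB per token
print(f"{total_1k / 1e6:.1f} MB for 1,000 tokens")      # 65.5 MB for 1,000 tokens
```

So the estimate holds up: roughly 64 KiB per token, and about 65 MB for a single 1,000-token sequence.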

So I'm curious what their strategy is. It seems to me that the options are:

1. Target smaller use cases that can live with a tiny context window
2. Use huge amounts of SRAM (at which point they look like Groq or Cerebras)
3. Make it up with extreme KV-cache compression/quantization
4. Run linear-attention/sliding-window-attention models
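To get a feel for how much options 3 and 4 could buy, here is a back-of-the-envelope sketch. It starts from the 64 KiB per token figure at 8 bits; the 4-bit setting and the 4,096-token window are purely illustrative assumptions, and it only models sliding-window attention (linear attention keeps a fixed-size state instead of a per-token cache, so it doesn't fit this formula).

```python
BYTES_PER_TOKEN_8BIT = 65_536  # 64 KiB per token at 8 bits, the estimate above


def kv_cache_bytes(seq_len, bits=8, window=None):
    """KV-cache size for one sequence.

    bits   -- cache precision (option 3: quantize harder)
    window -- sliding-window length in tokens (option 4); None means full attention
    """
    # With a sliding window, only the most recent `window` tokens stay cached.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return cached_tokens * BYTES_PER_TOKEN_8BIT * bits // 8


baseline = kv_cache_bytes(1_000)                  # 8-bit, full attention
quantized = kv_cache_bytes(1_000, bits=4)         # hypothetical 4-bit cache
windowed = kv_cache_bytes(100_000, window=4_096)  # hypothetical 4,096-token window

print(baseline, quantized, windowed)
```

Quantization only scales the footprint by a constant factor, while a sliding window caps it no matter how long the sequence gets.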

Other commenters have mentioned robotics as a potential application, which sounds interesting.

kccqzy • yesterday at 8:53 PM

> seems like soooo much efficiency waiting to be unlocked at the chip level

Well, if you are exclusively using general-purpose GPUs, of course you leave so much efficiency on the table. That’s why Google started making TPUs more than a decade ago. I remember the kerfuffle when Google fired Timnit Gebru: Gebru’s paper used GPUs to calculate the environmental impact of LLMs while ignoring the efficiency of TPUs, which basically made Jeff Dean very angry because of that wide efficiency gap.

Catloafdev • yesterday at 6:27 PM

It'd be cool to see more of this type of thing, but I have to imagine the ability for it to be updated to a brand-new model as new models come out is limited. If that is the case, it's going to be an extremely hard sell.

martythemaniak • yesterday at 6:35 PM

In a chatbot, 17k tok/s is a neat but nearly useless showcase. In a coding agent it is a meaningful improvement. In robotics, it could be an absolute revolution.

8B models aren't useful in general, but for specific use cases they can provide an enormous amount of intelligence - Nvidia's Tesla/Waymo competitor is a 7B LLM with a 2B diffusion model, and running that at those speeds could be an order of magnitude cheaper than existing solutions.
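A quick sketch of what 17k tok/s means per token, and how many tokens would fit into one control-loop tick. The control rates below are hypothetical round numbers picked for illustration, not figures from any particular robot or from Taalas.

```python
TOKENS_PER_SECOND = 17_000  # throughput figure quoted above


def tokens_per_tick(control_hz, tok_per_s=TOKENS_PER_SECOND):
    """Tokens that can be generated within one control-loop period."""
    return tok_per_s / control_hz


per_token_us = 1e6 / TOKENS_PER_SECOND
print(f"{per_token_us:.1f} microseconds per token")

# Hypothetical control-loop rates, for illustration only.
for hz in (10, 100, 1_000):
    print(f"{hz} Hz loop: {tokens_per_tick(hz):.0f} tokens per tick")
```

At a hypothetical 100 Hz loop that is 170 tokens per tick, which is what makes per-tick model output plausible at all.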

typ • today at 2:52 AM

Low latency is nice. But it would be more interesting if they could demonstrate the efficiency of energy consumption.

rebeccajae • yesterday at 9:11 PM

It seems technically interesting, but they seem very sparse on details. I don't know if I like the idea of a single unchanging model forever on a chip. How much more expensive would the silicon be if they used rewritable ROM for the weights? Such an arrangement would permit fine-tunes of the model it was designed for, which might minimize concerns about the model becoming outdated.

dcchambers • yesterday at 8:04 PM

I think hardware like this is the future for LLM-providers once we reach a point where the models aren't advancing much any more. You could argue we're close now.

The hyperscalers like AWS will make great use of these to serve up models that will be relevant for several years. But right now, we're still seeing significant bumps in model quality every couple of months - especially with open-weight models like DeepSeek/Kimi/GLM.

Until that point, though, I don't see how this is ever going to be cost effective vs general purpose hardware.

I also think we'll see miniature versions of this baked into mobile hardware for super fast and efficient on-device LLMs.
