This is not a general-purpose chip but one specialized for high-speed, low-latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.
Tech summary:
- 15k tok/sec on an 8B dense 3-bit quant (Llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of career experience across AMD, Nvidia, and others, and $200M in VC funding so far. Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
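A rough sketch of that napkin math in Python, using the ~$0.20/mm² figure and the 880mm² die / 8B-parameter pairing from the summary above. I'm assuming, purely for the arithmetic, that the 8B model maps onto one die; another comment below puts it at 10 chips, which would scale the per-parameter cost accordingly.

```python
# Back-of-napkin die cost, using the figures quoted above (all rough assumptions).
COST_PER_MM2_6NM = 0.20    # ~$0.20 per mm^2 of finished 6nm wafer (assumed)
DIE_AREA_MM2 = 880         # die size from the spec summary
PARAMS_ON_DIE = 8e9        # assume the full 8B model fits on one die

die_cost = COST_PER_MM2_6NM * DIE_AREA_MM2
cost_per_billion = die_cost / (PARAMS_ON_DIE / 1e9)

print(f"raw silicon per die:       ~${die_cost:.0f}")          # ~$176
print(f"silicon per 1B parameters: ~${cost_per_billion:.0f}")  # ~$22, i.e. the '$20 per 1B' figure
# Ignores yield loss, packaging, test, and margin, which dominate real pricing.
```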
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
> Certainly interesting for very low latency applications which need < 10k tokens context.
I’m really curious whether context size will matter much if you use methods like Recursive Language Models[0]. That approach recursively breaks a huge amount of context down across smaller subagents, each working on a symbolic subset of the prompt.
The challenge with RLM seemed to be that it burns through a ton of tokens in exchange for more accuracy. If tokens are cheap, RLM could be beneficial here, providing much more accuracy over large contexts than the underlying model can handle on its own.
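For a sense of what that recursion looks like, here's a minimal sketch of recursive context decomposition. It is not the exact algorithm from [0], and `call_model` is a hypothetical placeholder for whatever inference endpoint you're hitting; the point is that token cost multiplies with the number of sub-calls, which is the trade-off mentioned above.

```python
# Minimal sketch of recursive context decomposition (not the exact RLM algorithm).
# `call_model(prompt)` is a hypothetical wrapper around your inference API.
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your inference endpoint here")

def recursive_answer(question: str, context: str, chunk_chars: int = 8_000) -> str:
    # Base case: context is small enough for the model's usable window.
    if len(context) <= chunk_chars:
        return call_model(f"Context:\n{context}\n\nQuestion: {question}")

    # Recursive case: split the context and let a "subagent" handle each piece.
    chunks = [context[i:i + chunk_chars] for i in range(0, len(context), chunk_chars)]
    partials = [recursive_answer(question, chunk, chunk_chars) for chunk in chunks]

    # Merge the partial answers with another call; recurse again if the merged
    # material itself still exceeds the window.
    merged = "\n---\n".join(partials)
    return recursive_answer(question, merged, chunk_chars)
```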
At $20 a die, they could sell Game Boy-style cartridges for different models.
> 880mm^2 die
That's a lot of surface, isn't it? As big as an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (826mm² on TSMC N7) or H100 (814mm² on TSMC N5).
> The larger the die size, the lower the yield.
I wonder if that applies here? What's the big deal if a few parameters have a few bit flips?
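The usual first-order reason bigger dies yield worse is that the chance of catching at least one killer defect grows with area. A minimal sketch using the classic Poisson yield model; the defect density here is an assumed illustrative value, not a TSMC N6 figure, and real fabs use more elaborate models plus redundancy/repair.

```python
import math

# First-order Poisson yield model: Y = exp(-D0 * A).
# D0 is defect density (defects per mm^2); 0.001/mm^2 is an assumed
# illustrative value, not a real TSMC N6 number.
D0 = 0.001

def poisson_yield(area_mm2: float, d0: float = D0) -> float:
    return math.exp(-d0 * area_mm2)

for area in (100, 432, 814, 880):
    print(f"{area:4d} mm^2 -> ~{poisson_yield(area):.0%} yield")

# Whether a given defect is fatal is a separate question: a flipped weight bit
# might be tolerable, but a defect in control logic or routing usually isn't.
```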
If we're heading toward really smart robots, it will be interesting to see what kinds of different model chips they can produce.
Don’t forget that the 8B model requires 10 of said chips to run.
And it's a 3-bit quant, so roughly a 3 GB RAM requirement.
If they ran the 8B model at native 16-bit, it would take ~60 H100-sized chips.
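Napkin math for that scaling, treating the quoted figures as given (the 10-chips-per-8B claim is from the comment above, not the vendor):

```python
# Rough memory / chip-count scaling for different weight precisions.
PARAMS = 8e9
CHIPS_AT_3BIT = 10          # claim from the comment above

def weight_gb(params: float, bits: int) -> float:
    return params * bits / 8 / 1e9

gb_3bit = weight_gb(PARAMS, 3)     # ~3 GB
gb_16bit = weight_gb(PARAMS, 16)   # ~16 GB

# If weight capacity per chip is the binding constraint, chip count scales with bits:
chips_16bit = CHIPS_AT_3BIT * 16 / 3   # ~53 chips, i.e. the "~60" ballpark

print(f"3-bit weights:  ~{gb_3bit:.0f} GB, {CHIPS_AT_3BIT} chips (claimed)")
print(f"16-bit weights: ~{gb_16bit:.0f} GB, ~{chips_16bit:.0f} chips (scaled)")
```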
Hardware decoders make sense for fixed codecs like MPEG, but I can't see them making sense for small models that improve every 6 months.
There's a bit of a hidden cost here… the longevity of GPU hardware is greater, and it gets extended every time there's an algorithmic improvement. Whereas any efficiency gain in software that isn't compatible with this hardware will tend to accelerate its depreciation.
Do not overlook traditional irrational investor exuberance; we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.
Yeah, it's fast af, but in my own tests with large chunks of text it very quickly loses context / hallucinates.
This is insane if true - it could be super useful for data extraction tasks. Sounds like we could be talking in the cents-per-million-tokens range.
Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
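The trade-off behind that argument is the classic one: dynamic energy per operation scales roughly with C·Vdd², while gate delay blows up as Vdd approaches the threshold voltage. A first-order sketch using the alpha-power delay law; all constants are illustrative assumptions, not numbers for any real process.

```python
# First-order CMOS energy/delay trade-off (illustrative constants only).
C_EFF = 1.0     # normalized effective switched capacitance
V_TH = 0.35     # nominal threshold voltage (V), assumed
ALPHA = 1.3     # velocity-saturation exponent in the alpha-power law, assumed

def energy_per_op(vdd: float) -> float:
    return C_EFF * vdd ** 2                 # E ~ C * Vdd^2

def relative_delay(vdd: float) -> float:
    return vdd / (vdd - V_TH) ** ALPHA      # t ~ Vdd / (Vdd - Vth)^alpha

base_delay = relative_delay(1.0)
for vdd in (1.0, 0.8, 0.6, 0.5):
    print(f"Vdd={vdd:.1f}V  energy x{energy_per_op(vdd):.2f}  "
          f"delay x{relative_delay(vdd) / base_delay:.2f}")

# Halving Vdd roughly quarters energy per op but makes each op ~3x slower:
# fine for throughput-oriented designs, bad for latency-oriented ones.
```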
Sounds perfect for use in consumer devices.
Doesn't the blog state that it's now 4-bit (the first gen was 3-bit + 6-bit)?
An on-device reasoning model with that kind of speed and cost would completely change the way people use their computers. It would be closer to Star Trek than anything else we've ever had. You'd never have to type anything or use a mouse again.
This math is useful. Lots of folks are scoffing in the comments below. I have a couple of reactions, after chatting with it:
1) 16k tokens/second is really stunningly fast. There's an old saying about any factor of 10 being a new science, a new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, etc.
2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing ~12k tokens/second on Llama 2 13B at fp8. Knowing these architectures, that's likely a run batched 100+ deep, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is down in the milliseconds.
3) Jensen has these Pareto-curve graphs: for a given amount of energy and a given chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these chips probably do not shift the curve. A 6nm die vs a 4nm die is likely 30-40% bigger, draws that much more power, etc. If we take the numbers they give and extrapolate to an fp8 model (slower) on a smaller geometry (30% faster and lower power), then compare 16k tokens/second for Taalas to 12k tokens/s for an H200, these chips land in the same ballpark on the curve.
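Making that extrapolation concrete; every adjustment factor below is the rough guess from the paragraph above, so treat the output as an order-of-magnitude check, not a benchmark.

```python
# Napkin comparison of the two data points, using the adjustment factors guessed above.
TAALAS_TOK_S = 16_000        # ~16k tok/s, 3-bit weights, 6nm
H200_TOK_S = 12_000          # ~12k tok/s, fp8, newer node, heavily batched

# Hypothetical adjustments (pure guesses, as in the text):
FP8_VS_3BIT_SLOWDOWN = 0.6   # assume ~40% throughput hit moving 3-bit -> fp8
NODE_SPEEDUP_6_TO_4NM = 1.35 # assume ~35% gain from a newer process

taalas_adjusted = TAALAS_TOK_S * FP8_VS_3BIT_SLOWDOWN * NODE_SPEEDUP_6_TO_4NM
print(f"Taalas, adjusted to fp8 on a 4nm-class node: ~{taalas_adjusted:,.0f} tok/s")
# ~13k tok/s vs ~12k tok/s for the H200: same ballpark on raw throughput,
# but at single-request latency instead of a 100+ batch.
```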
However, I don't think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact, even if you already had a full datacenter of H200s running your model, you'd probably buy a bunch of these to do speculative decoding - it's an amazing use case for them. Speculative decoding relies on a smaller distillation or quant to draft the next N tokens; the big model verifies the drafts cheaply and only takes over where the two diverge.
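For anyone unfamiliar with the mechanism being described, here's a minimal greedy draft-and-verify sketch. The `draft_next`/`target_next` calls are hypothetical single-step generators for the small and large models; real systems verify a whole draft block in one batched forward pass and accept/reject against token probabilities rather than exact matches.

```python
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       draft_len: int = 4,
                       max_new: int = 128) -> List[int]:
    """Greedy draft-and-verify loop (simplified; see caveats above)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap/fast model drafts a short run of tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The big model checks the draft; accept tokens until the first mismatch.
        for tok in draft:
            verified = target_next(tokens)
            if verified == tok:
                tokens.append(tok)        # draft token accepted "for free"
            else:
                tokens.append(verified)   # diverged: fall back to the big model's token
                break
    return tokens
```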
Upshot: I think these will sell, even on a 6nm process, and the first thing I'd sell them for is speculative decoding for bread-and-butter frontier models. The thing I'm really very skeptical of is the 2-month turnaround. Getting leading-edge geometry turned around on arbitrary 2-month schedules is... ambitious. Hopeful. We could use other words as well.
I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.