Hacker News

MarcLore · yesterday at 10:49 AM · 5 replies

The form factor discussion is fascinating, but I think the real unlock is latency. Current cloud inference adds 50-200 ms of network overhead before you even start generating tokens. A dedicated ASIC sitting on PCIe could serve its first token with only microseconds of transport overhead.

For applications like real-time video generation or interactive agents that need sub-100ms response loops, that difference is everything. The cost per inference might be higher than a GPU cluster at scale, but the latency profile opens up use cases that simply aren't possible with current architectures.
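To make the trade-off concrete, here's a toy time-to-first-token budget. All numbers are illustrative assumptions, not measured figures: the 50-200 ms network range from above, a ~5 µs PCIe round trip, and made-up queueing and prefill costs.

```python
# Toy time-to-first-token (TTFT) comparison, cloud vs. local PCIe ASIC.
# Every number here is an illustrative assumption, not a benchmark.

def ttft_ms(network_ms: float, queue_ms: float, prefill_ms: float) -> float:
    """Total time before the first token reaches the caller, in ms."""
    return network_ms + queue_ms + prefill_ms

# Cloud: mid-range network overhead plus some queueing at the provider.
cloud = ttft_ms(network_ms=120.0, queue_ms=30.0, prefill_ms=80.0)

# Local ASIC: ~5 microseconds of PCIe transport, no shared queue.
local = ttft_ms(network_ms=0.005, queue_ms=0.0, prefill_ms=80.0)

print(f"cloud TTFT ~{cloud:.1f} ms, local TTFT ~{local:.3f} ms")
```

With these assumed numbers the network and queue terms vanish locally, but the prefill term stays: dropping the network only wins the sub-100 ms loop if prefill itself is small.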

Curious whether Taalas has published any latency benchmarks beyond the throughput numbers.


Replies

muyuu · yesterday at 12:47 PM

latency and control, plus reliable bandwidth and the associated costs - but this isn't a pull towards specialised hardware alone so much as towards local computing in general; specialised hardware is just its most extreme form

there are tasks that inherently benefit from being centralised away, like coordination of peers across a large area - and there are tasks that strongly benefit from being as close to the user as possible, like low-latency tasks and privacy/control-centred tasks

simultaneously, there's an overlapping pull to either side from the conflicting monetary interests of corporations and users - corporations want as much as possible under their control, especially when it's monetisable information (and at volume, most things are), while users want sole control of products, especially ones they pay for

we had dumb terminals being pushed as far back as the 1960s, then the "cloud", "edge computing", and every other cycle of consolidation vs segregation across the industry; it's not going to stop, because there's money to be made from the inherent advantages of each model, and even the industry leaders cannot prevent those advantages from being exploited by specialist newcomers

once a leader consolidates, it inevitably seeks to maximise profit, and in doing so it lowers the barrier for new alternatives

ultimately I think the market will never stop demanding your own *** computer under your control - ideally one you own outright - and only the removal of that option will kill the demand; meanwhile, businesses will never stop trying to control your computing, offering real advantages in exchange, only to chase ever-growing profitability until average users cycle back the other way

sowbug · yesterday at 6:02 PM

As scary as it sounds today, a lightning-quick, non-networked local LLM could provide value in an application like a self-driving car. It would sit a level below Waymo's remote human support: if the car couldn't figure out how to handle a weird situation, it could ask the LLM what to do, hopefully avoiding the need to phone home (and covering cases where it couldn't phone home).

cedws · yesterday at 2:16 PM

The network latency bit deserves more attention. I’ve been trying to find out where AI companies physically serve LLMs from, but it’s hard to find concrete information. If I’m sitting in London and use Claude, where are the requests actually served from?

The ideal world would be an edge network like Cloudflare for LLMs so a nearby POP serves your requests. I’m not sure how viable this is. On classic hardware I think it would require massive infra buildout, but maybe ASICs could be the key to making this viable.

BoredomIsFun · yesterday at 7:26 PM

No, not in milliseconds if you have a longish context. Prefill is very compute-heavy compared to decode.
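A quick sanity check on this point, using the common approximation that a dense transformer's forward pass costs roughly 2 × params FLOPs per token, with prefill processing the whole context. The model size, context length, and sustained throughput below are all assumed for illustration, not Taalas specs.

```python
# Rough prefill-compute estimate for a dense transformer.
# Approximation: forward pass ~ 2 * params FLOPs per token, and
# prefill must process every context token before the first output.
# All inputs are illustrative assumptions.

def prefill_seconds(params: float, context_tokens: int,
                    flops_per_s: float) -> float:
    """Pure compute time to prefill the context, in seconds."""
    total_flops = 2 * params * context_tokens
    return total_flops / flops_per_s

# Assumed: 7B-param model, 32k-token context, 400 TFLOP/s sustained.
t = prefill_seconds(params=7e9, context_tokens=32_000, flops_per_s=400e12)
print(f"~{t * 1000:.0f} ms of prefill compute before the first token")
```

Under these assumptions prefill alone takes on the order of a second, which is why shaving network microseconds doesn't help long-context TTFT much.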

cyanydeez · yesterday at 4:49 PM

I'd assume the next step is a small reasoning model, to demo whether inference speed can fill some intelligence gaps. Combine that with some RAG to see if there's a tension between intrinsic reasoning and pattern recognition.