I'd love to get to a point where big models can launch subagents that are fast and local. There's a lot of focus on token rate, but just as important is that cloud providers have other latencies and processing styles that aren't optimized for latency (running large batches all at once), and I think local might have some real wins there. Gemma 4 already seems to be on the right track. lfm2.5-8b-a1b (https://www.liquid.ai/blog/lfm2-5-8b-a1b) and DiffusionGemma both seem to have very high token rates. But getting that latency down, so that a series of tool calls can happen faster, would be a real win. I think that becomes much more possible, especially with good prompting.
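To make the token-rate vs. latency point concrete, here's a back-of-the-envelope sketch. Every number in it is made up for illustration (not a measurement of any provider or model), and `chain_seconds` is just a hypothetical helper name. It assumes the tool calls run strictly one after another, so the fixed per-call overhead (queueing, batching, network) is paid on every call.

```python
def chain_seconds(calls: int, overhead_s: float,
                  tokens_per_call: int, tokens_per_s: float) -> float:
    """Wall-clock time for a strictly sequential chain of tool calls.

    Each call pays a fixed overhead plus the time to generate its tokens.
    """
    return calls * (overhead_s + tokens_per_call / tokens_per_s)


# Hypothetical numbers, for illustration only.
# "Cloud": twice the token rate, but 1 s of fixed overhead per call.
cloud = chain_seconds(calls=10, overhead_s=1.0,
                      tokens_per_call=100, tokens_per_s=200)
# "Local": half the token rate, but only 0.125 s of overhead per call.
local = chain_seconds(calls=10, overhead_s=0.125,
                      tokens_per_call=100, tokens_per_s=100)

print(f"cloud: {cloud:.2f}s, local: {local:.2f}s")
```

With these assumed numbers the slower-generating local setup still finishes the chain first, because short tool-call responses are dominated by the per-call overhead rather than by generation time.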
One caveat: I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and can't be communicated with. My subagents really should be their own sessions that I can deal with as I please, with some MessageChannel-like tools available to them. Ideally there'd be modes where messages auto-flow in and out, and modes where I can act as a gatekeeper and approve each message. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...
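A minimal sketch of the two modes I mean, assuming a toy in-memory channel. `SubagentChannel` and its methods are hypothetical names, not the API of opencode, MessageChannel, or any real agent framework. In auto mode messages are delivered immediately; in gated mode they wait until a human approves or rejects them.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class SubagentChannel:
    """Toy one-way channel between a subagent and whoever is listening."""
    gated: bool = False  # False: auto-flow, True: human approves each message
    pending: Deque[str] = field(default_factory=deque)
    delivered: List[str] = field(default_factory=list)

    def send(self, message: str) -> None:
        # Gated messages are held for review; otherwise they flow straight through.
        if self.gated:
            self.pending.append(message)
        else:
            self.delivered.append(message)

    def approve_next(self) -> Optional[str]:
        # Release the oldest held message, or return None if nothing is waiting.
        if not self.pending:
            return None
        message = self.pending.popleft()
        self.delivered.append(message)
        return message

    def reject_next(self) -> Optional[str]:
        # Drop the oldest held message without delivering it.
        return self.pending.popleft() if self.pending else None


# Auto mode: messages go through untouched.
auto = SubagentChannel()
auto.send("tool result: ok")

# Gated mode: I decide what gets through.
gate = SubagentChannel(gated=True)
gate.send("run the tests")
gate.send("delete the repo")
gate.approve_next()  # lets "run the tests" through
gate.reject_next()   # drops "delete the repo"
```

A real version would need one channel per direction and some way to flip the mode mid-session, but the point is that the session stays reachable either way.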
Not really super related, but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...
Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!