killcoder • today at 3:01 AM

I think 'actual parallelism' is a vastly easier and more fruitful way to get better performance out of these kinds of systems than pushing for faster single-threaded generation. Tool calls and responses are often embarrassingly parallel. Code generation tasks naturally have a dependency tree that can be unrolled into a fixed budget of parallelism, and tasks can be hierarchically decomposed into subtasks.
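To make "unrolled into a fixed budget of parallelism" concrete, here is a minimal sketch using `asyncio`. The task names, the `fake_generate` stand-in, and the budget of 2 are all made up for illustration; a real system would call a model or tool where the sleep is.

```python
import asyncio

async def run_dag(deps, work, budget=4):
    """Run each task once its dependencies finish, at most `budget` at a time."""
    sem = asyncio.Semaphore(budget)
    done = {name: asyncio.Event() for name in deps}
    results = {}

    async def run(name):
        # Wait for all upstream tasks before taking a slot from the budget.
        for d in deps[name]:
            await done[d].wait()
        async with sem:
            results[name] = await work(name, [results[d] for d in deps[name]])
        done[name].set()

    await asyncio.gather(*(run(n) for n in deps))
    return results

async def fake_generate(name, inputs):
    # Hypothetical stand-in for a model call that generates one unit of code.
    await asyncio.sleep(0.01)
    return f"{name}({','.join(inputs)})"

# Hypothetical dependency tree: `parser` and `printer` can run in parallel.
deps = {"types": [], "parser": ["types"], "printer": ["types"], "cli": ["parser", "printer"]}
results = asyncio.run(run_dag(deps, fake_generate, budget=2))
print(results["cli"])
```

Tasks wait on their dependencies before acquiring the semaphore, so a blocked task never holds one of the budgeted slots.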

It's the same asynchronous stream pattern we're used to dealing with in regular software engineering: a fixed thread pool and lots of work that can be scheduled concurrently. Since these are streams, we can do the compute incrementally to reduce the time-to-first-byte/token/response.

Since so many tool calls are inherently asynchronous, and subagent task decomposition can be modelled as such, the IO streams can be oversubscribed, and incoming responses can be priority queued.
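A rough sketch of what oversubscribing the IO and priority-queuing the responses could look like. The tool names, latencies, and priorities are invented, and `call_tool` is a placeholder for a real tool call or subagent. Lower numbers mean higher priority.

```python
import asyncio

async def call_tool(name, latency, priority, inbox):
    # Stand-in for a real tool call or subagent; the sleep models network latency.
    await asyncio.sleep(latency)
    await inbox.put((priority, name, f"result of {name}"))

async def main():
    inbox = asyncio.PriorityQueue()
    # (name, latency in seconds, priority) -- all hypothetical.
    calls = [("lint", 0.01, 2), ("search_docs", 0.01, 1), ("run_tests", 0.01, 0)]
    # Oversubscribe: launch everything at once, since the calls mostly wait on IO.
    tasks = [asyncio.create_task(call_tool(n, lat, p, inbox)) for n, lat, p in calls]
    await asyncio.gather(*tasks)
    # Drain the inbox: the most urgent response comes out first,
    # regardless of the order in which the calls returned.
    order = []
    while not inbox.empty():
        priority, name, result = inbox.get_nowait()
        order.append(name)
    return order

print(asyncio.run(main()))
```

This version waits for every call before draining, which keeps the example deterministic. In a streaming setup the consumer would presumably run alongside the calls and pull the most urgent response available at each step.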

On the intelligence front, it's incredible how much better frontier models perform when you just interrupt them every so often and go 'is that the best you can do?', or reiterate instructions, or repeat the overall goal. I find instruction following _so poor_, especially for 'presentation layer' aspects. Yet if I ask the model to rewrite its last response, it does so perfectly. Why can't the model do this 'internally' and save me having to say 'try again'?

Just because the 'model' is autoregressive doesn't mean the system as a whole needs to present a single stream of immutable text.
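As a sketch of what doing this "internally" might mean at the system level: wrap the model so the draft is revised before anything is shown to the user. Here `model` is any callable from prompt string to response string, which is an assumption and not a real client API, and the nudge wording is just the phrasing from above.

```python
from typing import Callable

def answer_with_revision(model: Callable[[str], str], prompt: str, rounds: int = 1) -> str:
    """Produce a draft, then ask the model to rewrite it before presenting it."""
    draft = model(prompt)
    for _ in range(rounds):
        # Repeat the overall goal and the draft, then ask for a rewrite.
        draft = model(
            f"Goal: {prompt}\n\nYour previous response:\n{draft}\n\n"
            "Is that the best you can do? Rewrite it, following the instructions exactly."
        )
    return draft

# Usage with a toy stand-in that just counts how often it is called.
calls = []
def toy_model(text: str) -> str:
    calls.append(text)
    return f"response #{len(calls)}"

final = answer_with_revision(toy_model, "Summarise the diff as a bullet list", rounds=2)
print(final)
```

The user only ever sees the final revision; the intermediate drafts stay inside the wrapper, at the cost of extra model calls per response.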

Replies

warmedcookie • today at 11:48 AM

I do this kind of parallelism with a little merge request tool I slopped together. I spin up multiple small agents, assign each a specific code review task (security, coding standards, etc.), and have them spit out a GitLab API draft JSON object with code examples for the MR, which I can deterministically validate. If an agent fails to include code examples (depending on the task) or doesn't match the proper JSON object schema, I have "ask it to try again" logic in place.

Works fine. Forcing LLMs to output parsable responses is a good workaround to get them to do what you want until they improve. It also lets you use the fast models (e.g. I spin up the Gemini 3.1 flash lite model for these tasks), so the tasks finish in seconds rather than minutes.
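For anyone curious, a validate-and-retry loop along these lines can be quite small. This is a generic sketch, not the tool described above: the `REQUIRED_KEYS` set is a made-up schema (not the real GitLab draft note format), and `model` is any callable from prompt string to raw output string.

```python
import json

REQUIRED_KEYS = {"body", "position"}  # hypothetical schema, for illustration only
FENCE = "`" * 3  # marker for a fenced code example inside the comment body

def validate(raw, need_code_example=True):
    """Return (parsed_object, None) if the output is usable, else (None, reason)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"not valid JSON: {e}"
    if not isinstance(obj, dict):
        return None, "expected a JSON object"
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    if need_code_example and FENCE not in str(obj["body"]):
        return None, "body has no fenced code example"
    return obj, None

def review(model, prompt, max_attempts=3):
    """Call the model until its output validates, feeding the rejection reason back."""
    feedback = ""
    for _ in range(max_attempts):
        obj, error = validate(model(prompt + feedback))
        if obj is not None:
            return obj
        feedback = f"\n\nYour last output was rejected ({error}). Try again."
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")
```

Because the check is deterministic, a cheap fast model can be retried a few times and still come out ahead of a slower one on wall-clock time.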
