Hacker News

akhrail1996 · today at 6:18 AM · 24 replies

Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.


Replies

arialdomartini · today at 7:56 AM

This is anecdotal, but just a couple of days ago some colleagues and I conducted a little experiment to gather exactly that evidence.

We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra, etc.) discuss the request and distill a solution. They all had access to the source code of the project to be worked on.

Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.

The council of agents reached a very good result, consuming about $12, mostly using Opus 4.6.

To our surprise, going straight with a single prompt in Claude Code reached a similarly good result, faster, consuming about $0.30 and mostly using Haiku.

This surely deserves more investigation, but our hypothesis so far is that coordination and communication between agents carry a remarkable cost.

Should this be the case, I personally would not be surprised:

- The reason we humans separate jobs is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.

- Job separation has an inherently high cost and just does not scale. Most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point where processes turn into bureaucracy. In IT companies, many problems sit at the interface between groups, because of low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
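The comparison boils down to two call shapes. A rough sketch (the `ask` stub is hypothetical and just records which model was invoked; a real version would call a model API):

```python
# Sketch of the two setups compared above. `ask` is a stand-in for a
# real model call; here it only counts invocations so the shapes show.
calls = []

def ask(model, prompt):
    calls.append(model)
    return f"[{model}] answer to: {prompt[:40]}"

PERSONAS = ["architect", "business analyst", "security expert",
            "developer", "infra"]

def council(requirement):
    # One strong-model call per persona, plus a final synthesis call.
    opinions = [ask("opus", f"As {p}, analyze: {requirement}")
                for p in PERSONAS]
    return ask("opus", "Distill a solution from: " + " | ".join(opinions))

def single_prompt(requirement):
    # Same personas, but inlined into one prompt to one cheaper model.
    roles = ", ".join(PERSONAS)
    return ask("haiku", f"Acting as {roles} in turn, solve: {requirement}")

council("add rate limiting to the API")        # 6 strong-model calls
single_prompt("add rate limiting to the API")  # 1 cheap-model call
```

The council pattern pays for one strong-model call per persona plus a synthesis pass, while the single-prompt version packs the same personas into one cheap call, which is roughly where the cost gap comes from.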

moduspol · today at 3:44 PM

To me, such techniques feel like temporary cudgels that may or may not even help, and that will be obsolete in 1-6 months.

This is similar to telling Claude Code to write its steps into a separate markdown file, or use separate agents to independently perform many tasks, or some of the other things that were commonly posted about 3-6+ months ago. Now Claude Code does that on its own if necessary, so it's probably a net negative to instruct it separately.

Some prompting techniques seem ageless (e.g. giving it a way to validate its output), but a lot of these feel like temporary scaffolding that I don't see a lot of value in building a workflow around.

kybernetikos · today at 7:33 AM

There's a lot of cargo culting, but it's inevitable in a situation like this, where the truth is model-dependent and changing all the time, and people have built companies on the premise that they can teach you how to use AI well.

xnx · today at 9:30 PM

We are at the horseless carriage stage, where people are recreating the old system without considering whether it is necessary.

never_inline · today at 9:39 AM

I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.

Aider did an "architect-editor" split, where the architect is just a "programmer" that doesn't bother formatting the changes as diffs; a weak model then converts them into diffs, and they got better results with it. This is nothing like human teams, though.
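A minimal sketch of that split (the `chat` wrapper is a hypothetical stub, not Aider's actual code): the strong model reasons about the change in free-form prose, and the weak model only does the mechanical job of turning that prose into diffs.

```python
# Sketch of an architect/editor split in the spirit of Aider's, not
# its actual implementation. `chat` is a hypothetical model wrapper.
def chat(model, system, user):
    # Stub: a real version would call an LLM API with these messages.
    return f"({model}) {user}"

def architect_editor(request, code):
    # Strong model: describe the change in plain prose, no diff syntax.
    plan = chat("strong-model",
                "Describe the code changes needed. Plain prose, no diffs.",
                f"Request: {request}\nCode:\n{code}")
    # Weak model: mechanically convert the prose into a unified diff.
    diff = chat("weak-model",
                "Convert this change description into a unified diff. "
                "Do not invent changes not described.",
                plan)
    return diff
```

The design point is separation of concerns: the hard reasoning happens without the distraction of getting edit-format syntax exactly right.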

jaredklewis · today at 6:59 AM

> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based on their experience.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

lbreakjai · today at 8:58 AM

Using different models is a big one. In my workflow, I've got opus doing the deep thinking and kimi doing the implementation. It helps manage costs.

Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions: the worker cannot edit the plan, and the QA or planner can't modify the code. This is something I sometimes catch codex doing: modifying unrelated stuff while working.
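That kind of isolation can be enforced outside the prompt, too, by gating writes per role at the tool layer. A sketch with made-up role names and paths, not any particular tool's config:

```python
# Sketch: enforce per-role write permissions at the tool layer, so a
# misbehaving agent physically cannot touch files outside its lane.
from fnmatch import fnmatch

WRITABLE = {
    "planner": ["PLAN.md"],            # planner may only edit the plan
    "worker":  ["src/*", "tests/*"],   # worker may only edit code
    "qa":      [],                     # QA is read-only
}

def check_write(role, path):
    allowed = any(fnmatch(path, pat) for pat in WRITABLE.get(role, []))
    if not allowed:
        raise PermissionError(f"{role} may not write {path}")
    return True

check_write("worker", "src/app.py")   # ok
# check_write("worker", "PLAN.md")    # would raise PermissionError
```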

jumploops · today at 7:16 AM

After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From there you either iterate or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
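The flow described above reduces to a staged pipeline, roughly (the stage functions are placeholders, not real tools):

```python
# Sketch of the spec-first flow: decisions are made up front, handed
# to an implementation stage, then iterated on or turned into a PR.
def write_spec(need):
    # Up-front decisions: stack, packages, open questions resolved here.
    return {"need": need, "stack": "decided", "open_questions": []}

def implement(spec):
    return f"code implementing {spec['need']}"

def review(code):
    # Placeholder check; a real one would run tests, linters, or an
    # agent acting as reviewer.
    return "implementing" in code

def pipeline(need, max_rounds=3):
    spec = write_spec(need)
    for _ in range(max_rounds):
        code = implement(spec)
        if review(code):
            return code  # good enough: open a PR
    return None
```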

totomz · today at 7:17 AM

I think the splitting makes sense as a way to give more specific prompts and isolated context to different agents. The "architect" does not need the code style guide in its context; that could actually be misleading, containing information that drives it away from the architecture.

est · today at 7:19 AM

> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

There's a 63-page paper with a mathematical proof if you're really into this.

https://arxiv.org/html/2601.03220v1

My takeaway: AI learns from real-world texts, and real-world corpora tend to feature a role split of architect/developer/reviewer.

zingar · today at 10:45 AM

Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.

I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.

Is it useful to tell it "you are an architect"? I doubt it, but I don't have proof apart from getting reasonable results without it.

With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.

stavros · today at 10:52 AM

It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.

Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.
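A sketch of that cost routing, with illustrative per-million-token prices (the numbers and routing table are made up for the example, not real pricing):

```python
# Sketch: route roles to models by cost. Planning and review use the
# expensive model; implementation, the bulk of the tokens, is cheap.
COST_PER_MTOK = {"opus": 15.0, "sonnet": 3.0}   # illustrative prices
ROLE_MODEL = {"plan": "opus", "implement": "sonnet", "review": "opus"}

def run(role, tokens_mtok, ledger):
    model = ROLE_MODEL[role]
    ledger[0] += COST_PER_MTOK[model] * tokens_mtok
    return f"{model} did {role}"

ledger = [0.0]
for role, toks in [("plan", 0.1), ("implement", 1.0), ("review", 0.2)]:
    run(role, toks, ledger)
# With the heavy implementation step on sonnet, the total here is
# $7.50 instead of $19.50 if every phase ran on opus.
```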

Havoc · today at 8:38 AM

Yeah, always seemed pretty sus to me too.

At the same time, I can see a more linear approach doing something similar. When I ask for an implementation plan, that is functionally not all that different from an architect agent, even if it's not wrapped in such a persona.

Tarq0n · today at 8:12 AM

In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Machine learning models tend to have more pronounced bias in their design than LLMs though.

So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
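The classical intuition can be shown with a toy simulation: three models that are each wrong 30% of the time, with independent errors, are wrong only about 22% of the time under majority voting (the error rates and the independence assumption here are illustrative, not measured from any real models):

```python
# Toy illustration: three models, each wrong 30% of the time but with
# independent errors, voted together, are wrong far less often.
import random

def model(truth, error_rate, rng):
    return truth if rng.random() > error_rate else 1 - truth

def majority(votes):
    return 1 if sum(votes) > len(votes) / 2 else 0

rng = random.Random(0)
trials, single_wrong, ensemble_wrong = 10_000, 0, 0
for _ in range(trials):
    truth = rng.randint(0, 1)
    votes = [model(truth, 0.3, rng) for _ in range(3)]
    single_wrong += votes[0] != truth
    ensemble_wrong += majority(votes) != truth

# Expected: single model ~30% wrong, 3-model majority ~21.6% wrong
# (0.3**3 + 3 * 0.3**2 * 0.7 = 0.216), assuming independent errors.
print(single_wrong / trials, ensemble_wrong / trials)
```

Correlated errors, which LLMs fine-tuned from the same base model will have, shrink this gain, which is one argument for mixing models with different architectures and data.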

hakanderyal · today at 7:52 AM

One added benefit is that it allows you to throw more tokens at the problem. It's the most impactful benefit, even.

The way context and LLMs work requires this.

From my experience, no frontier model produces bug-free, error-free code on the first pass, no matter how much planning you do beforehand.

With 3 tiers, you spend your token and context budget in full across 3 phases: plan, implement, review.

If the feature is complex, multiple rounds of review, each from scratch.

It works.

dep_b · today at 10:23 AM

“…if you give it good context…” that’s what the architect session is for basically. You throw around ideas and store the direction you want to go.

Then you execute it with a clean context.

A clean context is needed for maximum performance, without remembering implementation dead ends you already discarded.

jwilliams · today at 10:00 AM

If you know what you need, my experience is that a well-formed single-prompt that fits the context gives the best results (and fastest).

If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.

palmotea · today at 7:14 AM

> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.

fleetfox · today at 9:04 AM

Even just for reducing the context size it's probably worth it. If you have to go back and forth on both the problem and the implementation, even with these new "large" contexts, I find quality degrading pretty fast.

luxcem · today at 9:13 AM

The agent "personalities" and LLM workflow really looks like cargo-cult behavior. It looks like it should be better but we don't really have data backing this.

troupo · today at 6:42 AM

> produces better results than just... talking to one strong model in one session?

I think the author admits that it doesn't, doesn't realise it, and just goes on:

--- start quote ---

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet

--- end quote ---

awesome_dude · today at 7:51 AM

I have been using different models for the same role: asking (say) Gemini, then, if I don't like the answer, asking Claude, then telling each LLM what the other one said to see where it all ends up.

Well I was until the session limit for a week kicked in.

imiric · today at 6:55 AM

Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.

Maybe you should write and share your own article to counter this one.
