jdlshore today at 2:49 PM

“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.


Replies

zdragnar today at 7:36 PM

I've noticed something similar with AI-assisted books as well. Early on the model does all right, but after a few chapters the beginning of each chapter repeats the end of the previous one, and obvious LLM tells become more frequent.

The more it has to go on, the more it relies on repetition of what came before. It's also possible that authors start paying much less attention and put less effort into editing later chapters.

Despite the sheer volume on Amazon, LLMs are not at the point of writing well.

qsort today at 4:16 PM

I think it's downstream of "you can't optimize for two different objectives".

If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.

If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.

Animats today at 7:03 PM

That may be the same problem seen when prompts try to force "alignment" or "guardrails". There's a performance drop. Seemingly, a big chunk of the potential solution space has been made unreachable.

For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.

That was last year. Is it happening with the frontier models?

nijave today at 4:36 PM

Hmm, I have some anecdotal evidence this is true. When interactively working out a plan with Opus, on multiple occasions it came up with an incompatible solution; I'd add additional context/requirements, and it tended to "anchor" on its original architecture and struggle to adapt. Sometimes it tries to sneak in changes for the original plan anyway.

jeremyjh today at 4:20 PM

Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.

I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.

Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.

xienze today at 4:55 PM

> their performance drops when forced to navigate explicit architectural rules

Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.