> my first bet would be specifications and tests
You are missing another dimension: how easy it would be to migrate if adding a new feature hits a ceiling and the LLM keeps breaking the system.
Imagine all tests are passing and the code conforms to the spec, but everything is denormalized because the LLM thought this was a nice idea at the beginning, since no one mentioned that requirement in the spec. After a while you want to add a feature which requires a normalized table and the LLM keeps failing, but you also have no idea how this complex system works.
Don't forget that a very, very detailed spec is actually the code.
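To make the denormalization scenario concrete, here is a minimal sketch using Python's built-in sqlite3. The `orders` table, its columns, and the sample data are all hypothetical, made up for illustration. The spec-level test passes, yet the schema choice only bites once a later feature (changing a customer's email) arrives.

```python
import sqlite3

# Hypothetical denormalized schema: customer details are copied onto every order row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_name TEXT, customer_email TEXT, item TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_name, customer_email, item) VALUES (?, ?, ?)",
    [
        ("Ada", "ada@example.com", "keyboard"),
        ("Ada", "ada@example.com", "monitor"),
    ],
)


def items_for(email):
    """Behaviour the (hypothetical) spec asks for: list a customer's items."""
    rows = conn.execute(
        "SELECT item FROM orders WHERE customer_email = ? ORDER BY id", (email,)
    )
    return [item for (item,) in rows]


# The test derived from the spec passes; nothing in it mentions normalization.
spec_test_passes = items_for("ada@example.com") == ["keyboard", "monitor"]

# Later feature: "change my email". An update that misses one row leaves
# the same customer with two different emails, and no existing test notices.
conn.execute(
    "UPDATE orders SET customer_email = ? WHERE id = ?", ("ada@new.example", 1)
)
distinct_emails = conn.execute(
    "SELECT COUNT(DISTINCT customer_email) FROM orders WHERE customer_name = ?",
    ("Ada",),
).fetchone()[0]
```

With a separate `customers` table the update would touch one row; here correctness depends on every writer remembering to update every copy.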
Software engineering has always worked this way, just not to ICs.
“The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore. But that doesn’t necessarily mean we stop being rigorous, it could mean we should move rigor elsewhere.”
Direct reports, when delegated tasks by managers, produce non-deterministic outputs much faster than team leads/managers can review, understand or approve every diff. Being a manager of software developers has always been a non-deterministic form of software engineering.
> just like we don’t read assembly, or bytecode, or transpiled JavaScript
This makes sense because a given piece of higher-level code deterministically produces a given piece of lower-level code, while an LLM offers no such guarantee. If the transpiled JS code doesn't work, we can track the bug down in the minifier, etc., but one cannot figure out why an LLM fails at a task, especially considering that LLMs, even SOTA ones, can be strongly affected by even small prompt changes. Taking this into consideration, I don't think this is sound reasoning for why we don't need to review AI-generated code.
> The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore.
Exactly. However, this could also indicate a weaker review standard instead of just dropping review. We could also suggest an idea where devs mainly review code design or interfaces, leveraging their *taste*, while leaving strict logical reasoning, validation and testing to other tools or approaches. I'm not persuaded that the nature of LLM code generation must lead to completely abandoning code review.
Anyway, I'm not opposing this article, and its thinking about the coming shift is really good.
> If I had to roll out such a development process today, I’d make a standardized Markdown specification the new unit of knowledge for the software project. Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules. Those should be checked into the project repositories along with the implementing code. There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec. This specification, and not the code that materializes it, is what the team would need to understand, review, and be held accountable for.
The constant urge I have today is for some sort of spec or simpler facts to be continuously verified at any point in the development process; something agents would need to be aware of. I agree with the blog and think it's going to become a team sport to manage these requirements. I'm going to try this out by evolving my open source tool [1] (used to review specs and code) into a bit more of a collaborative & integrated plane for product specs/facts - https://plannotator.ai/workspaces/
I prefer "the bottleneck is understanding" framing.
The author is nibbling at the same problem ultimately, but i don't think "hey one strategy is we could just let cognitive debt accumulate so we can go faster!" is a particularly insightful tool in the toolbox. Don't misread me, i'm not denying it can be a valid strategy.
Instead i want to read about insightful strategies for optimising that system-wide bottleneck we have: understanding.
Tell me about how you managed to shift to a higher level of abstraction, tell me about how and when that abstraction leaks. Tell me how you reduced the amount of information that has to flow through the system bottleneck.
> We can’t leverage agents if our unit of work is still “add a new endpoint to the RESTful API”
Why not? You just make every task faster. Not everything has to be an uncontrollable rocket launch.
> We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work
Why? To build what? You can only build as fast as you understand the business and your users.
>... my first bet would be specifications ... and tests ... If I had to roll out such a development process today, I’d make a standardized Markdown specification the new unit of knowledge for the software project.
I've found that adopting RFC Keywords (e.g. RFC 2119 [1]; MUST, SHOULD, MAY) at least makes the LLM report satisfaction. I'd love to see a proper study on the usage of RFC keywords and their effect on compliance and effectiveness.
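As a rough illustration of why the RFC 2119 keywords are convenient in a Markdown spec: they are easy to pull out mechanically. The sketch below is only an assumption about how one might do that; the `extract_requirements` helper and the sample spec are hypothetical, and it says nothing about whether an LLM actually complies.

```python
import re
from collections import Counter

# RFC 2119 keywords. Negated forms come first so "MUST NOT" is not
# matched as a bare "MUST".
KEYWORDS = re.compile(
    r"\b(MUST NOT|SHALL NOT|SHOULD NOT|MUST|SHALL|SHOULD|"
    r"REQUIRED|RECOMMENDED|MAY|OPTIONAL)\b"
)


def extract_requirements(spec_text):
    """Return (line_number, keyword, line) for every uppercase keyword found."""
    found = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        for match in KEYWORDS.finditer(line):
            found.append((lineno, match.group(1), line.strip()))
    return found


# Hypothetical spec fragment, invented for this example.
SPEC = """\
# Checkout
- The API MUST reject orders with an empty cart.
- The API MUST NOT charge a card twice for the same order id.
- The UI SHOULD show a spinner while payment is pending.
- The client MAY cache the product list.
"""

requirements = extract_requirements(SPEC)
counts = Counter(keyword for _, keyword, _ in requirements)
```

A list like `requirements` could then be handed to a reviewer (human or agent) as a checklist, one entry per obligation.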
> We can stop reading LLM-generated code just like we don’t read assembly, or bytecode, or transpiled JavaScript; our high-level language source would now be another form of machine code
This is too weird for me. At least with programming languages I can consult the documentation and if the programming language isn’t behaving as documented, it’s obviously a defect and if you’re savvy enough you often have open channels that accept contributions. Can we say the same for Claude or other AI solutions?
The underlying mechanism is still the same: humans type and products come out.
So something which must be true if this author is right is that whatever the new language is (the thing people are typing into Markdown) must be able to express the same rigor in fewer words than existing source code.
Otherwise the result is just legacy coding in a new programming language.
> Rework is almost free
Is it? All the electricity and capital investment in computing hardware costs real money. Is this properly reflected in the fees that AI companies charge or is venture capital propping each one up in the hope that they will kill off the competition before they run out of (usually other people's) money?
The lesson I've learned from our new AI age is how little a large number of people who've worked in software development their entire careers understand software development.
I suppose all the money floating around AI helps dummify everything, as people glom on to narratives, regardless of merit, that might position them to partake.
What we actually have now is the ability to bang out decent quality code really fast and cheaply.
This is massive, a huge change, one which upends numerous assumptions about the business of software development.
...and it only leaves us to work through every other aspect of software development.
The approach this article advocates is to essentially pretend none of this exists. Simple, but will rarely produce anything of value.
This paragraph from the post gives you the gist of it:
> ...we need to remove humans-in-the-loop, reduce coordination, friction, bureaucracy, and gate-keeping. We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work, with the purview to make autonomous decisions. Rework is almost free so we shouldn’t make an effort to prevent incorrect work from happening.
As if the only reason we ever had POs or designers or business teams, or built consensus between multiple people, or communicated with others, or reviewed designs and code, or tested software, was because it took individual engineers too long to bang out decent code.
AI has just gotten people completely lost. Or I guess just made it apparent they were lost the whole time?
> Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules. Those should be checked into the project repositories along with the implementing code. There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec. This specification, and not the code that materializes it, is what the team would need to understand, review, and be held accountable for.
This just sounds like typical requirements management software (IBM DOORS for example, which has been around since the 90s).
It's kind of funny how AI evangelists keep re-discovering the need for work methods and systems that have existed for decades.
When I worked as a software developer at a big telecom company, I had no say in what the software was supposed to do. That was up to the software design people: they were the ones responsible for designing the software and defining all the requirements. I was just responsible for implementing that behavior in code.
My Amazon org's leadership has been obsessed with spec-driven development while individual engineers tell me the only use they have for it is to placate leadership. I'm tired.
I wonder if with the speed of iteration with AI the industry will switch back to waterfall. Clear documentation first so the LLM can easily produce what's being asked with a round of testing before going back to the documentation stage and running it again. History does repeat itself.
> We can stop reading LLM-generated code just like we don’t read assembly, or bytecode, or transpiled JavaScript; our high-level language source would now be another form of machine code.
My opinion is very close to this. Currently the reason it's bad not to review/test the code LLMs generate is that LLMs can sometimes generate bad code. But that's a bug that can be fixed. One day you'll have LLMs generating code consistently better than what a human could write. And then you just stop needing to review it. (And that's probably also when most programmers/developers get fired.)
Don't be surprised if one day LLMs start generating binaries directly. THAT will be impossible to read and cost more time to analyze.
> "I'd make a standardized Markdown specification the new unit of knowledge for the software project. ... There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec."
Agree, this is how you make the development loop more deterministic and ultimately autonomous. It's how I've been using coding agents myself for the past few months (by building my own to support this natively [1]).
If you have a spec you approve/agree on, have an agent code against it, and then have a review phase verify the implementation didn't drift from the spec (either by adding or removing features), you get to a position where you can trust the outcome.
There's still a lot to be said about spec definition and what if during implementation gaps are discovered, and that's where HITL comes into play.
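The drift check described above (nothing added, nothing removed) can be sketched as a set comparison. This assumes both the spec and the implementation can be reduced to comparable feature identifiers, which is the genuinely hard part and is not solved here; `check_drift`, `DriftReport`, and the feature names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    missing: frozenset  # in the spec, but not found in the implementation
    extra: frozenset    # in the implementation, but never specified

    @property
    def ok(self):
        return not self.missing and not self.extra


def check_drift(spec_features, implemented_features):
    """Compare two collections of feature ids and report drift in both directions."""
    spec = frozenset(spec_features)
    impl = frozenset(implemented_features)
    return DriftReport(missing=spec - impl, extra=impl - spec)


# Hypothetical feature ids, e.g. as extracted by a review agent.
report = check_drift(
    spec_features={"login", "logout", "password-reset"},
    implemented_features={"login", "logout", "remember-me"},
)
```

Anything in `report.missing` or `report.extra` is a candidate for the human-in-the-loop step: either the implementation is wrong or the spec has a gap.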
This could very well be a pattern that some teams evolve into. Specs are the new source -- they describe the architectural approach, as well as the business rules and user experience details. End to end tests are described here too. This all is what goes through PRs and review process, and the code becomes a build artifact.
"A sufficiently precise spec is code". I've read that somewhere here before.
So guardrails, i.e. a sufficiently precise spec and tests, will need to be as strict as the LLM is bad at getting the right context and asking back the right questions. I suppose at that point there's not much difference between a human engineer and it.
In short:
We will have code full of unknown bugs, that is unfixable.
The solution is to replace it with more of the same but with some new specification (fix some bug add some new feature).
And this will be done by using astounding amounts of compute in massive new data centres.
Yeah, this has been my process for months now.
I might even start my own blog to write about things I've found.
1. Always get the agent to create a plan file (spec). Whatever prompt you were going to yolo into the agent, do it in Plan Mode first so it creates a plan file.
2. Get agents to iterate on the plan file until it's complete and thorough. You want some sort of "/review-plan <file>" skill. You extend it over time so that the review output is better and better. For example, every finding should come with a recommended fix.
3. Once the plan is final, have an agent implement it.
4. Check the plan in with the impl commit.
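Step 4 is the easiest one to enforce mechanically. As a sketch only: the `plans/` directory, the `.md` convention, and the list of source suffixes below are assumptions of this example, not something the workflow above prescribes.

```python
from pathlib import PurePosixPath

PLAN_DIR = "plans"  # hypothetical location for plan files
SOURCE_SUFFIXES = {".py", ".ts", ".go", ".rs"}  # hypothetical "implementation" files


def commit_has_plan(changed_paths):
    """True if a commit that touches source code also includes a plan file."""
    paths = [PurePosixPath(p) for p in changed_paths]
    touches_code = any(p.suffix in SOURCE_SUFFIXES for p in paths)
    has_plan = any(
        p.parts[:1] == (PLAN_DIR,) and p.suffix == ".md" for p in paths
    )
    # Commits that touch no source code (docs only, say) are not required to carry a plan.
    return has_plan or not touches_code


with_plan = commit_has_plan(["src/export.py", "plans/add-export.md"])
without_plan = commit_has_plan(["src/export.py"])
docs_only = commit_has_plan(["README.md"])
```

The list of changed paths would come from whatever the CI system or a pre-commit hook provides; that wiring is left out here.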
The plan is the unit of work really since it encodes intent. Impl derives from it, and bugs then become a desync from intent or intent that was omitted. It's a nicer plane to work at.
From this extends more things: PRs should be plan files, not code. Impl is trivial. The hard part is the plan. The old way of deriving intent from code sucked. Why even PR code when we haven't agreed on a plan/intent?
This process also makes me think about how code implementation is just a more specific specification about what the computer should do. A plan is a higher level specification. A one-line prompt into an LLM is the highest level specification. It's kinda weird to think about.
Finally, this is why I don't have to read code anymore. Over time, my human review of the code unearthed fewer and fewer issues and corrections to the point where it felt unnecessary. I only read code these days so I can impose my preferences on it and get a feel for the system, but one day you realize that you can accumulate your preferences (like, use TDD and sum types) in your static prompt/instructions. And you're back to watching this thing write amazing code, often better than what you would have written unless you have maximum time + attention + energy + focus no matter how uninteresting the task, which you don't.
Entertaining flag name!
React team seems to really have set a precedent with their "dangerouslySetInnerHTML" idea.
Or did they borrow it somewhere?
I'm just curious about that etymology, of course the idea is not universally helpful: for example, for dd CLI parameters, it would only make a mess.
But when there's a flag/option that really requires you to be vigilant and understand the input and output and all edge cases, calling it "dangerous" is quite a feat!
> There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec.
As I understand, this is an unsolved problem.
I found that adding "philosophy" descriptions helps guide the tooling. No specs, just general vibes about what the point is, because we can't make everyone happy and it's not a goal of a good tool (I believe).
Technology, implementation may change, but general point of "why!?" stays.
> Then where does the rigor go? Similar to the Thoughtworks report, my first bet would be specifications (which is not the same as prompts) and tests (which is not the same as TDD).
This is what we're building for at Saldor (https://saldor.com). It's a hard problem, to get a team in the habit of writing good specs. Probably because it's a hard thing to do: thinking of the behavior of your program, especially at the edges. But I agree (biased) that this is probably the way forward for writing code in the near future. I'm excited to see other people thinking about it.
I feel like people who program in JavaScript or whose projects pull megabytes of dependencies, don’t get a moral right to complain about this. You guys just sit and calm down this time, you already said what you could.
Your app takes 20 seconds to load, pulling 50 megabytes of minified JS. Your backend is a mess of 20 Rust microservices, 300 megabytes docker image each.
Nobody has actually been reading and understanding code in your org for the past 15 years. And nobody has ever been responsible, everybody has just been job hopping for a 15% total comp bump.
Now the secret is out.
the irony is that AI is making this exact problem worse. ppl are generating entire codebases now without reading any of it -- the flag might as well be the default. the skill thats actually becoming scarce isnt writing code, its reading code you didnt write and knowing if its correct.
markdown became the language I hate the most thanks to LLMs and the spec-driven approach. everything feels so dumb right now in agentic coding. looping blindly and aimlessly until it compiles, then until the playwright server or whatever devtools shows that it somehow works. push the code, have an llm autoreview/autofix, push to prod, run a mythos (perfect name) to identify the bug that opus 4.7 created. loops on loops on loops of some kind of zombie processes running to a "goal" that everyone seems to mystify in talks to just hide the fact that we do nothing anymore. the bottleneck never was code. it was the gate that was keeping away the Elizabeth Holmes and SBF from software engineering and it just opened.
making the review artifact explicit feels like the part teams skip
> Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules.
LOL. I had to check if this was published on April 1st.
it's the most honest framing I've seen, but specs as the new source of truth is exactly what we promised ourselves with UML, then WSDL, then OpenAPI. the graveyard of "just make the artifact above the code authoritative" is long
I legit can't tell if this article is satire, or not.
very true. and we already know and agree with this.
user experience/what the app actually does >>> actually implementing it.
elon musk said this a looong time ago. we move from layer 1 (coding, how do we implement this?) to layer 2 thinking (what should the code do? what do we code? should we implement this? (what to code to get the most money?))
this is basic knowledge
Instead of accepting 20,000 lines of slop per PR (and never-ending combinatorial complexity), maybe we should aim to think about abstractions and how to steer LLMs to generate code similar to that of a skilled human developer. Then it could actually be a maintainable artifact by humans and LLMs alike.
Does this post mark the top of the hype train or is there still more to come?
Author here. I'm surprised to see this surfacing now. I just wanted to clarify, since apparently the post doesn't do a good job at it, that what I discussed there is not a methodology I advocate for. The point of the post was: ok, since there are organizations mandating to maximize speed by reducing time spent on typing code (or even mandating to maximize agents usage), is there a way we can meet that requirement while still preserving the rigor somewhere else?
This was a follow up to a previous article[1] and the pair tried to express what I still think today (using AI daily at work): every time I use AI for coding, to some capacity I'm sacrificing system understanding and stability in favor of programming speed. This is not necessarily always a bad tradeoff, but I think it's important to constantly remind ourselves we are making it.
[1] https://olano.dev/blog/tactical-tornado/