> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
This is definitely the case. I was talking to someone complaining about how LLMs don't work well.
They said it couldn't fix an issue it made.
I asked if they gave it any way to validate what it did.
They did not. Some people really are saying "fix this" instead of something like "function x is doing y when someone makes a request to it. Please attempt to fix x, then validate it by accessing the endpoint afterward and writing tests."
It's shocking that some people don't give it any real instruction or any way to check itself.
In addition, I get great results doing voice-to-text with very specific workflows: asking it to add a new feature where I describe which functions I want changed, then reviewing as I go rather than waiting for the end.
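As a minimal sketch of the "give it a way to validate" idea above: write a check the agent can run after its change instead of guessing. The handler and payload here are entirely made up for illustration; they stand in for whatever endpoint the agent was asked to fix.

```python
# Hypothetical stand-in for the function the agent was asked to fix.
def handle_request(payload):
    # "fixed" behavior: echo the id back with an ok status
    return {"id": payload["id"], "status": "ok"}

def test_endpoint_behavior():
    # The validation step the agent can run after its change,
    # instead of declaring victory without checking.
    response = handle_request({"id": 42})
    assert response == {"id": 42, "status": "ok"}
```

The point isn't this particular test; it's that the prompt names a concrete check ("access the endpoint and write tests") the agent can execute to validate its own fix.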
I have 30 years of experience delivering code and 10 years of leading architecture. My argument is that the only thing that matters is: does the entire implementation - code + architecture (your database, networking, the runtime that determines scaling, etc.) - meet the functional and non-functional requirements? Functional = does it meet the business requirements and UX; non-functional = scalability, security, performance, concurrency, etc.
I only carefully review the parts of the implementation that I know “work on my machine but will break once I put in a real world scenario”. Even before AI I wasn’t one of the people who got into geek wars worrying about which GOF pattern you should have used.
Except for concurrency, where it's hard to have automated tests, I care more about the unit tests (or, honestly, the integration tests) and testing for scalability than about the code itself. Your login isn't slow because you chose a for loop instead of a while loop. I have my agents run the appropriate tests after code changes.
I didn’t look at a line of code for my vibe coded admin UI authenticated with AWS cognito that at most will be used by less than a dozen people and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.
Code before AI was always the grind between my architectural vision and implementation
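One way to express the "test the non-functional requirement, not the loop style" point above is a latency-budget test the agent runs after changes. Everything here is invented for illustration: the `login` stub, the 500 ms budget, the sleep simulating work.

```python
import time

def login(user, password):
    # Hypothetical stand-in for the real login path;
    # the sleep simulates the actual work.
    time.sleep(0.01)
    return {"user": user, "token": "abc"}

def test_login_latency_budget():
    # What matters is meeting the budget, not for-loop vs. while-loop.
    start = time.perf_counter()
    result = login("alice", "s3cret")
    elapsed = time.perf_counter() - start
    assert result["token"]
    assert elapsed < 0.5  # invented non-functional requirement: 500 ms
```

An agent told to run this after every change gets a concrete pass/fail signal on the requirement that actually matters.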
It's not skill in talking to an LLM; it's the user's skill and experience with the problem they're asking the LLM to solve. LLMs work better for problems the prompter knows well and poorly for problems the prompter doesn't really understand.
Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude, and try again; this time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write.
I review most of the code I get LLMs to write and actually I think the main challenge is finding the right chunk size for each task you ask it to do.
As I use it more, I gain more intuition about the kinds of problems it can handle on its own vs. those I need to break down into smaller pieces before setting it loose.
Without research and planning, agents are mostly very expensive and slow at getting things done, if they can at all. With the right initial breakdown and specification of the work, however, they are incredibly fast.
You are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which aren't aligned with what the LLM wrote, but that doesn't mean it's wrong.
I know senior developers who are very radical about some nonsense patterns they think are much better than the alternatives. If they see code that doesn't follow them, they say it's trash.
Even so, you can guide the LLM to write the code as you like.
And you are wrong: a lot of it is in how people write the prompt.
I'm relatively forgiving of bugs that I kind of expect to happen... just from experience working with developers... a lot of the bugs I catch in LLMs are exactly the same as those I have seen from real people. The real difference is the turnaround time. I can stay relatively busy just watching what the LLM is doing while it's working... taking a moment to review more thoroughly when it's done with the task I gave it.
Sometimes I'll give it recursive instructions, such as "these tests are correct, please re-run the tests and correct the behavior until the tests pass as expected." Usually I'm more specific about the bugs, their nature, and how I think they should be fixed.
I do find that sometimes, when dealing with UI effects, the agent will go down a bit of a rabbit hole... I wanted an image zoom control, and the agent kept trying to do it all with CSS scaling, and the positioning was just broken. Eventually I told it that just using nested divs, scaling the img element itself, and using CSS positioning on the virtual DOM for the positioning/overflow would be simpler, and it actually did it.
I've seen similar issues where the agent will start changing a broken test instead of understanding that the test is correct and the feature is broken... or it will tell me to change my API/instructions when I WANT it to function a certain way and it's the implementation that is wrong. It's kind of weird, like reasoning with a toddler sometimes.
> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding
This makes me feel better about the amount of disdain I've been feeling about the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it to not go off the rails and require a lot of manual editing.
I think that dismissively disregarding the fundamental operation of LLMs like this is ungrounded. You are literally saying it isn't a skill issue while pointing out a different skill issue.
It is absolutely, unequivocally, patently false to say that the input doesn’t affect the output, and if the input has impact, then it IS a skill.
I will still take a glance every once in a while to satisfy my curiosity, but I have moved past trying to review code. I was happy with the results frequently enough that I do not find it to be necessary anymore. In my experience, the best predictor is the target programming language. I fail to get much usable code in certain languages, but in certain others it is as if I wrote it myself every time. For those struggling to get good results, try a different programming language. You might be surprised.
I think that code review experience is a big driver of success with LLMs, but my takeaway is somewhat different. If you've spent a lot of time reviewing other people's code, you realize the failures you see with LLMs are common failures, full stop. Humans make them too.
I also think reviewable code - that is, code specifically delivered in a manner that makes code review more straightforward - was always valuable, but now that generation costs have dropped, its relative value is much higher. So structuring your approach (including plans and prompts) to drive toward easily reviewed code is a more valuable skill than before.
> complain they aren't getting great results without a lot of hand holding
This is what I don’t understand - why would I “complain” about “hand holding”? Why wouldn’t I just create a Claude skill or analogue that tells the agent to conform to my preferences?
I’ve done this many times, and haven’t run into any major issues.
I thought I'd try to debunk your argument with a food analogy. I am not sure I succeeded, though. Judge for yourself:
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
> It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that other's don't.
Well, it's easily the simplest explanation, right?
Unfortunately, it is impossible to ascertain what is what from what we read online. Everyone is different and uses the tools in a different way. People also use different tools and do different things with them. And each person's judgment can be wildly different, like you are saying here.
We can't trust the measurements that companies post either because truth isn't their first goal.
Just use it or don't use it, depending on how it works out, imo. I personally find it marginally on the positive side for coding.
It's also always easier to blame the LLM when the developer doesn't work with it right.
That seems to make sense. Any suggestions to improve this skill of reviewing code?
I think a number of us more junior programmers especially lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time.
It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.
In my experience the differences are mostly in how the code produced by LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".
Garbage in, garbage out.
In my experience the differences are mostly between the chair and the keyboard.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
I dunno, I have extensive experience reviewing code, and I still review all the AI generated code I own, and I find nothing to complain about in the vast majority of cases. I think it is based on "holding it right."
For instance, I've commented before that I tend to decompose tasks intended for AI to a level where I already know the "shape" of the code in my head, as well as what the test cases should look like. So reviewing the generated code and tests for me is pretty quick because it's almost like reading a book I've already read before, and if something is wrong it jumps out quickly. And I find things jumping out more and more infrequently.
Note that decomposing tasks means I'm doing the design and architecture, which I still don't trust the AI to do... but over the years the scope of tasks has gone up from individual functions to entire modules.
In fact, I'm getting convinced vibe coding could work now, but it still requires a great deal of skill. You have to give it the right context and sophisticated validation mechanisms that help it self-correct as well as let you validate functionality very quickly with minimal looks at the code itself.
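The "I already know the shape of the code and the tests" decomposition a few comments up can be made concrete by writing the tests first and handing them to the agent. Everything below is an invented illustration (a `slugify` helper and its expected behavior), not anyone's actual project:

```python
import re

# Tests written before prompting the agent: they pin down the "shape"
# of the function, so the review is like reading a book you've read before.
def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_punctuation():
    assert slugify("Rock & Roll!") == "rock-roll"

# A plausible implementation an agent might produce against those tests.
def slugify(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse runs of non-alphanumerics
    return text.strip("-")                   # drop leading/trailing dashes
```

With the expected shape fixed up front, anything the agent does differently jumps out immediately during review, and the tests double as the validation mechanism that lets it self-correct.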