Hacker News

Agents that run while I sleep

424 points | by aray07 | last Tuesday at 7:09 PM | 492 comments

Comments

fragmede | last Tuesday at 7:46 PM

Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)
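
The adversarial loop described above could be wired up roughly like this. This is purely my own sketch: `ask_codex` and `ask_gemini` are hypothetical stand-ins for whatever API or CLI you actually use, not real library calls.

```python
# Sketch of the adversarial review loop. The model-calling functions are
# injected as callables, so any Claude/Codex/Gemini client can be plugged in.
def adversarial_round(code, ask_codex, ask_gemini):
    # Tell Codex that "Claude" wrote the code; have it audit and write tests.
    tests = ask_codex(f"Claude wrote this code. Audit it and write tests:\n{code}")
    # Tell Gemini that "Codex" wrote the tests; have it audit them.
    critique = ask_gemini(f"Codex wrote these tests. Audit them:\n{tests}")
    # Feed Gemini's critique back to Codex as dobetter.md-style feedback.
    revised = ask_codex(f"Gemini thinks your tests are bad:\n{critique}\nDo better.")
    return tests, critique, revised
```

Keeping the models' outputs as plain strings means the loop can be repeated until the critique comes back empty, or capped at a fixed number of rounds.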

digitalPhonix | last Tuesday at 7:39 PM

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”

LittleBox | yesterday at 6:35 AM

I read this and think to myself “what does one need so much code for?”.

lrytz | last Wednesday at 7:14 AM

blog looks suspicious

- privacy policy links to marketing company `beehiiv.com`. the blog author doesn't show up there.

- the profile picture url is `.../Generated_Image_March_03__2026_-_1_55PM.jpg.jpeg`

i didn't dig or read further.

BeetleB | last Tuesday at 7:25 PM

I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!
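
One low-tech approximation of this freeze (my own sketch, not something from the thread): hash the test files before the agent runs, then fail the run if any hash changed.

```python
import hashlib
import pathlib

def snapshot(paths):
    """Record a content hash for each test file before the agent runs."""
    return {p: hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
            for p in paths}

def tampered(before):
    """Return the test files the agent modified (empty list == freeze held)."""
    return [p for p, digest in before.items()
            if hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest() != digest]
```

A CI step (or a git pre-commit hook) that calls `tampered()` and rejects a nonempty result gives the "frozen tests" guarantee without trusting the agent's prompt instructions.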

skyberrys | last Wednesday at 12:32 AM

To me the last paragraph was the highest-value part of the article: write out your test in plain language first, and then write the prompt for the autonomous agent using your own language and the test prompt, not the auto-generated code.

osigurdson | last Tuesday at 10:21 PM

I think the solution has to be end to end tests. Maybe first run by humans, then maybe agents can learn and replicate. I can't see why unit tests really help other than for the LLM to reason about its own code a little more.

Jeffrin-dev | last Wednesday at 4:38 PM

I am just getting started with Claude projects... Any useful things worth knowing that save free-tier limits?

davidshepherd7 | last Wednesday at 8:23 AM

On the off chance that the author reads this: can you enable an RSS feed please?

I want to subscribe, but I never end up reading newsletters if they land in my email inbox.

jaggederest | last Tuesday at 8:18 PM

Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.
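
The "regex-to-code plain English" idea can be mimicked in a few lines. This is a hypothetical sketch only; real cucumber/behave have much richer APIs, and all the names here are illustrative.

```python
import re

# Registry of (compiled regex, handler) pairs, gherkin-style.
STEPS = []

def step(pattern):
    """Register a handler for plain-English steps matching the given regex."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

@step(r"the user adds (\d+) items")
def add_items(state, n):
    state["items"] = state.get("items", 0) + int(n)

@step(r"the cart shows (\d+) items")
def check_cart(state, n):
    assert state["items"] == int(n), "cart count mismatch"

def run(scenario):
    """Execute plain-English steps against shared scenario state."""
    state = {}
    for line in scenario:
        for pattern, fn in STEPS:
            m = pattern.search(line)
            if m:
                fn(state, *m.groups())
                break
    return state
```

Usage looks like a Given/When/Then scenario: `run(["Given the user adds 2 items", "Then the cart shows 2 items"])`. The regex captures are what let non-programmers vary the numbers without touching code.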

shawntwin | last Wednesday at 2:20 AM

There seems to be a lot of preparing, planning, token buying, goal setting, and token cost just for niche-targeted vibe coding.

foundatron | last Tuesday at 9:00 PM

Feels like a whole bunch of us are converging on very similar patterns right now.

I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.

On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.

The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.

Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
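
The diff-truncation step described above could look roughly like this. A sketch under my own assumptions; the real autoissue.py surely differs in details.

```python
def truncate_diff(file_diffs, max_bytes=100_000, max_files=10, max_lines=3_000):
    """Keep the largest changed files, cap each at max_lines lines,
    and stop once the overall byte budget is spent.

    file_diffs maps path -> diff text for that file."""
    biggest = sorted(file_diffs.items(), key=lambda kv: len(kv[1]),
                     reverse=True)[:max_files]
    kept, used = [], 0
    for path, diff in biggest:
        clipped = "\n".join(diff.splitlines()[:max_lines])
        if used + len(clipped) > max_bytes:
            clipped = clipped[: max_bytes - used]  # spend what's left of the budget
        kept.append(f"--- {path}\n{clipped}")
        used += len(clipped)
        if used >= max_bytes:
            break
    return "\n".join(kept)
```

Sorting by size first means a single enormous generated file can't crowd out the nine smaller, likely more interesting, changes.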

gormen | last Wednesday at 5:04 AM

Different approach: copy the programmer's logic, not the agent's behavior.

voidUpdate | last Wednesday at 8:21 AM

"I've been building agents that write code while I sleep" and "I don't want to push slop" seem directly at odds with each other...

pokstad | last Wednesday at 3:45 AM

Took a super-intelligent AI for us to realize how important tests and TDD are.

nemo44x | last Wednesday at 3:45 AM

How do people not understand this? LLMs are goal machines. You need to give them a specific goal if you want good results, and continue to reinforce it. So of course this means speccing and design work.

People are so enamored with how fast the 20% part is now and yes it’s amazing. But the 80% part by time (designing, testing, reviewing, refactoring, repairing) still exists if you want coherent systems of non-trivial complexity.

All the old rules still apply.

akhrail1996 | last Wednesday at 3:18 AM

Honestly I think the "same AI checking same AI" concern is a bit overstated at this point. If the agents don't share context - separate conversations, no common memory - Opus is good enough that they don't really fall into the same patterns. At least at the micro level, like individual functions and logic. Maybe at the macro/architectural level there's still something there but in practice I'm not seeing it much anymore.

tempodox | last Wednesday at 4:25 PM

The cowboy gunslinging knows no bounds.

redanddead | last Wednesday at 8:04 AM

Great, so I can wake up to a nuked git repo and a wiped drive.

misja111 | last Wednesday at 8:03 AM

> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

Good luck doing that in any company that does something meaningful. I can't believe anybody can seriously be ok with such a workflow, except maybe for your little pet project at home.

Tepix | last Wednesday at 9:44 AM

Just because you can let Claude run overnight doesn't mean it makes sense if you can no longer review what it has done.

If you don't review the result, who is going to want to use or even pay for this slop?

Reviewing is the new bottleneck. If you cannot review any more code, stop producing new code.

sergiotapia | last Wednesday at 1:48 AM

None of this really answers the problem that all this slop is being produced at record pace yet still has to be absorbed into the company and its practices, and reviewed by a human being.

I don't think AI will ever solve this problem. It will never be more than a tool in the arsenal. Probably the best tool, but a tool nonetheless.

mpalmer | last Wednesday at 12:24 PM

I simply can't stand reading this prose, let alone bring myself to care about some vibe-code BS the author couldn't be bothered to write about themselves.

Telling Claude to turn your notes into a blog post with simple, terse language does not hide your own lack of taste.

interpol_p | last Wednesday at 3:19 AM

The example given in the article is acceptance criteria for a login/password entry flow. This is fairly easy to spec out in terms of AC and TDD.

I have been asking these tools to build other types of projects where it (seems?) much more difficult to verify without a human-in-the-loop. One example is I had asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.

I asked it to add bloom. It looped for hours, failing. I would have to verify manually, because even from images it couldn't tell what was right and what was wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into it.

Then I noticed that all of the planet textures were rotating oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is in column major, let me fix that <proceeds to break everything>." or focusing (incorrectly) on UV coordinates and shader code. Eventually Codex told me what I was seeing "was expected" and that I just "felt like it was wrong."

When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no"

I insisted that it change the culling configuration and then it worked fine.

These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes real hard to steer them without domain knowledge and close human review.

adamddev1 | last Wednesday at 9:47 AM

Tests cannot show the absence of bugs.

These are fundamentals of CS that we are forgetting as we dismantle all truth and keep rocketing forward into LLM psychosis.

> I care about this. I don't want to push slop, and I had no real answer.

The answer is to write and understand code. You can't not want to push slop, and also want to just use LLMs.

monooso | last Tuesday at 8:41 PM

I appear to be in the minority here. Perhaps because I've been practicing TDD for decades, this reads like the blog equivalent of "water is wet."

dzuc | last Tuesday at 7:44 PM

red / green / refactor is a reasonable way through this problem

keyle | last Tuesday at 11:24 PM

It's amazing the lengths to which people who want to write code will go in order to not write code.

Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).

Otherwise, what's the point?

I feel the whole industry is having its "Look ma! no hands!" moment.

Time to mature up, and stop acting like sailing is going where the seas take you.

mandeepj | last Wednesday at 3:00 AM

Now someone has to review the tests! That's just shifting ownership. Claude has just released 'Code Review', but I don't think you can leave either one on autopilot.

Code Review: https://news.ycombinator.com/item?id=47313787

emirhan_demir | last Tuesday at 10:13 PM

A short story: a developer let Claude Code manage his AWS infrastructure. The agent ran a terraform destroy command... Gone: 2 websites, the production database, all backups, and 2.5 years of data. The agent didn't make a mistake. It did exactly what it was allowed to do. That's the problem, dude.

apsdsm | last Tuesday at 10:39 PM

Do you really, honestly, have to be doing this stuff even when you sleep? To the point it hits you “wait is this even any good? Gee I don’t want to push out slop.”

If you don’t trust the agent to do it right in the first place why do you trust them to implement your tests properly? Nothing but turtles here.

petesergeant | last Wednesday at 8:01 AM

Codex is really good at checking Claude’s work: https://github.com/pjlsergeant/moarcode

jongjong | last Wednesday at 7:13 AM

I think the idea of running agents while you sleep isn't going to work until AI can match or exceed human-level agency and intelligence.

Whenever I coded any serious solution as a technical co-founder, every single day there was a major new debate about the product direction. Though we made massive 'progress' and built out a whole new universe in software, we haven't yet managed to find product-market fit. It's like constant tension. If the intelligence of two relatively intelligent humans with a ton of experience and complementary expertise isn't enough to find product-market fit after a year, that gives you an idea of how high the bar is for an AI agent.

It's like the problem was that neither I nor my domain-expert co-founder, who had been in his industry for over 15 years, had a sufficiently accurate worldview about the industry or human psychology to be able to produce a financially viable solution. Technically, it works perfectly, but it just doesn't solve anyone's problem.

So just imagine how insanely smart AI has to be to compete in the current market.

Maybe you could have 100 agents building and promoting 100 random apps per day... But my feeling is that you're going to end up spending more money on tokens and domain names than you will earn in profits. Maybe deploy them all under the same domain with different subdomains? Not great for SEO... Also, the market for all these basic low-end apps is going to be extremely competitive.

IMO, the best chance to win will be on medium and complex systems and IMO, these will need some kind of human input.

anonnon | last Wednesday at 7:05 AM

Somewhat off topic, but any theories as to why the shilling for Claude (not insinuating that's what the OP is doing) is so transparent? For example, the bots/shills often go out of their way to insist you get the $200 plan, in particular. If Anthropic's product is so good: 1) why must it be shilled so hard, and 2) why is the shilling (which is likely partially a result of the product) so obvious? Is this an OpenAI reverse psychology dirty trick, the equivalent of using robocalls to inundate voters with messages telling them to vote for your opponent so as to annoy and negatively dispose them towards your opponent?

xyzal | last Wednesday at 5:29 AM

I guess I'll just wait a year until a best practice emerges.

chaostheory | last Wednesday at 4:11 AM

Just don’t use the same model to write and vet the code. Use two or more different models to verify the code in addition to reading it yourself.

divan | last Wednesday at 10:00 AM

I'm (re)writing a big project with the following approach:

1. Write tons of documentation first, NASA style: every single known piece of information that is important to the implementation. As it's a rewrite of a legacy project, I know pretty much everything I need, so there is very little idea validation/discovery in the loop at that stage. The documentation is structured in nested folders and multiple small .md files, because the total is already larger than Claude Code's context (it still fits into Gemini's). Some of the core design documents are included in AGENTS.md (with symlinks to the CLAUDE/GEMINI .md files).

For that particular project I spent around 1.5 months writing those docs. I used Claude to help with the docs, especially based on the existing code base, but the docs are read and validated by humans, as the single source of truth. For every document I also threw Gemini and Codex at it, analyzing for weaknesses or flaws (that worked great, btw).

2. TDD in its extreme version: unit tests, integration tests, e2e, visual testing in Maestro, etc. The whole implementation process is split into multiple modules and phases, but each phase starts with writing tests first. Again, as soon as the test plan is ready, I throw it at Gemini and Codex to find flaws, missed edge cases, etc. After implementing the tests, one more time: give them to Gemini/Codex to analyze and critique.

3. Actual coding. This part is the fastest now, especially with docs and tests in place, but it's still crucial to split the work into manageable phases/chunks, validate every phase manually, and occasionally run rounds of Gemini/Codex independently verifying that the code matches the docs and doesn't contain flaws/extra duplication/etc.

I never let Claude commit to git. I review changes quickly, checking whether the structure of the code makes sense, skimming over the most important files to see if it looks good to me (i.e. no major bullshit, which, frankly, has never happened yet), and commit everything myself. Again, I try to make those phases small enough that my quick skim-review is still meaningful.

If my manual inspection/testing after each phase shows something missing/deviating, the first thing I ask is "check whether that is in our documentation". And then repeat the loop: update docs, update/add tests, implement.

The project is still in progress, but so far I'm quite happy with the process and the speed. In a way, I feel that writing documentation and TDD have always been good practice, but too expensive given that the same time could have been spent writing actual code. AI writing the code flipped that dynamic, so I'm happy to spend more time on actual architecting/debating/making choices than on finger tapping.

kypro | last Wednesday at 10:45 AM

> Teams using Claude for everyday PRs are merging 40-50 a week instead of 10

How is this even possible? Am I the only SWE who feels like the easiest part of my job is writing code, and that this was never the main bottleneck to raising PRs?

Before CC I'd probably spend around 20-30% of my day just writing code in an IDE. That's maybe 10% now. I'd probably also spend 20-30% of my day reading code and investigating issues, which is now maybe 10-15%, using CC to help with investigation and explanations.

But there's a huge part of my day, perhaps the majority of it, where I'm just thinking about technical requirements, trying to figure out the right data model and right architecture given those requirements, thinking about the UX, attending meetings, code reviews, QA, etc, etc, etc...

Are these people who are spitting out code literally doing nothing but writing code all day without any thought so now they're seeing 4-5x boosts in output?

For me it's probably made me 50% more efficient in about 40-50% of my work. So I'm probably only like 20-25% more efficient overall. And this assumes that the code I'm getting CC to produce is even comparable to my own, which in my experience it's not without significant effort which just erodes any productivity benefit from the production of code.

If your developers are raising 5x more PRs, something is seriously wrong. I suspect that's only possible if they're not thinking things through and are just getting CC to decide the requirements, come up with the architecture, decide on implementation details, write the code and test it. Presumably they're also not reviewing PRs, because if they were, and there are this many PRs being raised, then how does the team have time to spit out code all day using CC?

People who talk about 5x or 10x productivity boosts are either doing something wrong, or just building prototypes. As someone who has worked in this industry for 20 years, I literally don't understand how what some people describe can even be happening in functional SWE teams building production software.

ctdinjeu2 | last Wednesday at 4:38 PM

It’s not yet possible due to context size limitations.

LLMs can’t retain most codebases nor even most code files accurately - they start making serious mistakes at ~500 lines.

Paste a ~200 line React component or API endpoint and have it fix or add something, and it's fine; but paste a huge file and it starts omitting pieces, making mistakes, and it gets worse as time goes on.

You have to keep reminding it by repeatedly refreshing context with the part in question.

Everyone who has seriously tried knows this.

For this reason alone the LLM “agent” is simply not one. Not yet. It cannot really drive itself and it’s a fundamental limitation of the technology.

Someone who knows more about model architecture might be able to chime in on why increasing the context size will/won’t help agents retain a larger working memory to acceptable degrees of accuracy, but as it stands it’s so limited that it works more like a calculator that you must actively use rather than an autonomous agent.

jc-myths | last Wednesday at 1:04 AM

Solo founder here, shipping a real product built mostly with AI. The code review thing is real but my actual daily pain is different. AI lies about being done. It'll say "implemented" and what it actually did is add a placeholder with a TODO comment. Or it silently adds a fallback path that returns hardcoded data when the real API fails, and now your app "works" but nothing is real.

I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.

The 80% it writes is fine. The problem is you still have to verify 100% of it.
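
One cheap partial guard against the "implemented, but actually a TODO" failure mode described above (my own sketch, a heuristic only, not a substitute for review): scan the agent's diff for placeholder tells before accepting it.

```python
import re

# Patterns that often betray fake "done" work. Tune for your own codebase.
PLACEHOLDER_TELLS = [
    r"\bTODO\b",
    r"\bFIXME\b",
    r"placeholder",
    r"hardcoded",
    r"not implemented",
]

def suspicious_lines(diff_text):
    """Return added lines that look like placeholders instead of real work."""
    tells = [re.compile(p, re.IGNORECASE) for p in PLACEHOLDER_TELLS]
    hits = []
    for line in diff_text.splitlines():
        if line.startswith("+") and any(t.search(line) for t in tells):
            hits.append(line)
    return hits
```

It won't catch silent hardcoded-data fallbacks that aren't labeled as such, but it does turn the most blatant "placeholder with a TODO comment" cases into an automatic rejection instead of a surprise.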
