No developer writes the same prompt twice. How can you be sure something has changed?

cactusplant7374 • yesterday at 11:03 AM • 4 replies

Replies

I regularly run the same prompts twice and through different models, particularly when making changes to agent metadata like agent files or skills.

At least weekly I run a set of prompts to compare codex/claude against each other. This is quite easy: the prompt sessions are just saved text files.

The problem is doing it often enough for statistical significance, and judging whether the output is better or not.
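To put a rough number on "enough for statistical significance", here is a minimal sketch of an exact two-sided sign test over paired pass/fail results. It assumes each prompt has already been judged pass or fail for both models (the judging itself is the hard part and is not solved here). The function names `sign_test_p_value` and `compare_runs` are hypothetical, not part of any tool mentioned in this thread.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test on prompts where the two models disagreed.

    Prompts where both models passed or both failed (ties) carry no
    information about which one is better, so they are excluded upstream.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no disagreements, no evidence either way
    k = min(wins_a, wins_b)
    # Chance of a split at least this lopsided if both models were equally good.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_runs(results_a: list[bool], results_b: list[bool]):
    """Compare per-prompt pass/fail lists (same prompts, same order)."""
    if len(results_a) != len(results_b):
        raise ValueError("both models must be run on the same prompt set")
    wins_a = sum(a and not b for a, b in zip(results_a, results_b))
    wins_b = sum(b and not a for a, b in zip(results_a, results_b))
    return wins_a, wins_b, sign_test_p_value(wins_a, wins_b)
```

With this test, a 5-0 split on the prompts where the models disagree gives p = 0.0625, which does not clear the usual 0.05 bar, while a 9-1 split gives p ≈ 0.021. A handful of weekly prompts is unlikely to settle the question unless the difference is large.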

andreagrandi • yesterday at 1:23 PM

I suspect you may not be writing code regularly... If I have to ask Claude the same thing three times and it keeps saying "You are right, now I've implemented it!" while the code is still missing 1 out of 3 things or worse, then I can definitely say the model has become worse (since this wasn't happening before).

SkyPuncher • yesterday at 1:28 PM

I use Claude daily (both professionally and personally with a Max subscription), and there are things that it does differently between 4.5 and 4.6. It's hard to point to any single conversation, but in aggregate I'm finding that certain tasks don't go as smoothly as they used to. In my view, Opus 4.6 is a lot better at long-running conversations (which has value), but does worse with critical details within smaller conversations.

A few things I've noticed:

* 4.6 doesn't look at certain files that it used to

* 4.6 tends to jump into writing code before it's fully understood the problem (annoying but promptable)

* 4.6 is less likely to do research, write to artifacts, or make external tool calls unless you specifically ask it to

* 4.6 is much more likely to ask annoying (blocking) questions that it can reasonably figure out on its own

* 4.6 is much more likely to miss a critical detail in a planning document after being explicitly told to plan for that detail

* 4.6 needs to more proactively write its memories to file within a conversation to avoid going off track

* 4.6 is a lot worse about demonstrating critical details. I'm so tired of it explaining something conceptually without thinking through how it would implement the details.

baq • yesterday at 12:32 PM

Ralph Wiggum would like a word
