This is why I blind A/B test everything.
I burn a ton of tokens, but things actually have to prove their value. And the vast majority of things do not come close to doing so.
I have my own AI agent full of stuff, all of it blind A/B tested, but I don't think the results are all that useful as a signal to others.
Just because I blind A/B tested something 4 months ago doesn't mean the result is still meaningful today.
Maybe my own word choices dramatically affect the results.
I do it because I can prove the value and see it with my own eyes. I don't even bother publishing the specific blind A/B tests.
Also, I've seen other people try to blind A/B test and get it very wrong. If your measurements aren't good, the test is meaningless.
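To make "blind" concrete, here is a minimal sketch of one way such a comparison could be set up. It is illustrative only, not the harness described in this post. The function names (`blind_pairs`, `tally`, `sign_test_p`) are hypothetical, and it assumes the rater (a person or a judge model) sees only "left" and "right" outputs and never learns which variant produced which until rating is finished.

```python
import random
from math import comb

def blind_pairs(outputs_a, outputs_b, rng):
    """Randomize the side of each (A, B) pair so the rater cannot tell which is which."""
    trials = []
    for a, b in zip(outputs_a, outputs_b):
        a_is_left = rng.random() < 0.5
        left, right = (a, b) if a_is_left else (b, a)
        # a_is_left is the hidden key; show the rater only left/right.
        trials.append({"left": left, "right": right, "a_is_left": a_is_left})
    return trials

def tally(trials, picks):
    """Unblind after rating: picks[i] is 'left' or 'right'. Returns (wins_a, wins_b)."""
    wins_a = sum((p == "left") == t["a_is_left"] for t, p in zip(trials, picks))
    return wins_a, len(trials) - wins_a

def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test: chance of a split this lopsided if A and B were equal."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 10 rated pairs, a 9 to 1 split gives a p-value of about 0.02, while a 6 to 4 split gives about 0.75, which is indistinguishable from noise. A sketch like this only covers the blinding and the counting; it does nothing about a bad rubric or a biased judge, which is where the measurement usually goes wrong.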
I don't know. We're all working on these problems together. There's a lot of black magic (which is why I rely on hooks a lot). I'm sure I have tons of black magic too; I have a large little AI agent.
But what I know for certain is that it works for me. All it takes is for me to go without it, and I honestly don't know how everyone else currently works with AI.
I will link it, but that is not a recommendation that you use it. Mostly only other software engineers use it, and it's so very specific to the things I have to do.
At best, maybe it sparks an idea for you to implement on your own.