It works because there's a difference between the model knowing something and the agent doing something. Claude will happily write giant untested functions even though it "knows" that short functions are easier to understand, that testing enables safe refactoring, and so on. The model also "knows" many conflicting "facts", such as that testing is smart and that testing is a waste of time. It can't act on both beliefs at the same time. That's why nudging it toward your own preferred behaviors works.
Isn't all of what you described what post-training/RLHF is supposed to do? The internet is full of racism, so if you're just predicting the next token based on training data, you'll get racism (e.g., Microsoft Tay), but that's more or less solved by AI companies now.
It lacks a critical self, but the weights are there for in-context learning and nudging behaviour. Its goal is to complete whatever task it is given, so you need to make sure the outcome you want is clearly defined.
You don't need elaborate prompts, just a few lines.
"All code must have corresponding tests written ahead of time to prove the code meets the specification" is sufficient for most use cases. Prose can help nudge it more if it isn't adhering consistently.
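
For what it's worth, here is a minimal sketch of how that could be kept as a short, reusable instruction block. The helper name `build_instructions` and the extra nudge sentence are my own assumptions for illustration, not part of any tool's API; only the base rule comes from the line above.

```python
# Hypothetical helper: the function name and the extra nudge text are
# illustrative assumptions, not taken from any real tool or library.
BASE_RULE = (
    "All code must have corresponding tests written ahead of time "
    "to prove the code meets the specification"
)


def build_instructions(extra_nudges=()):
    """Return the instruction text: the one-line rule plus optional prose nudges."""
    lines = [BASE_RULE, *extra_nudges]
    # One bullet per rule keeps the prompt short and easy to scan.
    return "\n".join(f"- {line}" for line in lines)


if __name__ == "__main__":
    # Start with the single rule on its own.
    print(build_instructions())
    # Add prose only if the agent isn't adhering consistently.
    print(build_instructions(["Write the failing test first, then the implementation."]))
```

The point of the design is that you start from one line and only append more prose when you see the agent drifting, rather than front-loading a long prompt.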