You may be able to effectively externalize taste by "hot or not" style pair testing. With enough comparisons, I'd expect ML to be able to mimic human taste by latching onto features that influence us without our being aware of them.
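To make the pair-testing idea concrete, here is a minimal sketch of a Bradley-Terry style preference model fit on "A beats B" comparisons. Everything in it is an assumption for illustration: the three-number feature vectors, the `hidden_taste` weights standing in for a person's taste, and the function names `simulate_comparisons` and `train_pairwise`. It is not how any particular product does this.

```python
import math
import random

def simulate_comparisons(hidden_taste, n_pairs, seed=0):
    """Generate (winner, loser) pairs from a hypothetical hidden taste vector."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in hidden_taste]
        b = [rng.uniform(-1, 1) for _ in hidden_taste]
        score_a = sum(t * x for t, x in zip(hidden_taste, a))
        score_b = sum(t * x for t, x in zip(hidden_taste, b))
        pairs.append((a, b) if score_a >= score_b else (b, a))
    return pairs

def train_pairwise(comparisons, n_features, lr=0.1, epochs=50):
    """Fit weights w so that sigmoid(w . (winner - loser)) is close to 1."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in comparisons:
            diff = [x - y for x, y in zip(winner, loser)]
            score = sum(wi * di for wi, di in zip(w, diff))
            p_correct = 1.0 / (1.0 + math.exp(-score))
            # Gradient ascent on the log-likelihood of the observed choice.
            for i in range(n_features):
                w[i] += lr * (1.0 - p_correct) * diff[i]
    return w

# Assumed taste: likes feature 0, dislikes feature 1, indifferent to feature 2.
hidden_taste = [2.0, -1.0, 0.0]
learned = train_pairwise(simulate_comparisons(hidden_taste, 500), n_features=3)
print([round(x, 2) for x in learned])
```

The rater never states the weights; the model only sees which item won each pair, and the learned weights should still point in roughly the same direction as the hidden ones.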
Wouldn't this style of training suffer from the AI learning things the user didn't intend? I may thumbs-down something for one specific detail I don't like, while everything else in it is great. Traits that tend to occur together then get penalized together. We see similar things happen in natural selection, where mates may be chosen for one specific feature, and other less desirable traits come along for the ride.
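A toy sketch of that failure mode, with made-up data: the reviewer here only dislikes `global_state`, but `long_function` usually appears alongside it, so a naive per-feature approval rate blames both. The feature names and the `approval_by_feature` helper are hypothetical, chosen only to illustrate the point.

```python
from collections import defaultdict

def approval_by_feature(samples):
    """Fraction of thumbs-up among the samples that contain each feature."""
    ups = defaultdict(int)
    total = defaultdict(int)
    for features, thumbs_up in samples:
        for feature in features:
            total[feature] += 1
            ups[feature] += int(thumbs_up)
    return {feature: ups[feature] / total[feature] for feature in total}

# Assumed ground truth: the reviewer thumbs-down anything with "global_state"
# and is fine with everything else, including "long_function".
samples = [
    ({"global_state", "long_function"}, False),
    ({"global_state", "long_function"}, False),
    ({"global_state", "long_function"}, False),
    ({"long_function"}, True),
    ({"clear_names", "global_state"}, False),
    ({"clear_names"}, True),
    ({"clear_names"}, True),
]

rates = approval_by_feature(samples)
print(rates["global_state"])   # 0.0
print(rates["long_function"])  # 0.25
```

From the thumbs alone, `long_function` looks nearly as disliked as `global_state`, even though the reviewer never objected to it.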
Outside of AI, I run into this issue when taking basic personality tests. A question may be written for a specific reason, which influences the results, but the reason for my answer may be completely unrelated to the reason intended by the person who made the test.
This is RL, right? Like, this is exactly why models have mostly converged on the same obvious style, because we train them literally on thumbs-up/thumbs-down data of what good behavior and good code look like.
And that's why it's so hard to get a model to reproduce the specific taste of a person or an organization. My taste is different from yours, so if we dump our aggregate preferences into RL, it averages out to nothing interesting.
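A back-of-the-envelope sketch of the averaging effect, assuming two hypothetical reviewers whose preferences are scored as +1 (like) or -1 (dislike) per style feature. The names and features are invented for illustration, and real RL training is far less direct than taking a mean.

```python
# Hypothetical per-feature preferences: +1 means "thumbs up", -1 "thumbs down".
alice = {"terse_names": +1, "early_returns": +1, "heavy_comments": -1, "tests": +1}
bob = {"terse_names": -1, "early_returns": +1, "heavy_comments": +1, "tests": +1}

def pooled_preference(reviewers):
    """Average each feature's score across all reviewers."""
    features = reviewers[0].keys()
    return {f: sum(r[f] for r in reviewers) / len(reviewers) for f in features}

pooled = pooled_preference([alice, bob])
print(pooled)
# {'terse_names': 0.0, 'early_returns': 1.0, 'heavy_comments': 0.0, 'tests': 1.0}
```

The features where the two disagree cancel to zero, and only the ones everybody already agrees on keep any signal, which is the "obvious style" the pooled model ends up with.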
For the code-writing case, this means you end up reviewing every line of code, looking for places where you'd thumbs-down the code. Not every line of code contains a real decision, though, so it feels like a waste of time.