> They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.
"Plausible" sounds like the right word to me. (It would be a mistake to digress into these features of LLMs in an article where it isn't needed.)
I agree - I took "plausible" here to mean plausible-looking, no different from similar-looking.
The trouble, of course, is that similar/plausible isn't good enough unless the LLM has seen enough similar-but-different training samples to refine its notion of similarity to the point where it captures the differences that are critical in a given case.
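To make that concrete, here's a minimal sketch (hypothetical snippets, not taken from any model's actual output) of two functions that are surface-similar but diverge on exactly the edge case that matters:

```python
def overlaps_v1(a_start, a_end, b_start, b_end):
    # Treats intervals as closed: [a_start, a_end] and [b_start, b_end].
    # Intervals that merely touch at an endpoint count as overlapping.
    return a_start <= b_end and b_start <= a_end

def overlaps_v2(a_start, a_end, b_start, b_end):
    # Treats intervals as half-open: [a_start, a_end) and [b_start, b_end).
    # Intervals that touch at an endpoint do NOT overlap.
    return a_start < b_end and b_start < a_end

# Both look "plausible" for, say, a calendar-booking task, and both
# resemble countless training examples; which one is correct depends
# entirely on a convention the prompt may never have stated.
print(overlaps_v1(0, 10, 10, 20))  # True  -> double-booked under closed intervals
print(overlaps_v2(0, 10, 10, 20))  # False -> fine under half-open intervals
```

A notion of similarity that can't distinguish `<` from `<=` in this context will happily pick either one.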
I'd rather just characterize it as a lack of reasoning, since "add more data" can't be the solution to a world of infinite variety. You can keep playing whack-a-mole, adding more data to fix each failure, and I suppose it's an interesting experiment to see how far that gets you, but in the end the LLM will always be brittle and susceptible to stupid failure cases if it doesn't have the reasoning capability to fully analyze problems it wasn't trained on.