It doesn't depend on the language at all; it's a failure mode of the model itself. English, Chinese, Spanish, C++, COBOL, base64-encoded Klingon, SVGs of pelicans on bikes, emoji-ridden zoomer speak: everything is affected, and each has its own specific -isms and stereotypes. Besides, outputs are also skewed towards the pretraining-set distribution. For example, Russian generated by some models has unnatural-sounding constructions learned from English, which dominates the dataset and where those constructions are common, e.g. "(character) is/does X, their Y is/does Z". I don't see why it should be any different for programming languages, e.g. JS idioms subtly leaking into Rust, though I suppose that's harder to detect.