It doesn't depend on the language at all; it's a failure mode of the model itself. English, Chinese, Spanish, C++, COBOL, base64-encoded Klingon, SVGs of pelicans on bikes, emoji-ridden zoomer speak: everything is affected, and each has its own specific -isms and stereotypes. Besides, outputs are also skewed towards the pretraining-set distribution. For example, Russian generated by some models has unnatural-sounding constructions learned from English, which dominates the dataset and where those constructions are common, e.g. "(character) is/does X, their Y is/does Z". I don't see why it should be any different for programming languages, e.g. JS idioms subtly leaking into Rust, though I suppose that's harder to detect.